---
title: "Introduction to R Part I"
author: "Marcos Sanches"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
---
```{r setup, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print=100)
opts_chunk$set(echo=TRUE,
               cache=TRUE,
               prompt=FALSE,
               tidy=TRUE,
               comment=NA,
               message=FALSE,
               warning=FALSE)
opts_knit$set(width=75)
```
# Objective
Last week we worked a bit with data. This week we will work more with datasets. The goal is for you to become more comfortable with base R and with handling datasets. We will also introduce the concept of R packages.
# A new RStudio project
Let's first review the steps to create a new RStudio project.
1. Create a folder for each data analysis project.
2. Open RStudio.
3. Go to File > New Project > Existing Directory.
4. Navigate to the folder you created.
5. Click Create Project. A “…Rproj” file will be created in the folder. Next time, double-click this file to open your project.
6. Go to File > New File > R Markdown (or R Script) and click OK.
7. Save the new R Markdown file. This is where you will store your R script.
8. Make sure you save all your scripts. Ideally, your script should contain every single step, from reading the data to the final results.
So, at this point we assume you have saved all the workshop files in a folder (maybe named Introduction to R Part I), and we will create a new project there for today's workshop.
# Cleaning R environment
We can probably say that the best script in terms of reproducibility is the one that takes you from reading the data all the way to the final results, without the need to do anything manually or outside R.
If our script does that, we can always start with a clean environment, that is, ignoring everything we did before, because we can redo everything just by running the script.
```{r}
rm(list = ls())
```
# Reading Data into R
R is very flexible and can read many data formats. RStudio will help you with that, using the necessary packages. In this workshop we will show you how to do it through the RStudio menus and also through an R script.
One thing to notice is that if you have an RStudio project in the same folder as your datasets, you don't need to specify file paths to read your data. By default, R will read from and save to that folder.
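As a quick check of which folder R is using by default, the following minimal sketch relies only on base R functions. The object names `project_dir` and `csv_files` are made up for this illustration.

```{r}
# Folder that R currently reads from and writes to
# (the project folder when the project was opened through its .Rproj file)
project_dir <- getwd()
project_dir

# CSV files found in that folder; these can be read by file name alone
csv_files <- list.files(pattern = "\\.csv$")
csv_files

# The file name alone and this full path refer to the same file
file.path(project_dir, "Self Esteem Dataset.csv")
```

If the dataset you expect is not listed, the project was probably opened from a different folder.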
## Reading CSV data
We will first read the **Self Esteem Dataset.csv** dataset into R. This is a very common file type, which is just regular text where variables are separated from each other by a comma (CSV stands for Comma Separated Values). R has a base function to read CSV datasets (base means that no special package is needed; the function is available as soon as you open R). Let's open the data and quickly inspect it using the **head** function.
Notice that we import the data and save it in an object called "data.csv", which we can then use in other functions whenever we want.
```{r}
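# Read the CSV file from the project folder
data.csv <- read.csv("Self Esteem Dataset.csv")
# Show the first 20 rows
head(data.csv, n=20)
# List the variable names
names(data.csv)
# Rename the first variable to "id"
names(data.csv)[1] <- "id"
head(data.csv)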
```
The function *read.csv* has several arguments that may be useful. We can use the R help to take a look at them.
```{r}
# ? read.csv
```
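To give an idea of what these arguments do, here is a small sketch that reads CSV text typed directly into the script through the `text` argument, so it does not depend on any file. The data and the missing-value code -99 are invented for the illustration; `na.strings` tells R which entries to treat as missing.

```{r}
# Hypothetical CSV content, one string per line
csv_text <- paste("id,age,country",
                  "1,34,CA",
                  "2,-99,US",
                  "3,,BR",
                  sep = "\n")

# Treat empty fields, "NA" and -99 as missing values
toy_csv <- read.csv(text = csv_text, na.strings = c("", "NA", "-99"))
toy_csv

# Number of missing values per variable
colSums(is.na(toy_csv))
```

The same arguments can be used when reading a file, by replacing `text = csv_text` with the file name.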
## R Packages
We will next read an Excel dataset, and in order to do that we will need a specific R package.
You can think of R packages as collections of R scripts that people who are good at programming put together to do things that base R does not do.
In order to use a package you need to:
1. Install it.
2. Load it.
3. Use it.
Here is an example. We will install the package *readxl* then load it.
```{r}
# install.packages("readxl")
library(readxl)
```
That is it. Now we can use the function *read_excel* from the package *readxl*.
**Note** - Once you install a package, you usually will not need to install it again until you change your version of R. When you load a package, its functions will be available until you close your RStudio session. Next time you open RStudio you will need to load the packages you need again, but you will not need to install them again.
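The difference between installing and loading can be sketched with the base function `requireNamespace`, which only checks whether a package is installed. The object names `pkg` and `is_installed` are made up for this example, and the installation itself is left as a manual step.

```{r}
pkg <- "readxl"

# TRUE if the package is installed; this does not load it
is_installed <- requireNamespace(pkg, quietly = TRUE)
is_installed

if (is_installed) {
  # Load the package so that its functions can be called directly
  library(pkg, character.only = TRUE)
} else {
  # Nothing is installed automatically here
  message("Run install.packages(\"", pkg, "\") once, then load the package.")
}
```

A function from an installed package can also be called without loading the package, using the `package::function` form, for example `readxl::read_excel`.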
## Read Excel Data
You can do this directly via the RStudio menu. However, it is always important to keep the code in your script for reproducibility.
```{r}
data.xl <- read_excel("Self Esteem Dataset.xlsx")
head(data.xl)
# help(read_excel)
```
## Read SPSS Data
As with Excel, to read an SPSS dataset we need a special package, called *haven*. If it is not installed, you will need to install it first.
```{r}
#install.packages("haven")
library(haven)
data.spss <- read_sav("Self Esteem Dataset.sav")
head(data.spss)
```
The *haven* package keeps the variable and value labels from SPSS, which can be useful.
```{r}
attributes(data.spss$Q1)
```
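One way to make use of those labels is to convert a labelled variable into a factor with the *haven* function `as_factor`. The sketch below builds a small labelled variable by hand with `labelled`, so it runs without the SPSS file; the values and labels are hypothetical and are not the ones in the workshop dataset.

```{r}
library(haven)

# Hypothetical labelled variable, mimicking an item imported from SPSS
q_item <- labelled(c(1, 2, 2, 3),
                   labels = c("Disagree" = 1, "Neutral" = 2, "Agree" = 3),
                   label = "Hypothetical questionnaire item")

# The labels are stored as attributes of the variable
attributes(q_item)

# as_factor() replaces the numeric codes with their value labels
q_factor <- as_factor(q_item)
table(q_factor)
```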
# Dataframes
Objects in R have types, which we can access using the function *class*. The *data.frame* type will be very important for us, because this is the type of object that datasets are stored in.
In general, though, the type of an object is always important. A variable within a dataframe, a regression output, or any other object in R has a specific type, which defines how it can be used. For example, the function *residuals* can be applied to an object of type *lm* (linear model, regression), but not to an object of type *data.frame*.
By clicking the data.frame object in the RStudio environment, you can visualize it in a spreadsheet format.
```{r}
class(data.csv)
class(data.xl)
class(data.spss)
# Functions in R that have a specific version (method) for data.frame objects.
methods(class = "data.frame")
```
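As a small illustration of how the class of an object determines what can be done with it, the sketch below fits a regression to `cars`, a dataset that comes with R and is not part of the workshop files. The object name `fit` is chosen just for this example.

```{r}
# cars is a data.frame with the variables speed and dist
class(cars)

# Fitting a linear model creates an object of class lm
fit <- lm(dist ~ speed, data = cars)
class(fit)

# residuals() knows how to extract the residuals from an lm object:
# one residual per row of the data
head(residuals(fit))
length(residuals(fit))
```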
## Displaying the data
As mentioned above, clicking the data.frame object in the RStudio environment opens it in a spreadsheet format. If you are just learning R, this is a very useful tool. R also has some other helpful functions for displaying data.
```{r}
data.csv
head(data.csv, n = 20)
tail(data.csv, n = 20)
nrow(data.csv)
ncol(data.csv)
dim(data.csv)
```
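Two other base functions that are often used for a first look at a dataset are `str` and `summary`. The sketch below applies them to a small made-up data frame (the object `toy` and its values are invented for this example) so that the output is short.

```{r}
# Small hypothetical dataset
toy <- data.frame(id = 1:4,
                  Age = c(23, 35, NA, 51),
                  country = c("CA", "US", "CA", "BR"))

# Structure: number of rows and columns, and the type of each variable
str(toy)

# Summary statistics for each variable, including the number of NAs
summary(toy)
```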
## Variables
Variables are the columns of a dataset, and each one is identified by its name. They are important because they are the targets of all the analyses we do; we will use variables from datasets for everything.
To refer to a variable in R you can use the **$** operator, its position, or its name.
```{r}
# list all variables
names(data.csv)
# Extracting variable Q4 using $ operator
data.csv$Q4
# Extracting variable Q4 using its position
data.csv[5]
# Extracting variable Q4 using its name
data.csv["Q4"]
# Variables Q4 and Q10
data.csv[c("Q4","Q10")]
data.csv[c(5,11)]
# Consecutive variables (positions 3 to 7)
data.csv[3:7]
```
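The different ways of extracting a variable do not return the same type of object, which matters when the result is passed to another function. A minimal sketch with a made-up data frame (the object names are chosen just for this illustration):

```{r}
# Small hypothetical dataset
toy <- data.frame(id = 1:3, Q4 = c(2, 4, 1), Q10 = c(3, 3, 2))

# The $ operator returns the values as a vector
q4_vector <- toy$Q4
class(q4_vector)

# Single brackets return a data.frame with one column
q4_frame <- toy["Q4"]
class(q4_frame)

# Double brackets behave like $
identical(toy[["Q4"]], toy$Q4)

# Functions such as mean() expect the vector
mean(q4_vector)
```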
## Values
We can apply what we learned above to access values.
```{r}
# First to tenth values of the variable country.
data.csv$country[1:10]
# To use the position of the variable, we need to work with two dimensions: rows 1 to 10 and column 15.
data.csv[1:10,15]
data.csv["country"][1:10,1]
data.csv["country"][1:10,]
# Rows 5 to 7, all columns
data.csv[5:7,]
# Columns 3 to 10, all rows
data.csv[,3:10]
```
## Missing values
Missing values are coded as **NA** in R. The function *is.na* is useful for handling them. Different packages and statistical analyses offer different functions for handling missing values. Many modelling functions simply drop cases with missing values from the analysis, while summary functions such as *mean* return NA unless you ask them to remove the missing values. Sometimes R will instead display an error, indicating that it expects you to handle the missing data explicitly before doing the analysis.
```{r}
# There is a missing value in Q4
data.csv$Q4
# is.na function indicates where there are NAs. More on this function below.
is.na(data.csv$Q4)
```
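To see how a missing value affects a simple calculation, here is a short sketch with a made-up vector (the name `q4` and its values are invented for the example):

```{r}
# Hypothetical answers with one missing value
q4 <- c(3, 1, NA, 4, 2)

# How many values are missing, and where
sum(is.na(q4))
which(is.na(q4))

# The mean is NA when there is a missing value
mean(q4)

# na.rm = TRUE removes the missing values before calculating
mean(q4, na.rm = TRUE)
```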
## Splitting the data
You can use all of the above to split the data. We are basically doing the same thing and storing the results in a new object. Here we will also show you some examples of conditional statements, which are common when we want to split the data.
```{r}
# New dataset composed of only the first 20 rows of data.
dt <- data.csv[1:20,]
nrow(dt)
# Or we can select only some columns
dt <- data.csv[,1:11]
dt
# Conditional Statement - Individuals younger than 40 years old.
dt <- data.csv[data.csv$Age < 40 & !is.na(data.csv$Age),]
dt
dt$Age
# Conditional Statement - Males (Gender is coded 1 for males, as in the next example)
dt <- data.csv[data.csv$Gender == 1,]
dt
# Conditional Statement - Males in Canada
dt <- data.csv[data.csv$Gender == 1 & data.csv$country == "CA",]
dt
```
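The `!is.na()` part of the age condition matters: when the condition is NA for a row, R returns a row filled with NAs instead of dropping it. The sketch below shows this with a made-up data frame, and two alternatives that keep only the rows where the condition is TRUE, `which` and `subset`. The object names are invented for this illustration.

```{r}
# Small hypothetical dataset with a missing age
toy <- data.frame(id = 1:5,
                  Age = c(25, NA, 45, 38, 61),
                  Gender = c(1, 2, 1, NA, 2))

# The condition is NA for the second row, which comes back as a row of NAs
under40_raw <- toy[toy$Age < 40, ]
under40_raw

# which() keeps only the rows where the condition is TRUE
under40 <- toy[which(toy$Age < 40), ]
under40

# subset() also drops the rows where the condition is NA
under40_b <- subset(toy, Age < 40)
under40_b
```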
## Variable Type
Each variable in a dataset has a type (accessed with the function *class*) that tells you what you can do with that variable. The most common types are **numeric, character and factor**.
1. Numeric variables are numbers that can be used in calculations and in statistical analysis as continuous variables.
2. Character variables hold text of any kind, including digits, and in general cannot be used in statistical calculations.
3. Factor variables are categorical variables, usually with a small number of levels, and usually not represented by numbers.
Note 1 - If you give R the wrong type (for example, a character variable to be used in a regression model), R may try to convert it into a type it can use (here, it will try to convert the character into a factor). R does that in many situations, in an attempt to help you and run what you asked for rather than displaying an error message.
Note 2 - It is very important in R to understand the difference between numeric and factor types for regression analysis in general. We will see more of this when we get to the regression workshop.
```{r}
class(data.csv$Q4)
class(data.csv$country)
```
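A common situation is a variable that contains numbers but was stored as character. A minimal sketch of checking and converting the type, using a made-up vector:

```{r}
# Hypothetical scores stored as text
score_chr <- c("10", "25", "7")
class(score_chr)

# sum(score_chr) would give an error, because the values are text.
# as.numeric() converts them into numbers
score_num <- as.numeric(score_chr)
class(score_num)
sum(score_num)
```

Text that cannot be read as a number becomes NA in the conversion, with a warning.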
## Creating/Changing Variables
You can create new variables and change existing variables by using the operators above to access the variables. We will show some examples.
In a later workshop we will learn how this can be done using the package *dplyr*, but even if you end up using that package, it is still important to understand how to do it in base R.
```{r}
# A new variable that is the sum of Q1 and Q2
data.csv$sumQ1.Q2 <- data.csv$Q1 + data.csv$Q2
names(data.csv)
data.csv
```
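When a score involves more than two items, the base functions `rowSums` and `rowMeans` avoid typing every variable. The sketch below uses a made-up data frame with three hypothetical items; note how a missing answer affects each version.

```{r}
# Small hypothetical dataset with one missing answer
toy <- data.frame(Q1 = c(1, 3, 2),
                  Q2 = c(4, NA, 2),
                  Q3 = c(2, 2, 5))
items <- c("Q1", "Q2", "Q3")

# Sum of the three items: NA if any item is missing
toy$total <- rowSums(toy[items])

# Mean of the items that were answered
toy$mean_answered <- rowMeans(toy[items], na.rm = TRUE)
toy
```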
An important function is *factor*. It is common for a variable to be imported into R as numeric when we want to treat it as a factor. For example, the variables gender and source should be factors but are numeric.
```{r}
class(data.csv$Gender)
data.csv$gd <- factor(data.csv$Gender, levels = c(2,1), labels = c("F","M"))
table(data.csv$Gender,data.csv$gd)
```
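To see what *factor* does with the codes, here is the same call applied to a short made-up vector. The order given in `levels` defines the order of the categories, and the factor stores its own internal codes, which are not the original numbers.

```{r}
# Hypothetical gender codes: 1 = male, 2 = female
gender_code <- c(1, 2, 2, 1, 2)
gender <- factor(gender_code, levels = c(2, 1), labels = c("F", "M"))
gender

# The categories, in the order given in levels
levels(gender)

# Internal codes: position of each value in levels(gender),
# so "F" is 1 and "M" is 2
as.integer(gender)

table(gender)
```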
Another useful function is *cut*. It is useful when you want to categorize variables, like when you want to transform age into age_groups.
```{r}
data.csv$age_group <- cut(data.csv$Age,
                          breaks = c(14,20,30,40,50,120),
                          labels = c("15 to 20","21 to 30","31 to 40","41 to 50","51 +"))
# We want to see the entire table.
options(max.print = 1000)
# Cross-tabulate age and age group to check the categories.
tb <- table(data.csv$Age, data.csv$age_group)
tb
```
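When choosing the breaks, it helps to know which group a value that falls exactly on a break goes to. Leaving out the `labels` argument makes *cut* show the intervals it is using, as in this sketch with a few made-up ages:

```{r}
# Hypothetical ages placed at the limits of the groups
ages <- c(15, 20, 21, 30, 31, 50, 51)

# Without labels, the levels show the intervals.
# By default each interval includes its upper limit but not its lower one
age_group <- cut(ages, breaks = c(14, 20, 30, 40, 50, 120))
levels(age_group)

# Age 20 falls in the first group and age 21 in the second
table(ages, age_group)
```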
## Sample
The *sample* function is often quite useful. For example, maybe you have a dataset that is too large and you want to run some analysis on a small sample of it so that it does not take too long. It is also used in randomization.
```{r}
# Sampling 3 numbers from the sequence 1, 2, 3, ..., 100
sample(100,3)
# Sampling 3 row numbers of a dataset.
sample(nrow(data.csv), 3)
# Sampling and selecting 3 rows
dt <- data.csv[sample(nrow(data.csv),3),]
dt
# Reproducible sampling - setting a seed.
set.seed(111)
sample(nrow(data.csv), 3)
# If we set the same seed and sample again, we get the same sample. Setting the seed is what makes the sample reproducible.
set.seed(111)
sample(nrow(data.csv), 3)
```
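As an illustration of the randomization use mentioned above, the sketch below randomly assigns half of ten hypothetical subjects to a treatment group. The seed value, the object names and the group names are arbitrary choices made for this example.

```{r}
# Setting a seed makes the allocation reproducible
set.seed(2024)

# Ten hypothetical subject ids
ids <- 1:10

# Randomly pick 5 of them for the treatment group
treated <- sample(ids, size = 5)

# Everyone else goes to the control group
group <- ifelse(ids %in% treated, "treatment", "control")
data.frame(id = ids, group = group)
table(group)
```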
## Complete cases
The function *complete.cases* flags subjects that have no missing values. We can use it to remove subjects with a missing value in any variable, that is, to do listwise deletion of missing values.
```{r}
options(max.print = 50000)
# whether each row is a complete case or not.
complete.cases(data.csv)
# selecting the TRUE lines.
dt <- data.csv[complete.cases(data.csv),]
# The original data has missing values.
data.csv[is.na(data.csv$Q4),]
# but not the new data.
dt[is.na(dt$Q4),]
```
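Listwise deletion over all variables can remove more rows than expected, because a missing value in any column, even one that is not used in the analysis, drops the row. The sketch below, with a made-up data frame, restricts *complete.cases* to the variables of interest.

```{r}
# Small hypothetical dataset; notes is a variable we do not analyse
toy <- data.frame(id = 1:4,
                  Q1 = c(3, NA, 2, 5),
                  Q2 = c(1, 4, 2, NA),
                  notes = c(NA, "a", NA, "b"))

# Every row has a missing value somewhere, so nothing is left
complete_all <- toy[complete.cases(toy), ]
nrow(complete_all)

# Considering only Q1 and Q2 keeps the rows complete in those variables
complete_q <- toy[complete.cases(toy[c("Q1", "Q2")]), ]
complete_q
```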
# Saving data and objects
Datasets can be saved in CSV format. Specific packages may allow you to save in other formats.
The *saveRDS* and *readRDS* functions are useful to save and read any type of object, not only dataframes.
The *save.image* function saves everything in your environment in an .RData file.
```{r}
# Datasets can be saved easily in CSV format.
write.csv(data.csv,"Data after analysis in R.csv")
# A more general function is saveRDS, which can save any object.
# For example, let's create a list with our datasets.
list1 <- list(data.csv, data.spss, data.xl)
# Now we can save the object list1, which is a list of datasets.
saveRDS(list1, "List with 3 datasets.RDS")
# Once saved, we read it again.
list1.a <- readRDS("List with 3 datasets.RDS")
# here we save the entire environment.
save.image(file = "My R Environment.RData")
# Now that the environment is saved, we can delete everything.
rm(list = ls())
# And read it again.
load("My R Environment.RData")
```
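One practical difference between the two formats is that an RDS file keeps the object exactly as it was, including variable types such as factors, while a CSV file stores only text. A minimal sketch, writing to temporary files so that nothing is added to the project folder (the object and file names are made up for this example):

```{r}
# Small hypothetical dataset with a factor variable
toy <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

# Save to a temporary RDS file and read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, rds_path)
toy_rds <- readRDS(rds_path)

# The object read back is identical to the original
identical(toy, toy_rds)
class(toy_rds$group)

# Saving as CSV and reading back: the factor information is not stored,
# so the type of group depends on how read.csv imports text
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_csv <- read.csv(csv_path)
class(toy_csv$group)
```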
That is it for today! Thank you!