---
title: "Introduction to R Part II"
author: "Marcos Sanches"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
---
```{r setup, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print=100)
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
```
# Objective
Last week we wanted to get you more familiar with datasets and how to access elements (subjects, variables, subsamples) within a dataset using base R (meaning, no packages). Today we want to go further and introduce some important functions for those who do statistical analysis. We will still stick to base R.
We strongly believe that giving you some knowledge about base R, even if those things can be done in an easier way through a package, will provide a background that makes it much easier for you to understand and use packages, or pieces of script that you find on the internet. We are teaching you the foundation, given that we cannot teach you each package (there are thousands!).
# A new RStudio project
We want to remind you again how to create a new RStudio project, as we don't want this to be a barrier for you.
1. Create a folder for each data analysis project.
2. Open RStudio.
3. Go to File > New Project > Existing Directory
4. Navigate to the folder you created
5. Click Create Project. An ".Rproj" file will be created in the folder. Next time, double click this file to open your project.
6. Go to File > New File > R Markdown (or R Script) and click Ok.
7. Save the new R Markdown file. Here you will store your R script.
8. Make sure you store all your scripts. Ideally, you want to have in the script every single step, from reading the data to the final results.
So, at this point we assume you have saved all the workshop files in a folder (maybe named "Introduction to R Part II"), and we will create a new project there for today's workshop.
# Cleaning R environment
Again we clean the R environment. We can do that because we are confident the script we will create will reproduce all relevant analyses; we don't need to keep any dataset or analysis results other than the original dataset!
```{r}
rm(list = ls())
```
# Reading our CSV file
We have been through this, so we will not spend much time here. A CSV file can be read directly into R using *read.csv*.
I also like to always inspect variable names and the top lines right away.
```{r}
dt <- read.csv("Self Esteem Dataset.csv")
names(dt)
# Fix an issue with the name of the ID variable.
names(dt)[1] <- "id"
# Take a look at the data
head(dt)
tail(dt)
```
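Another quick way to inspect a freshly imported dataset is *str*, a base R function that shows each variable's type alongside its first values (an optional extra, not required for what follows):

```{r}
# Compact overview: one line per variable with its type and first values
str(dt)
```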
# Factors in R
More than in other software, in R we need to explicitly think about the type of our variables when we do statistical analysis.
*Continuous variables* - These are variables that you can do math with, like age, BMI, PHQ-9 scores... Variables that contain only numbers will be imported into R and used in statistical analysis as continuous variables. However, if the numeric codes are placeholders for categories, then you need to transform them into a *factor* type for many of your analyses.
*Categorical variables* - These are variables that have levels, like marital status, gender, age groups, etc. When a variable that has numeric values as placeholders for levels is imported into R, it is good to transform it into a factor so that R does not use it as a continuous variable in statistical analysis.
*Ordinal variables* - These are categorical variables whose levels have an order, like age or income groups and Likert scales. You should deal with them according to how you want to treat them in the data analysis.
R will often try to fix your variable if it is of the wrong type. But it is only a little bit smart, and it is better if you make sure yourself that all variables are specified as the analysis needs. We will give some examples using our dataset.
```{r}
# lapply applies to each component of dt the function class.
# We see that all variables are integers, except for country, which is character.
# integer variables are treated as continuous in statistical analysis.
lapply(dt, class)
```
So, Gender was imported as an integer, but it is really a categorical variable. Let's fix it! We need to use the function *factor* to transform it into a categorical variable. This is a very useful function in R. We will create a new gender variable that is a factor.
But first let's review the function *table*, which is similar to a frequency table in SPSS and also very useful. Before transforming gender, we want to know what its codes are.
```{r}
table(dt$Gender)
```
We only have codes 1 and 2, which according to the codebook are Male and Female respectively.
By default *table* does not count NA (missing) values, but you can force it to.
```{r}
table(dt$Gender, useNA = "always")
```
There are no missing values. You could use gender as is in a regression model, say, and it would be treated as continuous, which is fine when a variable has only two levels. Still, the good practice is to transform it into a factor to make the output more readable and less confusing.
```{r}
# We tell R which values are associated with which labels and the order they
# should be displayed in the analyses. Here, Females will come first, then Males.
dt$fgender <- factor(dt$Gender, levels = c(2,1), labels = c("Female","Male"))
# We then check with a two-way table
table(dt$fgender, dt$Gender)
# fgender is a factor
class(dt$fgender)
```
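To see why the factor version gives more readable output, compare the coefficient labels that a regression produces with each version of gender (the model itself is only illustrative, not a meaningful analysis):

```{r}
# With the numeric version, R estimates a slope simply named "Gender"
coef(lm(Age ~ Gender, data = dt))
# With the factor version, R labels the contrast with the level name ("Male"),
# using the first level (Female) as the reference category
coef(lm(Age ~ fgender, data = dt))
```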
Let's do it again with Q1. In this case we have some missing values! Let's try different things to see how *factor* works. Trying things and looking at what happens is a big part of learning R!
```{r}
table(dt$Q1, useNA = "always")
# Here we create a factor but don't add labels to its levels.
dt$fQ1.1 <- factor(dt$Q1)
table(dt$fQ1.1)
# Adding labels to it. Since we do not specify the levels, R uses the numeric order.
dt$fQ1.2 <- factor(dt$Q1, labels = c("Str.Disagree","Disagree","Agree","Str.Agree"))
table(dt$fQ1.2)
# And here we change the order the factor variable is displayed.
dt$fQ1.3 <- factor(dt$Q1, levels = c(4,3,2,1),
labels = c("Str. Agree","Agree","Disagree","Str. Disagree"))
table(dt$fQ1.3)
# And here NA becomes a new factor level
dt$fQ1.4 <- factor(dt$Q1, exclude = NULL,
levels = c(NA,4,3,2,1),
labels = c("MISS","SD","D","A","SA"))
table(dt$fQ1.4)
# And here as an ordinal variable. You may need this in some models.
dt$fQ1.5 <- factor(dt$Q1, ordered = TRUE)
table(dt$fQ1.5)
class(dt$fQ1.5)
```
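One classic pitfall worth knowing: *as.numeric* applied to a factor returns the internal level codes, not the original numbers. A small sketch with a toy vector:

```{r}
f <- factor(c(10, 20, 20, 30))
as.numeric(f)                 # level codes 1 2 2 3 -- not what we want
as.numeric(as.character(f))   # the original values 10 20 20 30
```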
## Function cut
As we saw last week, the function *cut* is very useful for creating factors from continuous variables. Here it is again.
```{r}
dt$age_group <- cut(dt$Age,
breaks = c(14,20,30,40,50,120),
labels = c("15 to 20","21 to 30","31 to 40","41 to 50","51 +"))
table(dt$Age, dt$age_group)
```
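Note that by default *cut* creates intervals that are open on the left and closed on the right, which is why the first break above is 14 rather than 15. A toy example:

```{r}
# Default intervals are (a,b]: 14 is excluded and becomes NA
cut(c(14, 15, 20, 21), breaks = c(14, 20, 30))
# right = FALSE flips this to [a,b)
cut(c(14, 15, 20, 21), breaks = c(14, 20, 30), right = FALSE)
```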
# Descriptives
These are simple functions. But notice that they don't handle missing values automatically; you have to explicitly tell them to ignore missing values.
```{r}
# R does not know how to calculate mean of missing values. The result is NA.
mean(dt$Q1)
# Telling R to remove NAs from mean calculation.
mean(dt$Q1, na.rm = TRUE)
# Same for Standard Deviation
sd(dt$Q1, na.rm = TRUE)
# For correlation you also have to specify what to do with NAs, but differently.
cor(dt$Q1, dt$Q2, use = "complete.obs")
# and other functions.
median(dt$Age, na.rm = T)
min(dt$Age, na.rm = T)
max(dt$Age, na.rm = T)
quantile(dt$Age, c(0.25,0.5,0.75), na.rm = TRUE) # quantile errors on NAs unless na.rm = TRUE
```
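Descriptive statistics by group can be obtained with *tapply*, another handy base R function (shown here as an extra; it applies a function to one variable within each level of another):

```{r}
# Mean age within each country
tapply(dt$Age, dt$country, mean, na.rm = TRUE)
# Standard deviation of Q1 by gender
tapply(dt$Q1, dt$fgender, sd, na.rm = TRUE)
```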
## Summary
This one is also very useful and general. It will give summaries of different types, depending on whether you apply it to a dataset, a regression model, or something else.
Notice how *summary* treats variables according to their type.
```{r}
summary(dt)
```
## Hmisc package
There are many packages you can use for descriptive statistics. To find them, you have to search the internet. The R website has a task view page (https://cran.r-project.org/web/views/), which can help you find packages, but there is no task view for descriptive statistics, for example.
```{r}
# install.packages("Hmisc")
# Here we use the function describe, from the Hmisc package, without loading the package.
Hmisc::describe(dt)
# That was a big output; let's look at Age only.
Hmisc::describe(dt$Age)
# And age group
Hmisc::describe(dt$age_group)
```
## pastecs package
Here is another package. It is just to give you an example of packages.
```{r}
# install.packages("pastecs")
pastecs::stat.desc(dt)
# options(scipen=999) # run this if you want to avoid scientific notation in the output
```
# Simple graphs
A sample of graphs in base R. Use the help pages and search the internet to see the different graph types and how to format them. Two weeks from now we will learn ggplot, a package that allows you to create most graphs you will need.
```{r}
plot(dt$Age)
plot(dt$id, dt$Age)
hist(dt$Age)
barplot(table(dt$country))
barplot(table(dt$country, dt$age_group))
barplot(table(dt$country, dt$age_group), beside = TRUE)
boxplot(dt$Age)
boxplot(Age ~ country, data = dt)
```
# Proportion in tables
Base R tables are a bit cumbersome, although we can get proportions and other things from them.
```{r}
t <- table(dt$age_group,dt$country)
t
prop.table(t,1) # 1 means row pct (add to 1 in each row)
prop.table(t,2) # 2 means column pct (add to 1 in each column)
```
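Two base R helpers make these tables easier to read: *addmargins* appends row and column totals, and multiplying proportions by 100 before rounding turns them into percentages:

```{r}
addmargins(t)                     # counts with row and column totals
round(100 * prop.table(t, 2), 1)  # column percentages with one decimal
```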
# Many packages...
Here is just an example of SPSS-like tables in R using the *gmodels* package.
```{r}
#install.packages("gmodels")
library(gmodels)
# CrossTable takes the two vectors directly (it has no data argument).
CrossTable(dt$age_group, dt$country)
# Just column percents
CrossTable(dt$age_group, dt$country,
prop.r = F, prop.t = F, prop.chisq = F)
# Just column percents, formatted like SPSS
CrossTable(dt$age_group, dt$country,
prop.r = F, prop.t = F, prop.chisq = F,
format = "SPSS", digits = 1)
```
# Bonus: R programming
Some extra material that we will probably not cover in the workshop.
## R functions
It is not unusual for R users to write their own functions to customize tasks they need to do. For example, maybe you would like a function that calculates the mean and standard deviation at the same time, so that you don't need to use two functions.
```{r}
# mean and standard deviation
mean(dt$Q1, na.rm = T)
sd(dt$Q1, na.rm = T)
mean_sd <- function(x)
{
m <- mean(x, na.rm = T)
s <- sd(x, na.rm = T)
}
# Nothing is printed: objects created inside a function are local to it
mean_sd(dt$Q1)
mean_sd <- function(x)
{
m <<- mean(x, na.rm = T)
s <<- sd(x, na.rm = T)
}
# The operator "<<-" will create m and s in the environment outside the function
mean_sd(dt$Q1)
m;s
# Or you can ask the function to print results.
mean_sd <- function(x)
{
m <- mean(x, na.rm = T)
s <- sd(x, na.rm = T)
print(m)
print(s)
}
mean_sd(dt$Q1)
# nicer
mean_sd <- function(x)
{
m <- mean(x, na.rm = T)
s <- sd(x, na.rm = T)
cat("Mean = ",m," \nStandard Deviation = ",s)
}
# "cat" will concatenate AND print.
mean_sd(dt$Q1)
# Alternatively, functions can return values.
mean_sd <- function(x)
{
m <- mean(x, na.rm = T)
s <- sd(x, na.rm = T)
x <- c(m,s)
return(x)
}
mean_sd(dt$Q1)
# We can use list to return an object that is easier to work with.
mean_sd <- function(x)
{
m <- mean(x, na.rm = T)
s <- sd(x, na.rm = T)
x <- list(mean = m,sd = s)
return(x)
}
mean_sd(dt$Q1)
a <- mean_sd(dt$Q1)
a$mean
a$sd
```
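Functions can also have arguments with default values, which the caller may override. A small extension of mean_sd (the digits argument is our own invention, just to illustrate the idea):

```{r}
mean_sd <- function(x, digits = 2)
{
  # round to the requested number of digits; the last expression is returned
  m <- round(mean(x, na.rm = TRUE), digits)
  s <- round(sd(x, na.rm = TRUE), digits)
  list(mean = m, sd = s)
}
mean_sd(dt$Q1)              # uses the default, 2 digits
mean_sd(dt$Q1, digits = 4)  # overrides the default
```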
## The "ifelse" command
Conditionals are very useful because we often want to perform tasks that are conditional on something: if 0, then recode to 10, otherwise do nothing. **ifelse** is a specific conditional used a lot to create or recode variables. See below for the conditionals used in programming in R.
```{r}
# A vector.
x <- dt$Q1
table(x, useNA = "always")
# Recoding 1 to NA, otherwise keep the same.
y <- ifelse(x == 1,NA,x)
table(y, useNA = "always")
# Recoding 1=NA, 2 to 4 = 0, anything else goes to 1.
y <- ifelse(x == 1,NA,
ifelse(x > 1,0,1))
table(y, useNA = "always")
# Recode into a categorical variable.
y <- ifelse(is.na(x),NA,
ifelse(x < 3,"1-2","3-4"))
table(x,y,useNA = "always")
# Add to dataset. We can use cbind:
dt = cbind(dt,y)
head(dt)
# Doing it directly
dt$cat_Q1 <- ifelse(is.na(dt$Q1),NA,
ifelse(dt$Q1 < 3,"1-2","3-4"))
head(dt,10)
```
## The IF and ELSE commands
The **if/else** conditional is not applicable to vectors or data columns; it can be applied to individual values only.
```{r}
q = 2
if(q == 0){
w <- NA
} else if (q <= 2){
w <- "1 to 2"
} else if (q >= 3){
w <- "3 to 4"
}
w
```
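To see the restriction concretely: handing **if** a condition of length greater than one is an error in recent R versions (it used to be only a warning before R 4.2), while **ifelse** happily works element by element. A small sketch:

```{r}
cond <- c(TRUE, FALSE)
# if() cannot handle a length-2 condition; we catch the error to show its message
tryCatch(if (cond) "yes" else "no",
         error = function(e) conditionMessage(e))
# ifelse handles the whole vector
ifelse(cond, "yes", "no")
```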
## The for loop
Imagine that we want to use **if/else** to recode our variable Q1 instead of using **ifelse**. We would have a problem because **if/else** applies to individual values only, but Q1 is a data column. What we can do is loop through the values of Q1, recoding them one by one.
```{r}
for (i in (1:nrow(dt))) { #a loop that goes over each line of dt.
if(is.na(dt$Q1[i])){
dt$w[i] <- NA
} else if (dt$Q1[i] %in% c(1,2)){
dt$w[i] <- "1 to 2"
} else if (dt$Q1[i] %in% c(3,4)){
dt$w[i] <- "3 to 4"
}
} # end for loop
head(dt,10)
table(dt$w, dt$cat_Q1, useNA = "always")
# A way of doing things so that we dont need to specify the object name all the time.
with(dt,table(w, cat_Q1, useNA = "always") )
# In comments because attach seems not to work well with rmarkdown and knitr.
# Or we can 'attach' the dataset so that its components (variables) become available as objects.
#attach(dt)
#table(w, cat_Q1, useNA = "always")
#detach(dt)
```
## The APPLY family of function
These are very popular functions in base R that replace **for** loops efficiently. Using **apply** functions is considered more elegant and efficient than using **for** loops, but they are not as easy to understand. In order to master them you will probably have to use them a lot, and challenge yourself to use an **apply** function instead of a simpler **for** loop.
### APPLY(X, MARGIN, FUN)
X = matrix or data frame
MARGIN = 1 for rows, 2 for columns
FUN = function to be applied to the rows or columns
```{r}
names(dt)
# Calculate the mean of each column of dt.
apply(X=dt[,2:11],MARGIN=2, FUN= mean )
apply(dt[,2:11],2, mean )
# Removing NAs
apply(dt[,2:11],2, mean, na.rm = T )
# Calculate the number of NAs for each variable.
x <- function(x){ sum(is.na(x))} # A simple function that sums the NAs in x, whatever x is.
apply(dt, 2, x) # we give the function to apply.
# and we could also have created the function within apply
apply(dt, 2, function(x) sum(is.na(x)))
```
Here we calculate a new variable which is the sum of the scores from Q1 to Q10.
```{r}
self.esteem <- apply(dt[,2:11],1,sum, na.rm = T)
dt <- cbind(dt, self.esteem)
head(dt)
```
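For the common cases of sums and means over rows or columns, base R has optimized shortcuts that are faster than apply and easier to read:

```{r}
# Equivalent to apply(dt[,2:11], 1, sum, na.rm = TRUE)
head(rowSums(dt[, 2:11], na.rm = TRUE))
# Equivalent to apply(dt[,2:11], 2, mean, na.rm = TRUE)
colMeans(dt[, 2:11], na.rm = TRUE)
```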
## LAPPLY(X,FUN)
The lapply function applies FUN to the elements of X. Here you don't need to specify rows or columns. This function tends to work better with data frames than apply, which is designed for the matrix object type.
```{r}
# apply first converts dt to a matrix; since country is character, every column
# becomes character and mean returns NA for all of them.
apply(dt,2,mean, na.rm = T)
# lapply works with the elements of the dataset, which are its variables.
lapply(dt, mean, na.rm = T)
# The result of lapply is a list of items.
x<- lapply(dt,mean, na.rm = T)
class(x)
unlist(x) # we can force x to be a vector rather than a list.
x <- x[!is.na(x)] #remove NAs from x.
x
m <- t(as.data.frame(x)) #force x to be a data.frame, then transpose it.
m
m <- as.data.frame(m)
m
names(m) <- "Mean"
m
```
## SAPPLY(X,FUN)
The output of lapply is a list, which is rarely what we want to work with. We prefer working with a matrix, vector, or data frame... So, sapply will try to simplify the result given by lapply into a vector or array.
```{r}
lapply(dt, mean, na.rm = T)
sapply(dt, mean, na.rm = T)
x <- sapply(dt, mean, na.rm = T)
class(x) # x is a numeric vector, not a list.
```
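A stricter relative of sapply is *vapply*, where you declare the expected type and length of each result; it fails loudly instead of silently returning a surprising structure (a sketch, restricted to the numeric columns):

```{r}
# numeric(1) says: each result must be a single number
vapply(dt[, 2:11], mean, numeric(1), na.rm = TRUE)
```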
That is it for today! Next week we will look into a specific package that is very popular: *dplyr*.