---
title: "Computing Fundamentals - Introduction to R"
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: Marcos Sanches
output:
  html_document:
    theme: united
    highlight: tango
    toc: true
    toc_float: true
    toc_depth: 4
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = F, warning = F)
```
# R and RStudio
**R** is a statistical software and a programming language. As you open R (or RStudio), many statistical functions will be available for you to use.
**RStudio** is an **IDE** (Integrated Development Environment) for **R**. If R is installed on your computer, RStudio should automatically find and open it. The R console appears by default in the bottom left window of RStudio.
This is an introductory workshop, and we hope to give you enough basic background to figure things out on your own as you need. We think that is the most benefit we can get from the time we have. So, we will start by showing you how to **get a working R**.
# Before the Workshop!
## Option 1 - Install it in your computer!
1. **Install R on your computer**. Access the [CRAN website](https://utstat.toronto.edu/cran/), then download and install R according to the system you have. If you have Windows, click **Download R for Windows**, then choose **base**, then download R by clicking **Download R 4.0.3 for Windows** (or whichever version you see there!). Click the downloaded file and follow the instructions to install R. **Note** - If you don't have administrator rights over your computer, you may not be able to install and use R. In this case, go to Option 2 below.
2. **Install RStudio in your computer**. Access the [RStudio website](https://rstudio.com/products/rstudio/download/), download the free version that is appropriate for your system, and install it by clicking the downloaded file and following the instructions.
3. After steps 1 and 2, you should find an icon that opens RStudio on your desktop or in your Windows Start menu. You don't need to open it now, but you can use it to open RStudio and start a new analysis.
We have a [short video here](https://kcniconfluence.camh.ca/display/ED/Learning?preview=/1933823/4358293/Installing%20R%20and%20RStudio.mp4#Learning-InstallingRandRStudio) that also shows how to download and install R and RStudio. Notice that to access this video and many other resources, you will have to use your **CAMH credentials**.
4. **Create a folder** somewhere in your computer and name it "IntroR".
5. [Download the dataset "Self Esteem.csv"](https://kcniconfluence.camh.ca/download/attachments/1933823/Self%20Esteem.csv?version=3&modificationDate=1600183432679&api=v2) and save it in the "IntroR" folder.
6. [Download the R Markdown file "Intro to R.Rmd"]() and the [HTML file "Intro-to-R.html"]() and save them in the "IntroR" folder. (No download links are supplied here since you have already downloaded these files and are reading them!)
7. That is it! You should be ready for the workshop!
## Option 2 - Use RStudio Cloud
You can use R and RStudio on your web browser, no need to install them! Here is how:
1. [Download the dataset "Self Esteem.csv"](https://kcniconfluence.camh.ca/download/attachments/1933823/Self%20Esteem.csv?version=3&modificationDate=1600183432679&api=v2) and save it somewhere on your machine. We will use this dataset in the workshop. The file is on our Confluence page, which will ask for your CAMH credentials.
2. [Download the R Markdown file "Intro to R.Rmd"]() and the [HTML file "Intro-to-R.html"]() and save them on your computer. (No download links are supplied here since you have already downloaded these files and are reading them!)
3. Go to the [RStudio Cloud](https://rstudio.com/products/cloud/) website and click the **Get Started for Free** button.
4. You will need to sign up with an email account, name, and password.
5. To get started with RStudio, first click **Your Workspace** in the pane on the left. Then, at the top right, click **New Project**.
6. RStudio will open. Rename your project by replacing **Your Workspace/Untitled** at the top with **Your Workspace/IntroR**. You can think of this as your folder, where you will save everything related to your data analysis.
7. Click the **Upload** button in the **bottom right** pane of RStudio and upload the files "Self Esteem.csv" and "Intro to R.Rmd" that you downloaded in Steps 1 and 2.
8. You can sign off and close the page. On the day of the workshop, just sign into the [RStudio Cloud](https://rstudio.com/products/cloud/) website again and click "Your Workspace" and the "IntroR" project.
## Option 3 - Follow Along
We realize that it may be a bit difficult for you to do the activities during a Webex workshop, as both screens will be on the same computer (you are watching us on Webex and simultaneously doing things in RStudio). So, if you feel it is easier, just sit back, relax and follow along. You will be able to replicate everything we did on your own by using this markdown file.
# R Markdown warm-up
To get a new markdown document, go to File > New File > R Markdown, check the "HTML" option and click "OK".
**R Markdown** is a way to create nice documents where you have your R code and its output, along with normal explanatory text, all in the same place. It makes it very easy for someone else to understand what you did and modify it, which makes your research very **reproducible**.
R Markdown allows you to add formatting to your text in a simple way. Here are some formatting tips.
# Header
## Header 2
### Header 3
This is a common text. *This is italic text*. **This is bold**. __this is also bold__.
* An item
  + subitem 1
  + subitem 2
    - sub-sub item
A table
column 1 | improved | Did not improve
----------|----------|---------
treatment | 25 | 10
control | 15 | **1**
total | 40 | 11
# Some R scripts
Below we have a **chunk**. Anything inside the chunk will be interpreted as R code, that is, R will try to run it. The exception is anything after a hash **#**, which R interprets as a comment and will not run.
To run a line that is inside a chunk, you just need to put the cursor on that line and press **CTRL + ENTER**. Or go to **Run** at the top right of the editor pane and select a running option there.
Let's run a few things and start getting a feel for R!
```{r}
# Just print something on the R console.
print("Hello World!")
p <- "Hello World!"
p
# The seq function creates a sequence of values.
seq(from=1,to=100,by=5)
# But it is printed in the console and lost. Let's store it in object "a".
a <- seq(from=1,to=100,by=5)
a # Showing what is in a.
# Now we can use the sequence for whatever we like.
a / 7
# Repeating values a certain number of times. It will overwrite object "a".
a <- rep(c(1,5),each=10) # Repeat 1 and 5, each 10 times.
a
# We can generate random numbers, which are important in simulations.
a <- runif(n=100,min=1,max=10)
# We can select a sample with replacement.
sample(c("A","B","C",NA),10, replace = T)
```
And there are many, many more things you can do! But we are interested in Statistics and Data Analysis. So, let's read our dataset into R and start having fun with it!
# Reading a dataset
We have a ".csv" dataset that contains the Self-Esteem questionnaire and some demographics. What we will use is a subset of the complete data which is [publicly available](https://openpsychometrics.org/_rawdata/). It is important to understand the data and we now provide a description of the variables in it.
Here is the wording of questions Q1 to Q10 from the Self-Esteem Scale (see Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press):
> Q1. I feel that I am a person of worth, at least on an equal plane with others.
> Q2. I feel that I have a number of good qualities.
> Q3. (Reverse) All in all, I am inclined to feel that I am a failure.
> Q4. I am able to do things as well as most other people.
> Q5. (Reverse) I feel I do not have much to be proud of.
> Q6. I take a positive attitude toward myself.
> Q7. On the whole, I am satisfied with myself.
> Q8. (Reverse) I wish I could have more respect for myself.
> Q9. (Reverse) I certainly feel useless at times.
> Q10. (Reverse) At times I think I am no good at all.
The items were rated on the following scale: **1=strongly disagree, 2=disagree, 3=agree, and 4=strongly agree (0=no answer)**. The total score is calculated by averaging the 10 questions, after reverse-coding the reversed items.
The other variables are:
**gender**. Chosen from a drop down list (1=male, 2=female, 3=other; 0=none was chosen).
**age**. Entered as a free response.
**source**. How the user came to the page. 1=front page of personality test website, 2=Google search, 3=other.
**country**. Inferred from technical information using MaxMind GeoLite.
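The scoring just described (reverse-code the five reversed items, then average) can be sketched with a made-up response vector — the numbers below are invented for illustration, not taken from the data:

```{r}
# Hypothetical responses to Q1..Q10 for one subject (1-4 scale).
resp <- c(3, 4, 2, 3, 1, 4, 3, 2, 2, 1)
reversed <- c(3, 5, 8, 9, 10)          # positions of the reverse-keyed items
resp[reversed] <- 5 - resp[reversed]   # 5 - x swaps 1<->4 and 2<->3
mean(resp)                             # total score for this subject: 3.4
```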
Since the data is in comma-separated format, we can use the **read.csv** function from base R to read it. R can read other data formats as well (see **Import Dataset** in the _Environment_ tab in the top-right window of RStudio).
Here we read the data and store it in the "se" object.
```{r cleaning, warning=F}
# Reading the data.
se <- read.csv("Self Esteem.csv")
# Visualizing se (too large, so we are commenting it out)
# se
# Visualizing the top 10 lines of se.
head(se,10)
# And the bottom 10 lines.
tail(se,10)
```
## R Help
You can always get quick help for R functions. Here we look at the help for the **head** function. You will notice that the help offered by R always has the same standard format.
```{r}
# Look for help by using a question mark.
# ?head
# You may omit a function parameter, or omit its name.
# If you don't specify n, the default n=6 will be used.
head(x=se)
# If you don't specify "x=", the function can still match "se" to the "x" parameter by position.
head(se)
# Here we specify all parameters and their values.
head(x=se,n=6)
# The same without the parameter names.
head(se,6)
```
## Object types and subsetting
Objects also have **classes**, which indicate the type of the object and determine what can be done with it.
```{r}
# se is a data.frame
class(se)
hse <- head(se,2)
class(hse)
# hse is also a data.frame object. Let's try to transform it into a matrix object.
as.matrix(hse)
# or a character object
as.character(hse)
# or a list object
as.list(hse)
# model1 is a "lm" object, which stands for linear model
model1 <- lm(Q1~Q2+Q3, data = se)
class(model1)
# s1 is of type "summary.lm".
s1 <- summary(model1)
class(s1)
```
Many objects in R have **positions** and you can use positions to subset or extract part of an object. This can help programming.
```{r}
# first column of se (too long)
# se[1]
#columns 1 to 5 (too long)
#se[1:5]
# row 1 to 5 column 1 to 5
se[1:5,1:5]
# row 1 all columns
se[1,]
# Regression object
# In the first position we have the coefficients.
model1[1]
# model1[1] is a "list"
class(model1[1])
# to access objects inside a list we need to use two square brackets
model1[[1]][2]
```
Sometimes you can access components of an object by their name using the **$** operator.
```{r}
# names of the components of model1
names(model1)
# We can get the "coefficients" part of model1 using the "$" operator.
model1$coefficients
# or, since it is the first component
model1[1]
```
# Packages
Base R is what is in R when you install it. Base R already has lots of statistical functions, including plots. However, because R is open source software, many people program their own functions and organize them in a standardized way that is called a **package**. Anyone can download and use these packages, and some of them have become quite popular.
For example, if you want to run a Mixed Effects Model, you will need package "lme4" or "nlme". If you want to create nice plots, you may use "ggplot2". If you want to do data cleaning, chances are "dplyr" will help you a lot.
**install.packages** - function used to install a new package in R.
**update.packages** - function that can be used to update packages when new versions are released. There is usually no problem using older versions of a package, but sometimes updates will fix bugs and add new functions. It is important to keep track of the versions of the packages you used for a given project.
**library** - function that loads the package so that you can use its functions. You can also use **require**.
```{r}
# "arrange" is a function in the "dplyr" package that sorts the data.
# This is how you could sort the data by Age
# sorted.se <- arrange(se,Age) # results in an error
# There is no function "arrange" available because we have not loaded "dplyr".
# Here we install package dplyr (package used for data cleaning)
# install.packages("dplyr")
# now you can use the package by loading it first.
library(dplyr)
# now it works!
sorted.se <- arrange(se,Age)
head(sorted.se)
head(se)
```
Notice that you get some messages:
> The following objects are masked from ‘package:stats’:
> filter, lag
> The following objects are masked from ‘package:base’:
> intersect, setdiff, setequal, union
R is telling you that the *dplyr* package has a function *filter* which also exists in the *stats* package. They are functions with the same name, but they don't do the same thing. Now, if you try to use *filter*, R is going to use the one from the *dplyr* package (the last loaded), not from the *stats* package. If you need *filter* from the *stats* package, you have to tell R that using the **::** operator: *stats::filter*. This kind of masking is very common in R, so it is sometimes safer to use the "::" operator to ensure the correct function is used.
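A quick way to see the masking in action: `stats::filter` applies a linear filter to a numeric series, while `dplyr::filter` subsets rows of a data frame, so calling the wrong one gives very different results. This sketch assumes *dplyr* is installed:

```{r}
x <- 1:6
# Explicitly call the stats version: a 3-point moving average.
stats::filter(x, rep(1/3, 3))
# Explicitly call the dplyr version: keep rows of a data frame where v > 3.
d <- data.frame(v = x)
dplyr::filter(d, v > 3)
```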
Now we look at functions select, filter and mutate from the *dplyr* package.
```{r}
new_data <- select(se, Q1:Q10) # select variables (columns)
head(new_data,10)
new_data <- filter(se, Q1 == 4 | Q1 ==3) # select cases
head(new_data,10)
# Creating new variables with "mutate".
se <- mutate(se, fgender = factor(Gender,labels = c("Male","Female" ))) # gender into factor
```
Notice that the variable *Gender* had values 1 and 2. R will consider it a continuous variable.
The variable *fgender* is defined as a **factor** and will be interpreted by R as a categorical variable. In this case we have defined labels for each category, but it would be a categorical variable even if we had not supplied labels.
The type of variables will be very important when you run regression models in R.
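A minimal sketch of the difference, using a made-up vector rather than the workshop data:

```{r}
g <- c(1, 2, 2, 1, 2)
class(g)    # "numeric" - R would treat this as continuous
fg <- factor(g, labels = c("Male", "Female"))
class(fg)   # "factor" - R treats this as categorical
levels(fg)  # the category labels
```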
# Missing values in R: NA
Missing values in R are represented in the data as **NA**. It does not matter if you have numeric or character variables, R will represent missing as NA. R also has **NaN** (not a number), which results from undefined operations such as 0/0.
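The behaviour of these special values can be checked directly in the console:

```{r}
NA + 1        # any operation involving NA gives NA
0 / 0         # undefined: NaN
1 / 0         # not missing, but infinite: Inf
is.na(NaN)    # TRUE - is.na() catches NaN too
is.nan(NA)    # FALSE - but is.nan() does not catch NA
```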
In order to see some missing values in our data, let's select the records that have NA values in the variable Q5. We can do that by using the *filter* function from the *dplyr* package and the *is.na* function from the base package. The *is.na* function returns *TRUE* whenever it finds an NA value.
```{r}
# We select a subset of the data that has NA at Q5
na.data <- filter(se,is.na(Q5))
# Look at what we got.
na.data
# Applying the is.na function to column Q1.
is.na(na.data$Q1)
# We can have this as a new variable in the dataset.
na.data$Q1.na <- is.na(na.data$Q1)
na.data
# Or as a 0/1 variable
na.data$Q1.na <- is.na(na.data$Q1)*1
na.data
```
Some functions in R will not deal with missing values automatically; the mean is one example. Often when fitting models in R you may get error messages, and it may be because of missing values.
```{r}
# Result is NA because there are NA values in Q1.
mean(se$Q1)
# If we want the mean of the non-missing values, we have to explicitly remove NAs.
mean(se$Q1, na.rm = TRUE)
```
# Some examples of data cleaning
Here we will do a couple of data cleaning and transformation examples, and we will introduce the **%>%** (*pipe*) operator, which can be thought of as a chain operator that takes the result of one operation and uses it as input to the next function.
The %>% operator is from the *magrittr* package, which will automatically load with the *dplyr* package.
```{r}
# first, look at what we have in the variable source.
table(se$source)
# Do we have NAs? No, we don't...
table(se$source, useNA = "always")
# After creating a factor version of source we drop the original from the data.
se <- se %>%
mutate(sc = factor(source, labels = c("Front Page","Google"))) %>%
select(-source) # a minus sign will remove the variable.
### This is how we would do it without the %>% operator.
#se <- mutate(se, sc = factor(source, labels = c("Front Page","Google")))
#se <- select(se, -source)
### Or we could also embed the two lines into one.
#se <- select(mutate(se, sc = factor(source, labels = c("Front Page","Google"))), -source)
# In order to create a total self-esteem score we need to first reverse questions Q3 Q5 Q8 Q9 and Q10, then take the average.
se <- se %>%
mutate(Q3 = 5 - Q3,
Q5 = 5 - Q5,
Q8 = 5 - Q8,
Q9 = 5 - Q9,
Q10 = 5 - Q10)
# Now we can calculate the total score for each subject by using "rowMeans" function.
se <- se %>% mutate(total = rowMeans(.[3:12], na.rm= T))
table(se$total)
# Lets create two groups: High and Low Self-Esteem.
se <- se %>%
mutate(Low_se = factor(ifelse(total < 2,"Yes","No"), levels = c("Yes","No")))
table(se$Low_se)
```
# Statistical Inference
## Confidence Interval and test for a Single Proportion
Let's say low Self-Esteem is problematic and you want to estimate the proportion of individuals in the population that would require treatment for it. Let's consider that subjects with a total score below 2 require treatment. We want to **estimate** the proportion of subjects requiring treatment in the population. Let's say that we know that in the US this proportion is 15% and we want to **test** if it is different in Canada.
```{r}
# Number of subjects for whom total score < 2
table(se$Low_se)
table(se$Low_se)/nrow(se)
# Let's store this table in object "t".
t <- table(se$Low_se)
# Confidence interval, with a test for the proportion being different from 0.15.
prop.test(t, p = 0.15)
binom.test(t, p = 0.15) # The binomial test is exact.
```
## 2 X 2 Tables - Chi-Square and Fisher Test
In the scenario above we calculated only one proportion and tested it against a fixed value. But more often we have more than one proportion and want to compare them, in particular to see if there is an association between two categorical variables. For example, we may want to look at the proportions among Males and Females, and we may want to estimate and test the difference.
```{r}
# Point estimate for the proportion
prop.table(table(se$fgender, se$Low_se),1)
t <- table(se$fgender, se$Low_se)
# Test and confidence interval for the Risk Difference (Chi-square test)
p <- prop.test(t)
p
# Here we can get the actual risk difference, which is not given by prop.test.
p$estimate[1] - p$estimate[2]
# Popular Chi-Square Test
t
chisq.test(t)
# Odds Ratio and Fisher's Exact test
fisher.test(t)
```
**Risk Difference** is a measure of association for which we get a confidence interval in *prop.test*. It is just the risk of low Self-Esteem for Females (24.6%) minus the risk for Males (23.5%).
**Odds Ratio** is another, more popular measure of association, calculated as the odds of low self-esteem for males divided by the odds of low self-esteem for females. The odds for Females are 24.65%/75.35% = 0.327 and for Males 23.5%/76.5% = 0.307. So, the *Odds Ratio* is 0.307/0.327 = 0.939, meaning that the odds of low self-esteem are about 6% lower for males as compared to females. When using the *fisher.test* function we get the estimate for the Odds Ratio and a confidence interval for it.
These are measures of association, and they are important because they are ways in which we can interpret and make sense of the difference between Males and Females. If, however, we want to decide whether or not the difference is real, meaning that Males have a lower prevalence of low self-esteem than Females, then we can conduct a statistical test and get a p-value. The *fisher.test* function returns Fisher's Exact test, and it is safe to use in any case. The *chisq.test* and *prop.test* functions return the Chi-square test; it may be imprecise for small sample sizes, although it is probably more popular than Fisher's test.
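The arithmetic behind these two measures can be reproduced by hand from any 2 x 2 table. The counts below are invented for illustration, not the workshop data:

```{r}
# Hypothetical counts: rows = gender, columns = low self-esteem Yes/No.
tab <- matrix(c(40, 130,    # males:   40 yes, 130 no
                60, 180),   # females: 60 yes, 180 no
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Male", "Female"), c("Yes", "No")))
risk <- tab[, "Yes"] / rowSums(tab)   # risk of low self-esteem per group
risk["Male"] - risk["Female"]         # risk difference
odds <- risk / (1 - risk)             # odds per group
odds["Male"] / odds["Female"]         # odds ratio
```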
## Comparing more than two proportions - Chi-Square and Fisher Test
And here we replace gender by country. In this case, because our sample size is large and the table has many cells, Fisher's Exact test is very computer intensive and we may need to use simulations.
```{r}
t <- table(se$country, se$Low_se)
t
prop.table(t,margin = 1)
# Tests for Overall Association
chisq.test(t)
# fisher.test(t) # We get an error because the computation demands too much memory.
fisher.test(t, simulate.p.value = TRUE, B = 100000)
```
Both the Chi-Square and Fisher's Exact test functions perform a single test for the full table. If you don't want to conduct many tests, you have the option of interpreting the table descriptively. For example, by just looking at the row proportion table we could interpret the association by saying that the risk of low self-esteem seems lower in India than in the other countries.
## One Sample t-test
The one sample t-test is used when we want to compare the mean of a continuous variable with a fixed value. Here we look at our Self-Esteem score, which ranges from 1 to 4, and test if its mean is different from 2. This test is not very common, because most of the time we want to compare means between two groups instead of comparing a mean with a fixed value. The **t.test** function, when given only one variable, will do the one sample t-test and also return estimation results (mean and confidence interval).
```{r}
t.test(se$total, mu = 2)
##### Only for Canada
# se$total is a column. We select only the elements in the positions where se$country == "CA".
t.test(se$total[se$country == "CA"], mu = 2)
# Or we can use 'filter' and 'select' function from dplyr package and the pipe operator.
se %>% filter(country == "CA") %>% select(total) %>% t.test(mu = 2)
# Show the variables involved in the calculation.
se %>% select(country, total) %>% head(20)
```
## Two sample t-test
Used when the goal is to compare the means of two independent samples. Usually "independent" means two groups of subjects, where the subjects are not the same in both groups.
Entering the data for the *t.test* function can look a bit tricky. One way of doing it is by using the formula notation in R, which specifies the dependent variable (continuous variable), a tilde (~), and the independent variable (grouping variable). A perhaps easier way is to just give *t.test* the two vectors whose means we want to compare.
```{r}
# Using the dep ~ indep formula notation.
t.test(total ~ fgender, data = se)
# Giving the function the vectors we want to compare.
t.test(se$total[se$fgender == "Female"],
se$total[se$fgender == "Male"])
```
The *t.test* function will give us estimation results (Means for each group and confidence interval for the mean difference) as well as testing results (t-value, df and p-value).
The traditional t-test for two independent samples assumes equal variance in both samples, and this assumption is easily violated. So, by default, *t.test* in R does not make that assumption; it adjusts the t-test so that it works well with unequal variances. For that reason, it is called the Welch t-test.
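You can request the classical equal-variance test explicitly with `var.equal = TRUE`. Here is a sketch on simulated data, just to show the switch:

```{r}
set.seed(123)
g1 <- rnorm(30, mean = 0, sd = 1)
g2 <- rnorm(30, mean = 0.5, sd = 2)   # deliberately unequal variance
t.test(g1, g2)                        # Welch t-test (the default)
t.test(g1, g2, var.equal = TRUE)      # classical Student t-test
```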
## Paired t-test
The paired t-test is used when the means to be compared are collected in the same sample (usually in the same patients). The most common example of this situation at CAMH are pre-post samples where you want to compare the mean at pre and post (that is, the change over time). It can also happen when two raters rate the same patient and you want to compare their mean rating. In order to demonstrate the test we will compare the mean of Q1 and Q2, which are two questions completed by the same subjects, so we need paired t-test in order to compare them.
```{r}
t.test(se$Q1, se$Q2, paired = TRUE)
# Remember that you can always store your results in an object
t <- t.test(se$Q1, se$Q2, paired = TRUE)
# then retrieve parts of it.
ls(t) # ls tells us the parts t has
t$p.value # then we get the one we want
```
The way the *paired t-test* works is that it creates a third variable, Q1 - Q2, and uses the one sample t-test to test whether this difference is zero. So, here we see that the mean of the difference is -0.079, that is, the mean of Q2 is larger than the mean of Q1. We are also given a confidence interval for this mean and the results of the statistical test.
The other thing about the t-test is that it assumes normality of the data. This assumption is more important for small sample sizes but tends to be less of a problem for larger samples.
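We can check that the paired t-test is indeed a one-sample t-test on the differences, using simulated pre/post data:

```{r}
set.seed(42)
pre  <- rnorm(25, mean = 10, sd = 2)
post <- pre + rnorm(25, mean = 0.5, sd = 1)   # simulated improvement
t.test(pre, post, paired = TRUE)$p.value
t.test(pre - post, mu = 0)$p.value            # identical p-value
```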
## Normality test
Normality is not a hugely important assumption, meaning that the tests are usually good approximations even when the data is not normally distributed. If you have a large sample size, meaningless departures from normality will be significant. If you have a small sample size, a non-significant normality test is not evidence for normality (only lack of evidence for non-normality).
```{r}
# visual inspection
hist(se$Q1 - se$Q2) # normality for the paired t-test
hist(se$total) # normality for independent sample t-test that uses total score.
# actual statistical test for normality
shapiro.test(se$total)
```
## ANOVA/ Regression
ANOVA stands for Analysis of Variance. It is usually applied when we want to compare means across three or more groups, although when there are only two groups it reduces to the two independent sample t-test.
Regression models are just an extension of ANOVA that allows you to control for other variables if you want to.
In order to account for unbalanced data and use the correct sums of squares, ANOVA is best done in R in two steps. Step 1 is to fit a regression model, and step 2 is to use that model result to run the ANOVA. Although base R has the functions **anova** and **aov**, they provide type I (sequential) sums of squares, which is usually not what you want. The **Anova** function in the **car** package allows you to replicate analyses that you would do in SPSS or SAS.
NOTE - The **anova** function in base R gives you type I tests, not type III tests. Type I tests will give you the same results as type III tests if you have only one predictor (we only have country), but they will be different if you have more than one predictor.
```{r}
# First we fit a regression model.
# dependent variable is always the continuous variable
# independent variable is always the categorical variable.
m <- lm(total ~ country, data = se)
summary(m) # This will show you a summary of the regression results.
# The regression result is stored in object m, which is fed to the Anova function
# from the "car" package - here we install but do not load car.
# install.packages("car")
# Anova(m,type = 3) # Results in an error if you don't load car.
a <- car::Anova(m, type = 3)
a
# If we load the package car, then we can use Anova directly.
library(car)
Anova(m,type = 3)
# in this case, with only one predictor, we could use the anova function from base R.
anova(m)
```
We can see that the variable *country* has a low p-value. That means we have good evidence of differences in average Self-Esteem score between the five countries. What naturally happens after such a result is that one becomes interested in the details of the differences between the groups. The first step is to get the average Self-Esteem for each country. We can get that from the regression model directly by using estimated marginal means (means for the outcome adjusted for the other variables in the model, which in this case are none).
```{r}
# install.packages("emmeans")
ls <- emmeans::emmeans(m, specs = ~ country)
ls
```
We see the estimated means for each country, their Standard Errors and Confidence Intervals. A simple way to test all pairwise differences between groups is to use the **pairwise.t.test** function.
```{r}
pairwise.t.test(se$total, se$country, p.adj = "bonf")
# or from the model - this is advantageous if we have other variables in the model
ls <- emmeans::emmeans(m,specs = pairwise ~ country, adj = "bonferroni")
ls
```
The Tukey test is less conservative than Bonferroni, so it is also very popular. It needs an object of class *aov*, which is also an analysis of variance object. We can see that the p-values are lower, reflecting the fact that they have been through a less conservative adjustment.
```{r}
TukeyHSD(aov(m))
```
## Non-Parametric tests
Non-parametric tests are so called because they don't use distributional assumptions. The Normal distribution is a *parametric* distribution, so tests that don't assume normality, or don't make distributional assumptions at all, are called non-parametric. In general these tests convert the original data into ranks, so they are generally considered appropriate for ordinal data and are robust to outliers in the data.
Below are the non-parametric tests for comparing two independent groups and two dependent groups, which are often used as replacements for the independent sample and paired sample t-tests, respectively.
```{r}
# Test for Independent samples (Mann-Whitney U test)
wilcox.test(total ~ fgender, data = se, conf.int = TRUE)
# Test for paired samples (Wilcoxon signed rank test)
wilcox.test(se$Q1, se$Q2, paired = TRUE, conf.int = TRUE)
```
The test for independent sample is known as Mann-Whitney U test, and the paired test is known as Wilcoxon signed rank test. These tests don't compare the means, but the overall distribution of the data.
## Non-parametric ANOVA (Kruskal-Wallis)
ANOVA assumes normality and equal variance in all groups. If the outcome departs from normality or is on an ordinal scale, you may want to use a non-parametric test equivalent to ANOVA. The Kruskal-Wallis test converts the outcome into ranks and does the calculation in a way that does not require the assumptions of ANOVA.
Usually, when one finds evidence of association, pairwise non-parametric comparisons are conducted using the Mann-Whitney U test.
```{r}
kruskal.test(total ~ country, data = se)
# All pairwise comparisons with False Discovery Rate adjustment. We could have used Bonferroni as well.
pairwise.wilcox.test(se$total, se$country, p.adjust.method = "fdr")
```
## Pearson Correlation
The correlation coefficient is used to measure the **linear association** between two continuous variables. It varies between -1 and 1, with values close to zero indicating lack of association, values close to 1 indicating a strong positive association, and values close to -1 a strong negative association.
```{r}
cor(se[3:12], use = "pairwise.complete.obs")
cor.test(se$Q1, se$Q2) # cor.test drops incomplete pairs automatically.
```
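As a side note, the Pearson coefficient is just the covariance scaled by the two standard deviations; a quick check with toy vectors:

```{r}
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
cor(x, y)
cov(x, y) / (sd(x) * sd(y))   # same number
```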
## Non-Parametric Correlation
Pearson correlation assumes that the data is continuous and normally distributed. The Spearman correlation, on the other hand, is based on ranks and does not make assumptions about the distribution of the data.
```{r}
# Non-parametric spearman correlation using Q1 to Q10 and Total columns.
cor(se[c(2:11,17)], use = "pairwise.complete.obs", method = "spearman")
# Alternatively we could use chained operations. Here we store the correlation in object a.
a <- se %>% select(Q1:Q10,total) %>% cor(use = "pairwise.complete.obs", method = "spearman")
# Test whether the Spearman correlation is significant.
cor.test(se$Q1, se$Q2, method = "spearman")
# The corrplot package allows you to create nice plots with correlation matrices.
# install.packages("corrplot")
corrplot::corrplot(a)
```
# Linear Regression Models
An example of using the function **lm** in base R to run a linear regression model. The operator ":" indicates an interaction in the model, and the operator "*" indicates the interaction plus the lower-order main effects.
```{r}
# Store regression results in object m1.
m1 <- lm(total ~ fgender + country + fgender:sc + Age*sc, data = se)
# Look at the results
summary(m1)
# Test each effect using omnibus ANOVA table.
car::Anova(m1, type = 3)
```
# Logistic Regression Models
Here we use the function **glm** to fit a logistic regression model to a binary outcome in order to model the probability of having low self-esteem. This is all similar to the linear model, except that here we need to specify the **binomial** family for the logistic regression model.
```{r}
m2 <- glm(Low_se ~ fgender + country + fgender:sc + Age*sc, data = se,
family = binomial)
summary(m2)
car::Anova(m2, type = 3)
```
# Mixed Models
Mixed Models are often used at CAMH when we have **repeated measures** data. Such data is called **longitudinal data** if the repeated measures are over time, as when we follow subjects for a while, measuring them at different time points. At other times, as when we measure multiple brain regions in the same subject, the repeated measures are not over time. Statistically, both situations are similar: the repeated measurements within the same sampling unit (usually subjects) cause the data to have a clustering structure, where measures within a cluster are more similar to each other (correlated) than measures from different clusters.
The subjects will usually be the clusters at CAMH. In that sense, as an example, depression measures within the same patient at different time points will be correlated. This correlation needs to be accounted for in the regression model, and the simplest way to do that is by declaring the clusters (subjects) as random effects.
In our example dataset, we don't have repeated measures within the same subject (a metric is only measured once within each subject), but we have countries. Countries can be seen as clusters in that measures within a country will tend to be more similar than measures across countries. This also happens when we collect data from patients over multiple centers: the centers are usually considered clusters and modeled as random effects.
The function **lmer** from package **lme4** is the most popular way to run mixed models in R. Here we have the same model as above, except that we specify country as a random effect (clusters). Since p-values for mixed models are somewhat controversial, the **lme4** implementation does not calculate them. However, if you use the **lmerTest** package, the same **lmer** function followed by **anova** will give you p-values.
Notice that in a dataset like this you should control for countries either by declaring them as random effects or as fixed effects (just add them to the linear model above). The decision whether something should be specified as a random effect is a bit complicated and without much consensus. Usually, if a variable can be seen as a clustering structure (units within it will be more similar than units across it) or as sampling clusters (if you sample schools, then students within schools, then schools are clusters), you declare it as a random effect. But this is a very basic rule, and this workshop does not intend to teach mixed models.
```{r}
m3 <- lme4::lmer(total ~ fgender + fgender:sc + Age*sc +
(1|country), data = se)
summary(m3)
# No p-value and Type II test.
anova(m3)
# To get type 3 test we use lmer function from lmerTest package
m3 <- lmerTest::lmer(total ~ fgender + fgender:sc + Age*sc +
(1|country), data = se)
summary(m3)
anova(m3)
```