1 R and RStudio

R is a statistical software and a programming language. As you open R (or RStudio), many statistical functions will be available for you to use.

RStudio is an IDE (Integrated Development Environment) for R. If R is installed in your computer, RStudio should automatically find and open it. The R console appears by default on the bottom left window of RStudio.

This is an introductory workshop and we hope to give you some basic background that will allow you to figure out things on your won, as you need. We thought that this is the most benefit we can get from the time we have. So, we will start by showing you how to get a working R.

2 Before the Workshop!

2.1 Option 1 - Install it in your computer!

  1. Install R in your computer. Access the CRAN website, download and install R according to the system you have. If you have Windows, click in Download R for Windowns, then choose base, then doenload R by clicking in Download R 4.0.3 for Windown (or whichever version you see there!). Click the downloaded file and follow the instructions to install R. Note - If you don’t have administrator rights over your computer, you may not be able to install and use R. In this case, go to Option 2 below.

  2. Install RStudio in your computer. Access the RStudio website, download the free version that is appropriate for your system, and install it by clicking the downloaded file and following the instructions.

  3. After steps 1 and 2, you should find an icon that opens RStudio in your desktop or in your Windows Start menu. You don’t need to open it now, but you can use it to open RStudio and start a new analysis.

We have a short video here that also shows how to download and install R and RStudio. Notice that to access this video and many other resources, you will have to use your CAMH credentials.

  1. Create a folder somewhere in your computer and name it “IntroR”.

  2. Download the dataset “Self Esteem.csv” and save it in the “IntroR” folder.

  3. Download the R Markdown file “Intro to R.Rmd” and the HTML file Intro-to-R.html" and save them in the “IntroR” folder. (Sorry, no link for downloads are supplied since you have already downloaded these files and is reading them!).

  4. That is it! You should be ready for the workshop!

2.2 Option 2 - Use RStudio Cloud

You can use R and RStudio on your web browser, no need to install them! Here is how:

  1. Download the dataset “Self Esteem.csv” and save it somewhere in your machine. We will use this dataset in the workshop. This is in our Confluence page and it will ask for your CAMH credentials.

  2. Download the R Markdown file “Intro to R.Rmd” and the HTML file Intro-to-R.html" and save them in your computer. (Sorry, no link for downloads are supplied since you have already downloaded these files and is reading them!).

  3. Go to the RStudio Cloud website and click get started for free button.

  4. You will need to sign up with an email account, name, password.

  5. To get started with RStudio, first click in Your Workspace on the pane on the left. Then on top right, click New Project

  6. RStudio will open. Rename your project by replacing Your Workspace/Untitled at the top by Your Workspace/IntroR. You can think of this as your folder, where you will save all about your data analysis.

  7. Click the button upload on the bottom right part of RStudio and upload the file “Self Esteem.csv” and “Intro to R.Rmd” that you downloaded in Steps 1 and 2.

  8. You can sign off and close the page. In the day of the workshop, just sign into RStudio Cloud website again and click “Your Workspace” and “IntroR” project.

2.3 Option 3 - Follow Along

We realize that it may be a bit difficult for you to do the activities in a Webex workshop as both screens will be in the same computer (You are watching us on Webex and simultaneously doing things on RStudio). So, if you feel it is easier, just sit back, relax and follow along. You would be able to replicate all we did on your own by using this markdown file.

3 R Markdown warm-up

To get a new markdown document, go to File > New File > R Markdown, check “HTML” option and click “ok”.

R markdown It is a way to create nice documents where you have your R code and its output, along with normal explanatory text. All in the same place. It makes it very easy for someone else to understand what you did and modify it, which makes your research very reproducible.

Rmarkdown allows you to add formatting to your text in a simple way. Here are some formatting tips.

5 Some R scripts

Below we have a chunk. Anything inside the chunk will be interpreted as R code, that is, R will try to run it. Except when there is a hash-tag #, in which case R interprets as a comment and will not run it.

To run a line that is inside a chunk, you just need to put the pointer on that line and press CTRL + ENTER. Or go to Run on the top-right part of your markdown and select a running option there.

Lets run a few things and start getting a feeling for R!

# Just print something on the R console.
print("Hello World!")
## [1] "Hello World!"
p <- "Hello World!"
p 
## [1] "Hello World!"
# The seq function creates a sequence of values.
seq(from=1,to=100,by=5)
##  [1]  1  6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
# But it is printed in the console and lost. Lets store it in object "a".
a <- seq(from=1,to=100,by=5)
a # Showing what is in a.
##  [1]  1  6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
# Now we can use the sequence for whatever we like.
a /7
##  [1]  0.1428571  0.8571429  1.5714286  2.2857143  3.0000000  3.7142857
##  [7]  4.4285714  5.1428571  5.8571429  6.5714286  7.2857143  8.0000000
## [13]  8.7142857  9.4285714 10.1428571 10.8571429 11.5714286 12.2857143
## [19] 13.0000000 13.7142857
# Repeating values a certain number of times. It will overwrite object "a".
a <- rep(c(1,5),each=10) # Repeat 1 and 5, each 10 times.
a
##  [1] 1 1 1 1 1 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5
# We can generate random numbers, which are important in simulations.
a <- runif(n=100,min=1,max=10)

# We can select a sample with replacement.
sample(c("A","B","C",NA),10, replace = T)
##  [1] "A" "A" "C" "A" "B" "A" "C" "C" "C" "A"

And there are many, many more things you can do! But we are interested in Statistics and Data Analysis. So, lets read our dataset into R and start having fun it it!

6 Reading a dataset

We have a “.csv” dataset that contains the Self-Esteem questionnaire and some demographics. What we will use is a subset of the complete data which is publicly available. It is important to understand the data and we now provide a description of the variables in it.

Here are the working of questions Q1 to Q10 from Self-Esteem Scale (see Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press).:

Q1. I feel that I am a person of worth, at least on an equal plane with others.

Q2. I feel that I have a number of good qualities.

Q3. (Reverse) All in all, I am inclined to feel that I am a failure.

Q4. I am able to do things as well as most other people.

Q5. (Reverse) I feel I do not have much to be proud of.

Q6. I take a positive attitude toward myself.

Q7. On the whole, I am satisfied with myself.

Q8. (Reverse) I wish I could have more respect for myself.

Q9. (Reverse) I certainly feel useless at times.

Q10. (Reverse) At times I think I am no good at all.

The items were rated on the following scale where 1=strongly disagree, 2=disagree, 3=agree, and 4=strongly agree (0=no answer). Total score is calculated by averaging the 10 questions.

The other variables are:

gender. Chosen from a drop down list (1=male, 2=female, 3=other; 0=none was chosen).

age. Entered as a free response.

source. How the user came to the page. 1=front page of personality test website, 2=Google search, 3=other.

country. Inferred from technical information using MaxMind GeoLite.

Since the data is in comma-separated format, we can use read.csv function from base R to read it. R can read other data formats as well (see impost Dataset in the Environment tab in the top-right window in RStudio).

Here we read the data and store it in the “se” object.

# Reading the data.
se <- read.csv("Self Esteem.csv")

# Visualizing se (too large, so we are commenting it out)
# se

# visualizing the top 10 lines of se
head(se,10)
##    ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1     11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2     31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3     56  2  4  2  2  3  3  2  4  3   2      2  49      1      CA
## 4     92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 5    114  2  2  3  2  3  2  2  3  4   4      1  39      1      US
## 6    123  3  4  2  4  2  3  3  2  3   2      2  35      1      US
## 7    161  4  4  4  4  4  3  1  2  1   2      2  33      1      US
## 8    207  3  3  2  4  2  3  3  2  2   1      2  26      1      US
## 9    242  3  3  2  2  2  2  2  2  3   3      2  59      1      US
## 10   250  4  4  1  4  1  4  2  3  2   1      1  27      1      US
tail(se,10)
##      ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 2485 47845  3  2  3  2  3  2  2  3  4   4      2  19      1      GB
## 2486 47878  4  4  1  4  1  3  4  1  1   1      2  33      1      US
## 2487 47882  4  3  3  3  1  4  4  1  2   1      2  22      1      CA
## 2488 47899  3  3  2  3  2  2  2  2  4   2      1  17      1      GB
## 2489 47912  3  3  2  3  2  3  2  3  3   3      2  43      1      US
## 2490 47918  4  4  1  4  1  4  4  2  1   1      2  17      1      US
## 2491 47926  4  4  1  3  1  3  4  3  3   3      1  32      1      GB
## 2492 47932  3  4  3  4  1  3  2  3  4   4      2  27      1      GB
## 2493 47943  1  2  3  3  3  1  2  4  4   4      2  17      1      GB
## 2494 47946  3  3  1  3  1  3  3  2  2   1      2  22      2      US

6.1 R Help

You can always get a quick help for R functions. Here we look at the help for the head function. You will notice that the help offered by R has always the same standard format.

# Look for help by using a question mark.
# ?head

# You may not specify a function parameter, or its name.
# If you dont specify n, the default n=6 will be used.
head(x=se)
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1    11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2    31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3    56  2  4  2  2  3  3  2  4  3   2      2  49      1      CA
## 4    92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 5   114  2  2  3  2  3  2  2  3  4   4      1  39      1      US
## 6   123  3  4  2  4  2  3  3  2  3   2      2  35      1      US
#If you dont specify "x=", the function can still guess that "se" is an entry for "x" parameter.
head(se)
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1    11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2    31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3    56  2  4  2  2  3  3  2  4  3   2      2  49      1      CA
## 4    92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 5   114  2  2  3  2  3  2  2  3  4   4      1  39      1      US
## 6   123  3  4  2  4  2  3  3  2  3   2      2  35      1      US
#Here we specify all parameters and their values.
head(x=se,n=6)
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1    11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2    31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3    56  2  4  2  2  3  3  2  4  3   2      2  49      1      CA
## 4    92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 5   114  2  2  3  2  3  2  2  3  4   4      1  39      1      US
## 6   123  3  4  2  4  2  3  3  2  3   2      2  35      1      US
#the same without the parameter name.
head(se,6)
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1    11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2    31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3    56  2  4  2  2  3  3  2  4  3   2      2  49      1      CA
## 4    92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 5   114  2  2  3  2  3  2  2  3  4   4      1  39      1      US
## 6   123  3  4  2  4  2  3  3  2  3   2      2  35      1      US

6.2 Object types and subsetting.

Objects will also have classes, which indicates the type of the object and that is related to what can be done with it.

# se is a data.frame
class(se)
## [1] "data.frame"
hse <- head(se,2)
class(hse)
## [1] "data.frame"
# it is a data.frame object. Lets try to transform it into a matrix object.
as.matrix(hse)
##   ï..ID Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10 Gender Age  source country
## 1 "11"  "4" "4" "3" "4" "3" "3" "2" "2" "3" "3" "1"    "42" "1"    "US"   
## 2 "31"  "3" "4" "1" "4" "2" "4" "3" "2" "4" "1" "1"    "33" "1"    "US"
# or a character object
as.character(hse)
##  [1] "c(11, 31)"         "4:3"               "c(4, 4)"          
##  [4] "c(3, 1)"           "c(4, 4)"           "3:2"              
##  [7] "3:4"               "2:3"               "c(2, 2)"          
## [10] "3:4"               "c(3, 1)"           "c(1, 1)"          
## [13] "c(42, 33)"         "c(1, 1)"           "c(\"US\", \"US\")"
# or a list object
as.list(hse)
## $ï..ID
## [1] 11 31
## 
## $Q1
## [1] 4 3
## 
## $Q2
## [1] 4 4
## 
## $Q3
## [1] 3 1
## 
## $Q4
## [1] 4 4
## 
## $Q5
## [1] 3 2
## 
## $Q6
## [1] 3 4
## 
## $Q7
## [1] 2 3
## 
## $Q8
## [1] 2 2
## 
## $Q9
## [1] 3 4
## 
## $Q10
## [1] 3 1
## 
## $Gender
## [1] 1 1
## 
## $Age
## [1] 42 33
## 
## $source
## [1] 1 1
## 
## $country
## [1] "US" "US"
# model1 is a "lm" object, which stands for linear model
model1 <- lm(Q1~Q2+Q3, data = se)
class(model1)
## [1] "lm"
#s1 is of the type "summary.lm"
s1 <- summary(model1)
class(s1)
## [1] "summary.lm"

Many objects in R have positions and you can use positions to subset or extract part of an object. This can help programming.

# first column of se (too long)
# se[1]
#columns 1 to 5 (too long)
#se[1:5]
# row 1 to 5 column 1 to 5
se[1:5,1:5]
##   ï..ID Q1 Q2 Q3 Q4
## 1    11  4  4  3  4
## 2    31  3  4  1  4
## 3    56  2  4  2  2
## 4    92  3  4  2  4
## 5   114  2  2  3  2
# row 1 all columns
se[1,]
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1    11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
# Regression object
# In the firs position we have coefficients
model1[1]
## $coefficients
## (Intercept)          Q2          Q3 
##   1.5142097   0.6597251  -0.2320270
# model1[1] is a "list"
class(model1[1])
## [1] "list"
# to access objects inside a list we need to use two square brackets
model1[[1]][2]
##        Q2 
## 0.6597251

Sometimes you can access components of an object by their name using the $ operator.

# names of the components of model1
names(model1)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"
# We can get the "coefficients" part of model1 using the "$" operator.
model1$coefficients
## (Intercept)          Q2          Q3 
##   1.5142097   0.6597251  -0.2320270
# or, since it is the first component
model1[4]
## $rank
## [1] 3

7 Packages

The base R is what is in R when you install it. Base R has already lots of statistical functions including plots. However, because R is an open source software, many people program their own functions and organize them in a standardized way that is called package. Anyone can download and use these packages and some of them became quite popular.

For example, if you want to run a Mixed Effect Model, you will need package “lme4” or “nlme”. If you want to create nice plots, you may use “ggplot2”. If you wan to do data cleaning chances are the “dplyr” will help you a lot.

install.packages - function used to install a new package in R.

update.packages - function that can be used to update packages if they become old. There is usually no problem using old versions of a package, but sometimes updates will fix bugs and add new functions. It is important to keep track of the version of the package you used for a given project.

library - function that loads the package so that you can use its functions. You can also use require.

# "arrange" is a function in the "dplyr" package that sorts the data.
# This is how you could sort the data by Age
# sorted.se <- arrange(se,Age)  # result in an error

# There is not function "arrange" because we have not loaded "dplyr"

# Here we install package dplyr (package used for data cleaning)
# install.packages("dplyr")

# now you can use the package by loading it first.
library(dplyr)

# now it works!
sorted.se <- arrange(se,Age)
head(sorted.se)
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1  1204  3  3  3  2  2 NA  2  4  4   4      2  15      1      CA
## 2  2742  4  4  2  4  2  4  4  3  2   1      1  15      1      AU
## 3  2802  3  3  4  2  3  2  2  3  4   3      2  15      1      US
## 4  3156  2  3  3  4  2  1  1  4  4   4      2  15      1      US
## 5  3202  2  2  3  2  4  1  1  2  3   4      2  15      1      CA
## 6  4226  3  3  2  3  3  2  2  3  3   3      2  15      1      US
head(se)
##   ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1    11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2    31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3    56  2  4  2  2  3  3  2  4  3   2      2  49      1      CA
## 4    92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 5   114  2  2  3  2  3  2  2  3  4   4      1  39      1      US
## 6   123  3  4  2  4  2  3  3  2  3   2      2  35      1      US

Notice that you get some warnings:

The following objects are masked from ‘package:stats’: filter, lag The following objects are masked from ‘package:base’: intersect, setdiff, setequal, union

R is telling you that the dplyr package has a function filter which also exists in the stats package. They are functions with the same name, but they don’t do the same thing. Now, if you try to use filter R is going to use the one from dplyr package (the last loaded), not from stats package. If you need to use filter from stats package, you need to tell that to R using the :: operator: stats::filter. It is very common that this kind of thing happens in R, so that sometimes it is safer to use the “::” operator to ensure the correct function is used.

Now we look at functions select, filter and mutate from the dplyr package.

new_data <- select(se, Q1:Q10)  # select variables (columns)
head(new_data,10)
##    Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10
## 1   4  4  3  4  3  3  2  2  3   3
## 2   3  4  1  4  2  4  3  2  4   1
## 3   2  4  2  2  3  3  2  4  3   2
## 4   3  4  2  4  2  3  1  3  3   3
## 5   2  2  3  2  3  2  2  3  4   4
## 6   3  4  2  4  2  3  3  2  3   2
## 7   4  4  4  4  4  3  1  2  1   2
## 8   3  3  2  4  2  3  3  2  2   1
## 9   3  3  2  2  2  2  2  2  3   3
## 10  4  4  1  4  1  4  2  3  2   1
new_data <- filter(se, Q1 == 4 | Q1 ==3) # select cases
head(new_data,10)
##    ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country
## 1     11  4  4  3  4  3  3  2  2  3   3      1  42      1      US
## 2     31  3  4  1  4  2  4  3  2  4   1      1  33      1      US
## 3     92  3  4  2  4  2  3  1  3  3   3      1  27      1      US
## 4    123  3  4  2  4  2  3  3  2  3   2      2  35      1      US
## 5    161  4  4  4  4  4  3  1  2  1   2      2  33      1      US
## 6    207  3  3  2  4  2  3  3  2  2   1      2  26      1      US
## 7    242  3  3  2  2  2  2  2  2  3   3      2  59      1      US
## 8    250  4  4  1  4  1  4  2  3  2   1      1  27      1      US
## 9    269  4  4  1  3  1  4  4  1  1   1      2  22      1      US
## 10   315  3  3  2 NA  2  3  4  2  3   2      2  50      1      GB
# Creating new variables with "mutate".
se <- mutate(se, fgender = factor(Gender,labels = c("Male","Female" ))) # gender into factor

Notice that the variable Gender had values 1 and 2. R will consider it a continuous variable.

The variable fgender is defined as a factor and will be interpreted by R as a categorical variable. In this case we have defined labels for each category, but it would be a categorical variable even if we had not supplied labels.

The type of variables will be very important when you run regression models in R.

8 Missing values in R: NA

Missing values in R are represented in the data as the character NA. It does not matter if you have numeric or character variables, R will represent missing as NA. NaN can also be used if the missing data is the result of a division by zero.

In order to see some missing values in our data, lets select the records that have NA values in the variable Q5. We can do that by using the filter functon from the dplyr package and the is.na function from the base package. The is.na function returns TRUE whenever it finds the NA value.

# We select a subset of the data that has NA at Q5
na.data <- filter(se,is.na(Q5))
# Look at what we got.
na.data
##    ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country fgender
## 1    986  3  3  4  3 NA  2  2  4  4   3      1  43      1      AU    Male
## 2   3556  4  4  1  4 NA  4  4  1  1   1      2  43      1      US  Female
## 3   5662  2  3  2  3 NA  2  2  4  3   3      2  16      1      AU  Female
## 4   5795  3  3  2  3 NA  3  3  3  3   3      2  16      1      AU  Female
## 5   9739  3  3  3  3 NA  1  1  1  3   3      2  17      1      US  Female
## 6  11981  2  2  4  3 NA  1  1  4  4   4      1  18      1      US    Male
## 7  14882  2 NA NA  2 NA  2 NA  3 NA   3      1  43      2      GB    Male
## 8  15907 NA NA NA NA NA NA NA NA NA  NA      2  20      1      CA  Female
## 9  17121  3  3  2  2 NA  2  3  3  2   3      1  40      1      CA    Male
## 10 19533  2  3  3  4 NA  2  2  3  4   3      2  18      1      US  Female
## 11 20869  3  3  3  3 NA  2  2  3  3   2      2  40      2      CA  Female
## 12 24616  1  2  4  2 NA  1  1  1  4   4      1  15      1      US    Male
## 13 25840  1  3  4  3 NA  3  3  3  2   2      2  17      1      IN  Female
## 14 28997  4  4  1  4 NA  3  4  1  3   2      1  44      1      US    Male
## 15 40825 NA NA NA NA NA NA NA NA NA  NA      1  56      2      GB    Male
## 16 45165 NA NA NA NA NA NA NA NA NA  NA      1  24      2      US    Male
# Appplying the is.na function at column Q1.
is.na(na.data$Q1)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE
# We can have this as a new variable in the dataset.
na.data$Q1.na <- is.na(na.data$Q1)
na.data
##    ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country fgender Q1.na
## 1    986  3  3  4  3 NA  2  2  4  4   3      1  43      1      AU    Male FALSE
## 2   3556  4  4  1  4 NA  4  4  1  1   1      2  43      1      US  Female FALSE
## 3   5662  2  3  2  3 NA  2  2  4  3   3      2  16      1      AU  Female FALSE
## 4   5795  3  3  2  3 NA  3  3  3  3   3      2  16      1      AU  Female FALSE
## 5   9739  3  3  3  3 NA  1  1  1  3   3      2  17      1      US  Female FALSE
## 6  11981  2  2  4  3 NA  1  1  4  4   4      1  18      1      US    Male FALSE
## 7  14882  2 NA NA  2 NA  2 NA  3 NA   3      1  43      2      GB    Male FALSE
## 8  15907 NA NA NA NA NA NA NA NA NA  NA      2  20      1      CA  Female  TRUE
## 9  17121  3  3  2  2 NA  2  3  3  2   3      1  40      1      CA    Male FALSE
## 10 19533  2  3  3  4 NA  2  2  3  4   3      2  18      1      US  Female FALSE
## 11 20869  3  3  3  3 NA  2  2  3  3   2      2  40      2      CA  Female FALSE
## 12 24616  1  2  4  2 NA  1  1  1  4   4      1  15      1      US    Male FALSE
## 13 25840  1  3  4  3 NA  3  3  3  2   2      2  17      1      IN  Female FALSE
## 14 28997  4  4  1  4 NA  3  4  1  3   2      1  44      1      US    Male FALSE
## 15 40825 NA NA NA NA NA NA NA NA NA  NA      1  56      2      GB    Male  TRUE
## 16 45165 NA NA NA NA NA NA NA NA NA  NA      1  24      2      US    Male  TRUE
# Or as a 0/1 variable
na.data$Q1.na <- is.na(na.data$Q1)*1
na.data
##    ï..ID Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Gender Age source country fgender Q1.na
## 1    986  3  3  4  3 NA  2  2  4  4   3      1  43      1      AU    Male     0
## 2   3556  4  4  1  4 NA  4  4  1  1   1      2  43      1      US  Female     0
## 3   5662  2  3  2  3 NA  2  2  4  3   3      2  16      1      AU  Female     0
## 4   5795  3  3  2  3 NA  3  3  3  3   3      2  16      1      AU  Female     0
## 5   9739  3  3  3  3 NA  1  1  1  3   3      2  17      1      US  Female     0
## 6  11981  2  2  4  3 NA  1  1  4  4   4      1  18      1      US    Male     0
## 7  14882  2 NA NA  2 NA  2 NA  3 NA   3      1  43      2      GB    Male     0
## 8  15907 NA NA NA NA NA NA NA NA NA  NA      2  20      1      CA  Female     1
## 9  17121  3  3  2  2 NA  2  3  3  2   3      1  40      1      CA    Male     0
## 10 19533  2  3  3  4 NA  2  2  3  4   3      2  18      1      US  Female     0
## 11 20869  3  3  3  3 NA  2  2  3  3   2      2  40      2      CA  Female     0
## 12 24616  1  2  4  2 NA  1  1  1  4   4      1  15      1      US    Male     0
## 13 25840  1  3  4  3 NA  3  3  3  2   2      2  17      1      IN  Female     0
## 14 28997  4  4  1  4 NA  3  4  1  3   2      1  44      1      US    Male     0
## 15 40825 NA NA NA NA NA NA NA NA NA  NA      1  56      2      GB    Male     1
## 16 45165 NA NA NA NA NA NA NA NA NA  NA      1  24      2      US    Male     1

Some functions in R will not deal with missing values automatically. Here we have the mean. Many times when fitting models in R you may get error messages and it may be because of missing values.

# Result is NA because there are NA value at Q1
mean(se$Q1)
## [1] NA
# If we want the mean of the non-missing values, we have to explicitly remove NAs.
mean(se$Q1, na.rm = TRUE)
## [1] 3.004418

9 Some examples of data cleaning

Here we will do a couple of data cleaning and transformation examples and we will introduce the operator %>% (pipe operator) which can be thought as a chain operator that takes the result of an operation and use it as input to the next function.

The %>% operator is from the magrittr package, which will automatically load with the dplyr package.

# first, look at what we have in the variable source.
table(se$source)
## 
##    1    2 
## 2313  181
# Do we have NAs? No, we dont...
table(se$source, useNA = "always")
## 
##    1    2 <NA> 
## 2313  181    0
# After creating a factor version of source we drop the original from the data. 
se <- se %>%
  mutate(sc = factor(source, labels = c("Front Page","Google"))) %>%
  select(-source) # a minus sign will remove the variable.

### This is how we would do it without the %>% operator.
#se <- mutate(se, sc = factor(source, labels = c("Front Page","Google")))
#se <- select(se, -source)

### Or we could also embed the two lines into one.
#se <- select(mutate(se, sc = factor(source, labels = c("Front Page","Google"))), -source)


# In order to create a total self-esteem score we need to first reverse questions Q3 Q5 Q8 Q9 and Q10, then take the average.
se <- se %>%
  mutate(Q3 = 5 - Q3,
         Q5 = 5 - Q5,
         Q8 = 5 - Q8,
         Q9 = 5 - Q9,
         Q10 = 5 - Q10)

# Now we can calculate the total score for each subject by using "rowMeans" function.
se <- se %>% mutate(total = rowMeans(.[3:12], na.rm= T))

table(se$total)
## 
##                1              1.1 1.11111111111111              1.2 
##                7               19                1               21 
##              1.3 1.33333333333333              1.4 1.44444444444444 
##               41                2               51                1 
##              1.5 1.55555555555556              1.6 1.66666666666667 
##               73                1               62                1 
##              1.7 1.77777777777778              1.8 1.88888888888889 
##               92                4              115                4 
##              1.9                2              2.1 2.11111111111111 
##              109              124              112                2 
##              2.2 2.22222222222222              2.3 2.33333333333333 
##              135                8              133                5 
##              2.4 2.44444444444444              2.5 2.55555555555556 
##              122                3              139                4 
##              2.6 2.66666666666667              2.7 2.77777777777778 
##              115                6              128                4 
##              2.8 2.88888888888889              2.9                3 
##              113                2              108              106 
##              3.1 3.11111111111111              3.2 3.22222222222222 
##               78                1               83                3 
##              3.3 3.33333333333333              3.4 3.44444444444444 
##               52                2               83                1 
##              3.5 3.55555555555556              3.6              3.7 
##               68                2               50               58 
## 3.77777777777778              3.8 
##                3               37
# Lets create two groups: High and Low Self-Esteem. 
se <- se %>%
  mutate(Low_se = factor(ifelse(total < 2,"Yes","No"), levels = c("Yes","No")))
table(se$Low_se)
## 
##  Yes   No 
##  604 1890

10 Statistical Inference

10.1 Confidence Interval and test for a Single Proportion

Let’s say low Self-Esteem is problematic and you want to estimate the proportion of individuals in the population that would require treatment for it. Let’s consider that subjects with total score below 2 require treatment. We want to estimate the proportion of subjects requiring treatment in the population. Let’s say that we know that in the US this proportion is 10% and we want to test if it is different in Canada.

# Number of subjects for whom total score  < 2
table(se$Low_se)
## 
##  Yes   No 
##  604 1890
table(se$Low_se)/nrow(se)
## 
##       Yes        No 
## 0.2421812 0.7578188
#let's store this table in object "t"
t <- table(se$Low_se)

# Confidence Interval. with a test for the proportion being different from 0.1.
prop.test(t, p = 0.15)
## 
##  1-sample proportions test with continuity correction
## 
## data:  t, null probability 0.15
## X-squared = 165.49, df = 1, p-value < 2.2e-16
## alternative hypothesis: true p is not equal to 0.15
## 95 percent confidence interval:
##  0.2255773 0.2595877
## sample estimates:
##         p 
## 0.2421812
binom.test(t, p = 0.15) # The binominal test is exact
## 
##  Exact binomial test
## 
## data:  t
## number of successes = 604, number of trials = 2494, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.15
## 95 percent confidence interval:
##  0.2254780 0.2594919
## sample estimates:
## probability of success 
##              0.2421812

10.2 2 X 2 Tables - Chi-Square and Fisher Test

In the scenario above we calculated only one proportion and wanted to test it against a fixed value. But it is more often the case that we have more than one proportion and wwant to compare them, in particulat to see if there is an association between two categorical variables. For example, we may want to take a look at the proportion among Male and Female and we may want to estimate and test the difference.

# Point estimate for the proportion
prop.table(table(se$fgender, se$Low_se),1)
##         
##                Yes        No
##   Male   0.2349785 0.7650215
##   Female 0.2464789 0.7535211
t <- table(se$fgender, se$Low_se)

# Test and confidence interval for the Risk Difference (Chi-square test)
p <- prop.test(t)
p
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  t
## X-squared = 0.36032, df = 1, p-value = 0.5483
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.04696473  0.02396406
## sample estimates:
##    prop 1    prop 2 
## 0.2349785 0.2464789
# Here we can get the actual risk difference, which is not given by prop.test.
p$estimate[1] - p$estimate[2]
##      prop 1 
## -0.01150033
# Popular Chi-Square Test
t
##         
##           Yes   No
##   Male    219  713
##   Female  385 1177
chisq.test(t)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  t
## X-squared = 0.36032, df = 1, p-value = 0.5483
# Odds Ratio and Fisher's Exact test
fisher.test(t)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  t
## p-value = 0.5302
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.7722659 1.1402167
## sample estimates:
## odds ratio 
##  0.9390173

Risk Difference is a measure of association for which we get a confidence interval in prop.test. It is just the risk of low Self-Esteem for Females (24.6%) minus the risk for Males (23.5%).

Odds Ratio is another more popular measure of association calculated as the Odds of low self-esteem for male divided by the odds of low self-esteem for females. The odds for Female is 24.65%/75.35% = 0.327 and for male is 23.5%/73.5%=0.307. So, the Odds Ratio is 0.307/0.327 = 0.939, meaning that the odds of low self-esteem is 6% lower for males as compared to females. When using the fisher.test function we get the estimate for the Odds Ratio and a confidence interval for it.

These are measures of association and they are important because they ways in which we can interpret and make sense of the difference between Male and Female. If, however, we want to decide whether or not the difference is real, meaning, Males have lower prevalence of low self-esteem than Females, then we can conduct a statistical test and get a p-value. THe fisher.test function returns the Fisher’s Exact test and it is safe to use in any case. The chisq.test and prop.test return the Chi-square test and they may be imprecise for small sample size, although they are probably more popular then Fisher’s test.

10.3 Comparing more than two proportions - Chi-Square and Fisher Test

And here we replace Sex by Country. In this case, because our sample size is large and the table has many cells, the Exact Fisher’s test is very computer intensive and we may need to use simulations.

t <- table(se$country, se$Low_se)
t
##     
##       Yes   No
##   AU   40  126
##   CA   49  150
##   GB  129  337
##   IN   12   86
##   US  374 1191
prop.table(t,margin = 1)
##     
##            Yes        No
##   AU 0.2409639 0.7590361
##   CA 0.2462312 0.7537688
##   GB 0.2768240 0.7231760
##   IN 0.1224490 0.8775510
##   US 0.2389776 0.7610224
# Tests for Overall Association
chisq.test(t)
## 
##  Pearson's Chi-squared test
## 
## data:  t
## X-squared = 10.809, df = 4, p-value = 0.0288
# fisher.test(t) # We get error bacause compuation demands too much memory.
fisher.test(t, simulate.p.value = TRUE, B = 100000) 
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  1e+05 replicates)
## 
## data:  t
## p-value = 0.02168
## alternative hypothesis: two.sided
prop.table(t,margin = 1)
##     
##            Yes        No
##   AU 0.2409639 0.7590361
##   CA 0.2462312 0.7537688
##   GB 0.2768240 0.7231760
##   IN 0.1224490 0.8775510
##   US 0.2389776 0.7610224

Both Chi-Square and Fisher’s Exact test function will perform a single test for the full table. If you dont want to conduct many tests, you have the option to try to interpret the table in general descriptively. For example, in this case we could go ahead and interpret the table. By just looking at the row proportion table we could interpret the association by saying that the risk of low self-esteem seems lower in India than in the other countries.

10.4 One Sample t-test

The one sample t-test is used when we want to compare the mean of a continuous variable with a fixed value. Here we look at our Self-Esteem score, which ranges from 1 to 4, and test if its mean is different from 2. This test is not very common because most of the time we want to compare means between two groups instead of comparing a mean with a fixed value. The t.test, when given only one variable will do the one sample t-test and return also estimation results (mean and confidence interval).

t.test(se$total, mu = 2)
## 
##  One Sample t-test
## 
## data:  se$total
## t = 35.606, df = 2493, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
##  2.443500 2.495197
## sample estimates:
## mean of x 
##  2.469349
##### Only for Canada
# se$total is a column. We select only the elments in the positions se$country = CA.
t.test(se$total[se$country == "CA"], mu = 2)
## 
##  One Sample t-test
## 
## data:  se$total[se$country == "CA"]
## t = 9.946, df = 198, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
##  2.361247 2.539925
## sample estimates:
## mean of x 
##  2.450586
# Or we can use 'filter' and 'select' function from dplyr package and the pipe operator.
se %>% filter(country == "CA") %>% select(total) %>% t.test(mu = 2)
## 
##  One Sample t-test
## 
## data:  .
## t = 9.946, df = 198, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
##  2.361247 2.539925
## sample estimates:
## mean of x 
##  2.450586
# Show the variables involved in the calculation.
se %>% select(country, total) %>% head(20)
##    country    total
## 1       US 2.500000
## 2       US 3.100000
## 3       CA 2.400000
## 4       US 2.500000
## 5       US 1.700000
## 6       US 3.000000
## 7       US 2.600000
## 8       US 3.100000
## 9       US 2.400000
## 10      US 3.200000
## 11      US 3.700000
## 12      GB 2.888889
## 13      US 3.700000
## 14      US 1.500000
## 15      CA 3.300000
## 16      US 2.100000
## 17      US 2.700000
## 18      US 3.200000
## 19      US 3.600000
## 20      GB 2.600000

10.5 Two sample t-test

Used when the goal is to compare the means of two independent sample. Usually independent means two groups of subjects where the subjects are not the same in both groups.

Entering the data for the t.test function can look a bit tricky. One way of doing it is by making use of the formula notation in R, which specifies the dependent variable (continuous variable), a tilde (~) and the independent variable (grouping variable). A perhaps easier way to understand is to just give t.test the two vectors which means we want to compare.

# Using the dep ~ indep formula notation.
t.test(total ~ fgender, data = se)
## 
##  Welch Two Sample t-test
## 
## data:  total by fgender
## t = 0.67014, df = 1921.5, p-value = 0.5028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03538988  0.07212909
## sample estimates:
##   mean in group Male mean in group Female 
##             2.480854             2.462484
# Given the function the vectors we want to compare
t.test(se$total[se$fgender == "Female"], 
       se$total[se$fgender == "Male"])
## 
##  Welch Two Sample t-test
## 
## data:  se$total[se$fgender == "Female"] and se$total[se$fgender == "Male"]
## t = -0.67014, df = 1921.5, p-value = 0.5028
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07212909  0.03538988
## sample estimates:
## mean of x mean of y 
##  2.462484  2.480854

The t.test function will give us estimation results (Means for each group and confidence interval for the mean difference) as well as testing results (t-value, df and p-value).

The traditional t-test for two independent samples will assume equal variance in both samples, and this assumption is easily violated. So, the t.test in R will by default not make that assumption, it will adjust the t-test so that it can work well with unequal variances. For that reason, it is called Welch t-test.

10.6 Paired t-test

The paired t-test is used when the means to be compared are collected in the same sample (usually in the same patients). The most common example of this situation at CAMH are pre-post samples where you want to compare the mean at pre and post (that is, the change over time). It can also happen when two raters rate the same patient and you want to compare their mean rating. In order to demonstrate the test we will compare the mean of Q1 and Q2, which are two questions completed by the same subjects, so we need paired t-test in order to compare them.

t.test(se$Q1, se$Q2, paired = TRUE)
## 
##  Paired t-test
## 
## data:  se$Q1 and se$Q2
## t = -6.3066, df = 2482, p-value = 3.366e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.10400873 -0.05467029
## sample estimates:
## mean of the differences 
##             -0.07933951
# Remember that you can always store your results in an object
t <- t.test(se$Q1, se$Q2, paired = TRUE)
# then retrieve parts of it.
ls(t)           # ls tells us the parts t has
##  [1] "alternative" "conf.int"    "data.name"   "estimate"    "method"     
##  [6] "null.value"  "p.value"     "parameter"   "statistic"   "stderr"
t$p.value       # then we get the one we want
## [1] 3.366146e-10

The way the paired t-test works is that it creates a third variable Q1 - Q2 and use the one sample t-test to test if this difference is zero. So, here we see that the mean of the difference is -0.079, that is, the mean of Q2 is larger than the mean of Q1. We are also given a confidence interval for this mean and the results of statistical test.

The other thing about the t-test is that it assumes normality of the data. This assumption is more important for small sample sizes but tend to be less of a problem for larger samples.

10.7 Normality test

Normality is not a hugely important assumption, meaning that tests are usually good approximation when data is not normaly distributed. If you have a large sample size, meaningless departures from normality will be significant. If you have a small sample size, non-significant normality tests are not evidence for normality (but lack of evidence for non-normality).

# visual inspection
hist(se$Q1 - se$Q2) # normality for the paired t-test

hist(se$total)      # normality for independent sample t-test that uses total score.

# actual statistical test for normality
shapiro.test(se$total)
## 
##  Shapiro-Wilk normality test
## 
## data:  se$total
## W = 0.983, p-value < 2.2e-16

10.8 ANOVA/ Regression

A0NOVA stands for Analysis of Variance. It is usually applied when we want to compare means across three or more groups, although in cases when there are only two groups it reduces to the two independent sample t-test.

Regression models are just an extension of ANOVA that allows you to control for other variables if you want to.

In order to ensure you account for unbalanced data as well as use the correct sum of square, ANOVA is best done in R in two steps. Step 1 is to fit a regression model and the second step is to use that model result to run the ANOVA. Although base R has the functions anova and aov, they will provide the type I sum of square which is usually not what you want. The Anova function in the car package allows you to replicate analyses that you would do in SPSS or SAS.

NOTE - The anova function in base R gives you type II tests, not type III tests. Type II tests will give you the smae results as type III tests if you have only one predictor (we only have country), but it will be different ifyou have more than one predictor.

# First we fit a regression model.
# dependent variable is always the continuous variable
# independent variable is always the categorical variable. 
m <- lm(total ~ country, data = se)
summary(m)  # This will show you a summary of the regression results.
## 
## Call:
## lm(formula = total ~ country, data = se)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.70159 -0.49592  0.00408  0.49841  1.45460 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.452209   0.050810  48.262  < 2e-16 ***
## countryCA   -0.001623   0.068813  -0.024  0.98119    
## countryGB   -0.106811   0.059172  -1.805  0.07118 .  
## countryIN    0.249378   0.083395   2.990  0.00281 ** 
## countryUS    0.043709   0.053437   0.818  0.41346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6546 on 2489 degrees of freedom
## Multiple R-squared:  0.01265,    Adjusted R-squared:  0.01107 
## F-statistic: 7.974 on 4 and 2489 DF,  p-value: 2.198e-06
# the Regression results is stored in the object m which is fed to the Anova function from the "car" package - installing but not loading car package.
# install.packages("car")
# Anova(m,type = 3)           # Result in an error if you dont load car.
a <- car::Anova(m, type = 3)
a
## Anova Table (Type III tests)
## 
## Response: total
##              Sum Sq   Df   F value    Pr(>F)    
## (Intercept)  998.21    1 2329.2286 < 2.2e-16 ***
## country       13.67    4    7.9736 2.198e-06 ***
## Residuals   1066.68 2489                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# If we load the package car, then we can use Anova directly.
library(car)
Anova(m,type = 3)
## Anova Table (Type III tests)
## 
## Response: total
##              Sum Sq   Df   F value    Pr(>F)    
## (Intercept)  998.21    1 2329.2286 < 2.2e-16 ***
## country       13.67    4    7.9736 2.198e-06 ***
## Residuals   1066.68 2489                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# in this case, with only one predictor, we could use the anova function from base R.
anova(m)
## Analysis of Variance Table
## 
## Response: total
##             Df  Sum Sq Mean Sq F value    Pr(>F)    
## country      4   13.67  3.4172  7.9736 2.198e-06 ***
## Residuals 2489 1066.68  0.4286                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that the variable country has a low p-value. That means that we have good evidence of differences between the five countries in average Self Esteem score. What naturally happens after such results is that one will be interested in details about the difference between the three groups. The first step is to get the average Self Esteem for each country. We can get that from the regression model directly by using estimated marginal means (means for the outcome controlled by other variables in the model, which in this case are none).

# install.packages("emmeans")
ls <- emmeans::emmeans(m, specs =  ~ country)
ls
##  country emmean     SE   df lower.CL upper.CL
##  AU        2.45 0.0508 2489     2.35     2.55
##  CA        2.45 0.0464 2489     2.36     2.54
##  GB        2.35 0.0303 2489     2.29     2.40
##  IN        2.70 0.0661 2489     2.57     2.83
##  US        2.50 0.0165 2489     2.46     2.53
## 
## Confidence level used: 0.95

We see the estimated means for each country, their Standard Errors and Confidence Intervals. A simple way to test all pairwise differences between groups is to use the pairwise.t.test function.

pairwise.t.test(se$total, se$country, p.adj = "bonf")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  se$total and se$country 
## 
##    AU      CA      GB      IN     
## CA 1.00000 -       -       -      
## GB 0.71182 0.57885 -       -      
## IN 0.02814 0.01912 1e-05   -      
## US 1.00000 1.00000 0.00014 0.02578
## 
## P value adjustment method: bonferroni
# or from the model - this is advantageous if we have other variables in the model
ls <- emmeans::emmeans(m,specs = pairwise ~ country, adj = "bonferroni")
ls
## $emmeans
##  country emmean     SE   df lower.CL upper.CL
##  AU        2.45 0.0508 2489     2.32     2.58
##  CA        2.45 0.0464 2489     2.33     2.57
##  GB        2.35 0.0303 2489     2.27     2.42
##  IN        2.70 0.0661 2489     2.53     2.87
##  US        2.50 0.0165 2489     2.45     2.54
## 
## Confidence level used: 0.95 
## Conf-level adjustment: bonferroni method for 5 estimates 
## 
## $contrasts
##  contrast estimate     SE   df t.ratio p.value
##  AU - CA   0.00162 0.0688 2489  0.024  1.0000 
##  AU - GB   0.10681 0.0592 2489  1.805  0.7118 
##  AU - IN  -0.24938 0.0834 2489 -2.990  0.0281 
##  AU - US  -0.04371 0.0534 2489 -0.818  1.0000 
##  CA - GB   0.10519 0.0554 2489  1.897  0.5788 
##  CA - IN  -0.25100 0.0808 2489 -3.107  0.0191 
##  CA - US  -0.04533 0.0493 2489 -0.920  1.0000 
##  GB - IN  -0.35619 0.0728 2489 -4.896  <.0001 
##  GB - US  -0.15052 0.0345 2489 -4.357  0.0001 
##  IN - US   0.20567 0.0682 2489  3.017  0.0258 
## 
## P value adjustment: bonferroni method for 10 tests

The Tukey test is less rigid than Bonferroni, so it is also very popular. It needs an object of the class aov, which is also an analysis of variance object. We can see that the p-values are lower reflecting the fact that they have been through an adjustment that is less conservative.

TukeyHSD(aov(m))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = m)
## 
## $country
##               diff         lwr         upr     p adj
## CA-AU -0.001622571 -0.18946701  0.18622187 0.9999999
## GB-AU -0.106810647 -0.26833724  0.05471594 0.3707557
## IN-AU  0.249378466  0.02172862  0.47702832 0.0235335
## US-AU  0.043708808 -0.10216248  0.18958009 0.9251862
## GB-CA -0.105188077 -0.25651765  0.04614150 0.3188820
## IN-CA  0.251001037  0.03046928  0.47153279 0.0163765
## US-CA  0.045331378 -0.08916112  0.17982388 0.8892436
## IN-GB  0.356189114  0.15759515  0.55478308 0.0000103
## US-GB  0.150519455  0.05621389  0.24482502 0.0001337
## US-IN -0.205669659 -0.39175339 -0.01958593 0.0216836

10.9 Non-Parametric tests

Non-parametric test are so called because they don’t use distributional assumptions. The Normal distribution is a parametric distribution so that if tests don’t assume normality, or they don’t make distributional assumptions, they are called non-parametric. In general these tests convert the original data into ranks so that they are in general considered to be appropriate for ordinal data and they are robust to outliers in the data.

Follow the non-parametric tests for comparing two independent groups and two dependent groups, which are often used as replacements for respectively independent sample and paired sample t test.

# Test for Independent samples (Mann-Whitney U test)
wilcox.test(total ~ fgender, data = se, conf.int = TRUE)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  total by fgender
## W = 740154, p-value = 0.4806
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -2.223579e-05  9.995606e-02
## sample estimates:
## difference in location 
##           5.492816e-05
# Test for paired sample (Wilcox signed rank test)
wilcox.test(se$Q1, se$Q2, paired = TRUE, conf.int = TRUE)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  se$Q1 and se$Q2
## V = 116395, p-value = 2.831e-10
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
##  -6.463594e-07 -3.683559e-05
## sample estimates:
## (pseudo)median 
##  -2.181193e-05

The test for independent sample is known as Mann-Whitney U test, and the paired test is known as Wilcoxon signed rank test. These tests don’t compare the means, but the overall distribution of the data.

10.10 Non-parametric ANOVA (Kruskal-Wallis)

ANOVA assumes normality and equal variance in all groups. If the outcome departs from normal or is in a ordinal scale, you may want to use a non-parametric test equivalent to ANOVA. The Kruskal-Wallis test convert the outcome into ranks and do the calculation with the in a way that does not need the assumptions of ANOVA.

Usually when one finds evidence of association, pairwise non parametric comparison is conducted using the Mann-Whitney U test.

kruskal.test(total ~ country, data = se)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  total by country
## Kruskal-Wallis chi-squared = 33.376, df = 4, p-value = 1.001e-06
# all pairwise comparison with False Discovery Rate adjustment. We coould have used Bonferroni as well.
pairwise.wilcox.test(se$total, se$country, data = se, p.adjust.method = "fdr" )
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  se$total and se$country 
## 
##    AU      CA      GB      IN     
## CA 0.98687 -       -       -      
## GB 0.08145 0.08145 -       -      
## IN 0.00192 0.00192 7e-07   -      
## US 0.52253 0.51333 0.00015 0.00192
## 
## P value adjustment method: fdr

10.11 Pearson Correlation

Correlation coefficient is used to measure the linear association between two continuous variable. It varies between -1 and 1 with values close to zero indicating lack of association, values close to 1 indicating highly positive association and close to -1, highly negative.

cor(se[3:12], use = "pairwise.complete.obs")
##                Q2          Q3         Q4          Q5          Q6          Q7
## Q2      1.0000000  0.50886718  0.5824985  0.54858163  0.59278706  0.55373154
## Q3      0.5088672  1.00000000  0.4898404  0.66506166  0.61503561  0.61405765
## Q4      0.5824985  0.48984037  1.0000000  0.48063758  0.51908844  0.50182766
## Q5      0.5485816  0.66506166  0.4806376  1.00000000  0.59606873  0.61730602
## Q6      0.5927871  0.61503561  0.5190884  0.59606873  1.00000000  0.75473893
## Q7      0.5537315  0.61405765  0.5018277  0.61730602  0.75473893  1.00000000
## Q8      0.3335951  0.45655273  0.3315935  0.41761478  0.52079126  0.47708643
## Q9      0.4488153  0.59745471  0.4152896  0.55796456  0.55565990  0.53432898
## Q10     0.5050354  0.62206554  0.4688069  0.60831636  0.61844974  0.59575642
## Gender -0.1341288 -0.03088868 -0.1163173 -0.01489611 -0.08099194 -0.03587734
##                 Q8          Q9         Q10      Gender
## Q2      0.33359506  0.44881534  0.50503537 -0.13412875
## Q3      0.45655273  0.59745471  0.62206554 -0.03088868
## Q4      0.33159352  0.41528963  0.46880692 -0.11631735
## Q5      0.41761478  0.55796456  0.60831636 -0.01489611
## Q6      0.52079126  0.55565990  0.61844974 -0.08099194
## Q7      0.47708643  0.53432898  0.59575642 -0.03587734
## Q8      1.00000000  0.49516731  0.51743618 -0.06300764
## Q9      0.49516731  1.00000000  0.73454470 -0.07914198
## Q10     0.51743618  0.73454470  1.00000000 -0.08442783
## Gender -0.06300764 -0.07914198 -0.08442783  1.00000000
cor.test(se$Q1, se$Q2, use = "pairwise.complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  se$Q1 and se$Q2
## t = 52.381, df = 2481, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7054453 0.7428337
## sample estimates:
##       cor 
## 0.7246724

10.12 Non-Parametric Correlation

Pearson Correlation assumes that the data is continuous and normally distributed. On the other hand, the Spearman correlation is based in ranks and does not make assumptions about the distribution of the data.

# Non-parametric spearman correlation using Q1 to Q10 and Total columns.
cor(se[c(2:11,17)], use = "pairwise.complete.obs", method = "spearman")
##              Q1        Q2        Q3        Q4        Q5        Q6        Q7
## Q1    1.0000000 0.7228698 0.5561590 0.5770494 0.5539289 0.6420925 0.5921809
## Q2    0.7228698 1.0000000 0.5089362 0.5739217 0.5465199 0.5970203 0.5538785
## Q3    0.5561590 0.5089362 1.0000000 0.4881467 0.6669840 0.6162936 0.6147795
## Q4    0.5770494 0.5739217 0.4881467 1.0000000 0.4801464 0.5154190 0.4951914
## Q5    0.5539289 0.5465199 0.6669840 0.4801464 1.0000000 0.5991582 0.6206156
## Q6    0.6420925 0.5970203 0.6162936 0.5154190 0.5991582 1.0000000 0.7525397
## Q7    0.5921809 0.5538785 0.6147795 0.4951914 0.6206156 0.7525397 1.0000000
## Q8    0.4177706 0.3461277 0.4528411 0.3357588 0.4111551 0.5176401 0.4689061
## Q9    0.5136202 0.4652527 0.6061451 0.4213012 0.5635624 0.5623961 0.5403360
## Q10   0.5656571 0.5145294 0.6261112 0.4701251 0.6075385 0.6200533 0.5970020
## total 0.7301977 0.7074153 0.8078851 0.6598209 0.7969677 0.8295744 0.8140481
##              Q8        Q9       Q10     total
## Q1    0.4177706 0.5136202 0.5656571 0.7301977
## Q2    0.3461277 0.4652527 0.5145294 0.7074153
## Q3    0.4528411 0.6061451 0.6261112 0.8078851
## Q4    0.3357588 0.4213012 0.4701251 0.6598209
## Q5    0.4111551 0.5635624 0.6075385 0.7969677
## Q6    0.5176401 0.5623961 0.6200533 0.8295744
## Q7    0.4689061 0.5403360 0.5970020 0.8140481
## Q8    1.0000000 0.4831676 0.5099870 0.6467802
## Q9    0.4831676 1.0000000 0.7416450 0.7810573
## Q10   0.5099870 0.7416450 1.0000000 0.8318725
## total 0.6467802 0.7810573 0.8318725 1.0000000
# Alternatively we could use chained operations. Here we store the correlation in object a.
a <- se %>% select(Q1:Q10,total) %>% cor(use = "pairwise.complete.obs", method = "spearman")

# Test whether correlation is significant
cor.test(se$Q1, se$Q2, use = "pairwise.complete.obs", method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  se$Q1 and se$Q2
## S = 707070552, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7228698
# The corrplot package allows you to create nice plots with correlation matrices.
# install.packages("corrplot")
corrplot::corrplot(a)

11 Linear Regression Models

An example of using function lm in base R to run a linear regression model. The operator “:” indicates an interaction in the model and the operator "*" indicates the interaction plus the lower levels main effects.

# Store regression results in object m1.
m1 <- lm(total ~ fgender + country + fgender:sc + Age*sc, data = se)
# Look at the results
summary(m1)
## 
## Call:
## lm(formula = total ~ fgender + country + fgender:sc + Age * sc, 
##     data = se)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.78638 -0.46801 -0.02994  0.46986  1.57929 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.0786067  0.0593794  35.006  < 2e-16 ***
## fgenderFemale           0.0125924  0.0271819   0.463  0.64321    
## countryCA              -0.0261321  0.0662529  -0.394  0.69330    
## countryGB              -0.1347730  0.0571626  -2.358  0.01847 *  
## countryIN               0.2594851  0.0804367   3.226  0.00127 ** 
## countryUS               0.0265702  0.0514879   0.516  0.60587    
## Age                     0.0139022  0.0009885  14.063  < 2e-16 ***
## scGoogle                0.0537717  0.1521285   0.353  0.72377    
## fgenderFemale:scGoogle  0.0599370  0.1030846   0.581  0.56100    
## scGoogle:Age           -0.0066610  0.0042298  -1.575  0.11544    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6298 on 2484 degrees of freedom
## Multiple R-squared:  0.088,  Adjusted R-squared:  0.0847 
## F-statistic: 26.63 on 9 and 2484 DF,  p-value: < 2.2e-16
# Test each effect using omnibus ANOVA table. 
car::Anova(m1, type = 3)
## Anova Table (Type III tests)
## 
## Response: total
##             Sum Sq   Df   F value    Pr(>F)    
## (Intercept) 486.05    1 1225.3858 < 2.2e-16 ***
## fgender       0.09    1    0.2146    0.6432    
## country      16.13    4   10.1683 3.646e-08 ***
## Age          78.45    1  197.7790 < 2.2e-16 ***
## sc            0.05    1    0.1249    0.7238    
## fgender:sc    0.13    1    0.3381    0.5610    
## sc:Age        0.98    1    2.4799    0.1154    
## Residuals   985.28 2484                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12 Logistic Regression Models

Here we use function glm to fit a logistic regression model to a binary outcome in order to model the probability of having low self esteem. It is all similar to the linear model, except that here we need to specify the distribution binomial for the logistic regression model.

m2 <- glm(Low_se ~ fgender + country + fgender:sc + Age*sc, data = se,
          family = binomial)
summary(m2)
## 
## Call:
## glm(formula = Low_se ~ fgender + country + fgender:sc + Age * 
##     sc, family = binomial, data = se)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6859   0.2442   0.6053   0.8457   1.2310  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -0.198696   0.237550  -0.836   0.4029    
## fgenderFemale           0.007440   0.105501   0.071   0.9438    
## countryCA              -0.104578   0.252363  -0.414   0.6786    
## countryGB              -0.300888   0.215844  -1.394   0.1633    
## countryIN               0.819475   0.362750   2.259   0.0239 *  
## countryUS              -0.042171   0.196328  -0.215   0.8299    
## Age                     0.055684   0.005303  10.501   <2e-16 ***
## scGoogle               -0.212637   0.550119  -0.387   0.6991    
## fgenderFemale:scGoogle  0.590868   0.361264   1.636   0.1019    
## scGoogle:Age           -0.021157   0.016767  -1.262   0.2070    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2761.3  on 2493  degrees of freedom
## Residual deviance: 2590.1  on 2484  degrees of freedom
## AIC: 2610.1
## 
## Number of Fisher Scoring iterations: 5
car::Anova(m2, type = 3)
## Analysis of Deviance Table (Type III tests)
## 
## Response: Low_se
##            LR Chisq Df Pr(>Chisq)    
## fgender       0.005  1   0.943785    
## country      14.954  4   0.004797 ** 
## Age         147.591  1  < 2.2e-16 ***
## sc            0.150  1   0.698336    
## fgender:sc    2.667  1   0.102432    
## sc:Age        1.509  1   0.219331    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

13 Mixed Models

Mixed Models are often used at CAMH when we have repeated measure data. Such data is called longitudinal data if the repeated measures are over time, as when we follow subjects for a time measuring them at different time points. Other times like when we measure multiple brain regions in the same subject, the repeated measure is not in time. Statistically, both situations are similar as the repeated measurements within the same sample unit (usually subjects) causes the data to have a clustering structure where measures within the cluster are more similar to each otehr (correlated) as compared to measures from different clusters.

The subjects will usually be the clusters at CAMH. In that sense, as an example, depression measures within the same patient at different time points will be correlated. This correlation needs to be accounted for in regression model and the simplest way to do that is by declaring the clusters (subjects) as random effects.

In our example dataset, we don’t have repeated measures within the same subject (a metric is only measured once within each subject), but we have countries. Countries can be seen as clusters in that measures within a country will tend to be more similar than measures across countries. This also happens when we collect data from patients over multiple centers: the centers are usually considered clusters and modeled as random effects.

The function lmer from package lme4 is the most popular way to run mixed models in R. Here we have the same model as above, except that we specify country as random effect (clusters). SInce p-values for mixed models are a bit controversion, the implementation of lme4 package does not calculate p-values. However, if you use the lmerTest package, the same function lmer will followed by anova will give you p-values.

Notice that in a dataset like this you should control for countries either by declaring them as random effects or fixed effects (just add them to the linear model above). The definition of whether something should be specified as random effect is a bit complicated and without much consensus. Usually, if a variable can be seen as clustering structure (units within it will be more similar than units across it) or sampling clusters (if you sample schools, then students within schools, then schools are clusters), then you declare it as random effect. But this is a very basic rule as this workshop does not intend to teach mixed models.

m3 <- lme4::lmer(total ~ fgender + fgender:sc + Age*sc +
                   (1|country), data = se)
summary(m3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: total ~ fgender + fgender:sc + Age * sc + (1 | country)
##    Data: se
## 
## REML criterion at convergence: 4816.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.83616 -0.74271 -0.04749  0.74460  2.49457 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  country  (Intercept) 0.01691  0.1300  
##  Residual             0.39669  0.6298  
## Number of obs: 2494, groups:  country, 5
## 
## Fixed effects:
##                          Estimate Std. Error t value
## (Intercept)             2.0986130  0.0698573  30.041
## fgenderFemale           0.0115446  0.0271742   0.425
## Age                     0.0138729  0.0009884  14.035
## scGoogle                0.0532503  0.1521219   0.350
## fgenderFemale:scGoogle  0.0602356  0.1030779   0.584
## scGoogle:Age           -0.0066403  0.0042294  -1.570
## 
## Correlation of Fixed Effects:
##             (Intr) fgndrF Age    scGogl fgnF:G
## fgenderFeml -0.269                            
## Age         -0.415  0.082                     
## scGoogle    -0.130  0.127  0.191              
## fgndrFml:sG  0.072 -0.264 -0.020 -0.534       
## scGoogle:Ag  0.099 -0.021 -0.232 -0.838  0.117
# No p-value and Type II test.
anova(m3)
## Analysis of Variance Table
##            npar Sum Sq Mean Sq  F value
## fgender       1  0.149   0.149   0.3744
## Age           1 78.262  78.262 197.2892
## sc            1  1.479   1.479   3.7279
## fgender:sc    1  0.237   0.237   0.5970
## sc:Age        1  0.978   0.978   2.4650
# To get type 3 test we use lmer function from lmerTest package
m3 <- lmerTest::lmer(total ~ fgender + fgender:sc + Age*sc +
                   (1|country), data = se)
summary(m3)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: total ~ fgender + fgender:sc + Age * sc + (1 | country)
##    Data: se
## 
## REML criterion at convergence: 4816.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.83616 -0.74271 -0.04749  0.74460  2.49457 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  country  (Intercept) 0.01691  0.1300  
##  Residual             0.39669  0.6298  
## Number of obs: 2494, groups:  country, 5
## 
## Fixed effects:
##                          Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)             2.099e+00  6.986e-02  6.043e+00  30.041 8.25e-08 ***
## fgenderFemale           1.154e-02  2.717e-02  2.486e+03   0.425    0.671    
## Age                     1.387e-02  9.884e-04  2.485e+03  14.035  < 2e-16 ***
## scGoogle                5.325e-02  1.521e-01  2.484e+03   0.350    0.726    
## fgenderFemale:scGoogle  6.024e-02  1.031e-01  2.485e+03   0.584    0.559    
## scGoogle:Age           -6.640e-03  4.229e-03  2.485e+03  -1.570    0.117    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) fgndrF Age    scGogl fgnF:G
## fgenderFeml -0.269                            
## Age         -0.415  0.082                     
## scGoogle    -0.130  0.127  0.191              
## fgndrFml:sG  0.072 -0.264 -0.020 -0.534       
## scGoogle:Ag  0.099 -0.021 -0.232 -0.838  0.117
anova(m3)
## Type III Analysis of Variance Table with Satterthwaite's method
##            Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)    
## fgender    0.2594  0.2594     1 2484.3  0.6540    0.4188    
## Age        9.8647  9.8647     1 2485.1 24.8679 6.563e-07 ***
## sc         0.1583  0.1583     1 2484.4  0.3990    0.5277    
## fgender:sc 0.1355  0.1355     1 2484.7  0.3415    0.5590    
## sc:Age     0.9778  0.9778     1 2484.9  2.4650    0.1165    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1