---
title: "Basic Tests"
author: "Marcos Sanches"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
---
```{r setup, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print=100)
opts_chunk$set(echo = TRUE,
               cache = TRUE,
               prompt = FALSE,
               tidy = TRUE,
               comment = NA,
               message = FALSE,
               warning = FALSE)
opts_knit$set(width=75)
```
# Objective
This workshop will start with some slides on statistical inference, then we will proceed with some basic tests in R.
**Statistical Inference** - Statistical tests and estimation are key in statistical analysis. This is what allows you to claim that your results are valid beyond your sample of subjects. If you find that a treatment works in a clinical trial, statistical inference is what allows you to be confident that the treatment should work for anyone with the condition, not only the subjects you studied.
It is important to understand the meaning of statistical inference and of terms often used with it, such as sample, p-value and confidence interval. Statistical analysis is more than just being able to run it in R.
Below you have a series of basic tests in R, which we will probably not be able to cover fully in the workshop, since we will spend some time with statistical inference. But we provide the material with the hope that you are able to go over it on your own and play around with it to learn the script.
The goal today is to introduce you to the concepts of statistical inference and to basic statistical tests, as well as the interpretation of results in terms of p-values and confidence intervals.
# A new RStudio project
Here are the steps to create a new RStudio project:
1. Create a folder for each data analysis project.
2. Open RStudio.
3. Go to File > New Project > Existing Directory.
4. Navigate to the folder you created.
5. Click Create Project. A ".Rproj" file will be created in the folder. Next time, double-click this file to open your project.
6. Go to File > New File > R Markdown (or R Script) and click OK.
7. Save the new R Markdown file. Here you will store your R script.
8. Make sure you store all your scripts. Ideally, the script should contain every single step, from reading the data to the final results.
# Cleaning R environment
Again we clean the R environment. We can do that because we are confident the script we create will reproduce all relevant analyses; we don't need to keep any datasets or analysis results other than the original dataset!
```{r}
rm(list = ls())
```
# Reading a SPSS dataset
For today's workshop we will use the SPSS dataset "Headache Wide.sav". We read it with package *haven*.
This data is from a randomized crossover study comparing two analgesic drugs (A and B)
and placebo (P) for relief of tension headaches.
Reference: Laird, N.M., Skinner, J. and Kenward, M. (1992). An analysis of
two-period crossover designs with carry-over effects. Statistics in Medicine,
11, 1967-1979.
Description:
The study was designed to compare two active drugs and placebo for relief of tension headache. The two analgesic drugs were identical in their active ingredients except that one contained caffeine. The primary comparison of interest was between the two active drugs; the placebo was included for regulatory purposes.
Note that there were three treatments, but only two periods, i.e., each subject received only two of the three treatments in a random order. With three treatments and two periods, there are six possible treatment sequences, AB, BA, AP, PA, BP, PB, where A, B and P denote the two active drugs and placebo. In this study the AB and BA sequences were assigned three times as many subjects as the remaining four because of the interest in the A versus B comparison.
Note: Two headaches were treated within each period, and the response is the average pain relief across both headaches.
Note: We will use this dataset to go over different statistical tests in R; our objective is not to answer the research question of this study.
```{r}
dt <- haven::read_sav("Self Esteem Dataset - V3.sav")
names(dt)
```
# Load *tidyverse*
*tidyverse* is a set of packages conceived by *Hadley Wickham*, and you can get more information about it at [Wikipedia](https://en.wikipedia.org/wiki/Tidyverse) and in many other places on the internet.
Installing and loading *tidyverse* loads *dplyr* and *ggplot2*, among other packages.
```{r}
library(tidyverse)
```
As the data came from SPSS, we can use *as_factor* to transform some variables into factors and make interpretation easier.
```{r}
dt2 <- dt %>%
  mutate_at(vars(Q1:Gender, Age_group, ct, low.se), as_factor)
```
# Proportions
We will review a range of basic statistical tests in today's workshop.
## Single Proportion
This is for the case where you have a single proportion and want to either estimate it or test whether it differs significantly from a constant.
Let's take as an example the proportion of individuals with low self-esteem in the dataset.
```{r}
# Number of subjects with low self esteem (variable low.se = 1)
t1 <- table(dt2$low.se)
t1
prop.table(t1)
# Estimation and test of whether the proportions are equal (that is, p = 0.5).
# The functions use the proportion of the first element.
prop.test(t1)
binom.test(t1)
# Estimation and test of the proportion being equal to a given value, here 0.77.
# The prop.test function tests the first proportion only.
prop.test(t1, p = 0.77)
binom.test(t1, p = 0.77) # The binomial test is exact, so it is preferred
# If you want to estimate the proportion of low self-esteem you have to reverse
# the order of the levels. We can use the factor function to redefine low.se as
# a factor and state the labels in the order we want.
prop.test(table(factor(dt2$low.se, levels = c("Low Self-Esteem (SE<=20)",
                                              "Normal Self-Esteem (SE>20)"))))
# We could also look at 2 - as.numeric(low.se). The as.numeric function applied
# to a factor transforms it into numeric 1, 2, 3...
prop.test(table(2 - as.numeric(dt2$low.se)))
```
## Comparing two proportions
Let's think about the proportion of low self-esteem for males and females and whether they differ.
One way to do that is to look at the difference in proportions and compare it with zero:
p1 - p2 = 0
```{r}
# Point estimate for the proportion
prop.table(table(dt2$Gender, dt2$low.se),1)
# There is a "Other" level in Gender that has no sample size, we remove it.
dt2$Gender <- droplevels(dt2$Gender)
# We will also reverse the low.se labels so that always look at the proportion
# of low self esteem rather then the proportion of not low self esteem.
dt2$low.se <- factor(dt2$low.se, levels = c("Low Self-Esteem (SE<=20)",
"Normal Self-Esteem (SE>20)"))
# Roe percent - Compare % low self esteem between males and females.
prop.table(table(dt2$Gender, dt2$low.se),1)
t <- table(dt2$Gender, dt2$low.se)
# Test and confidence interval for the Risk Difference (difference of proportions)
prop.test(t)
```
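The risk difference behind *prop.test* can also be computed by hand from the 2 by 2 counts. Here is a minimal sketch with made-up counts (x1, n1, x2 and n2 are hypothetical, not the workshop data):

```{r}
# Hypothetical counts: 40 of 100 males and 55 of 120 females with low self-esteem
x1 <- 40; n1 <- 100
x2 <- 55; n2 <- 120
p1 <- x1/n1; p2 <- x2/n2
rd <- p1 - p2                                 # risk difference
se <- sqrt(p1*(1 - p1)/n1 + p2*(1 - p2)/n2)   # its standard error
c(estimate = rd, lower = rd - 1.96*se, upper = rd + 1.96*se)
# prop.test(c(x1, x2), c(n1, n2)) gives the same estimate, with a
# continuity-corrected confidence interval
```

The interval from *prop.test* is slightly wider than the hand calculation because it applies a continuity correction by default.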
A Chi-square test gives the same result for 2 by 2 tables, but the interpretation is usually different. Because the chi-square does not focus on the proportions, we think of it as testing the association between Gender and low self-esteem.
If there is an association, it means that the proportion of low self-esteem differs across genders, or that the proportion of genders differs across levels of self-esteem.
```{r}
# Popular Chi-Square Test
chisq.test(t)
```
The Fisher's test is interpreted like the Chi-square test. But in R it also provides a comparison of odds, the odds ratio. Dividing the odds of low self-esteem among males by the odds of low self-esteem among females provides a way to test whether the ratio of odds is 1, and to get a confidence interval for the odds ratio.
Odds low self esteem among males = p1 / (1 - p1)
Odds low self esteem among females = p2 / (1 - p2)
Odds Ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
```{r}
# Odds Ratio and Fisher's Exact test
fisher.test(t)
```
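The odds ratio formulas above can be checked by hand on a small table. A sketch with hypothetical counts (the matrix `m` is made up for illustration):

```{r}
# Hypothetical 2x2 counts (rows: male, female; columns: low, normal self-esteem)
m <- matrix(c(40, 60,
              55, 65), nrow = 2, byrow = TRUE)
p1 <- m[1, 1] / sum(m[1, ]) # risk of low self-esteem, males
p2 <- m[2, 1] / sum(m[2, ]) # risk of low self-esteem, females
(p1 / (1 - p1)) / (p2 / (1 - p2)) # odds ratio, as in the formulas above
# Equivalently, the cross-product of the cells gives the same number:
(m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
```

Note that *fisher.test* reports a conditional maximum likelihood estimate, so its odds ratio will differ slightly from this simple sample odds ratio.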
And another way to compare two proportions is by dividing one by the other, which we call the *relative risk*. We have the risk of low self-esteem for males and for females, and we calculate the risk ratio by dividing one by the other.
Risk of low self esteem for males = p1
Risk of low self esteem for females = p2
Relative Risk or Risk Ratio = p1/p2
Here we use the *epitools* package. In the "measure" table we get the relative risk for females relative to males.
There is a catch, though. If you look at the help for the *riskratio* function (?riskratio), you will see that it looks at the proportion in the second column, not the first. But Low Self-Esteem is the first column, so we need to specify that we want to reverse the columns. Likewise, we could reverse the rows if we wanted to look at males relative to females.
Notice that the statistical test also uses the Chi-square and Fisher's test. But here we get a confidence interval for the relative risk (estimation).
```{r}
library(epitools)
t <- table(dt2$Gender, dt2$low.se)
riskratio(t, rev = "columns")
```
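The relative risk is just the ratio of the two row proportions, so it can also be computed directly from the table. A sketch with the same kind of hypothetical counts as before (made up for illustration):

```{r}
# Hypothetical 2x2 counts, made up for illustration
m <- matrix(c(40, 60,
              55, 65), nrow = 2, byrow = TRUE,
            dimnames = list(c("Male", "Female"), c("Low", "Normal")))
p1 <- m["Male", "Low"] / sum(m["Male", ])     # risk of low self-esteem, males
p2 <- m["Female", "Low"] / sum(m["Female", ]) # risk of low self-esteem, females
p1 / p2 # relative risk, males relative to females
```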
## Larger Tables
Above we have seen 2 by 2 tables, which can be seen as comparisons of proportions. When tables have more than 2 rows or more than 2 columns, the test is more commonly referred to as a test of association.
The most popular are the Chi-square and Fisher's tests. Fisher's test is exact and is preferred when the sample size is small, particularly if there are cells with expected counts less than 5.
Here we take a look at country by age group.
```{r}
# There is a level of Age_group with zero count, so we remove it.
table(dt2$Age_group)
dt2$Age_group <- droplevels(dt2$Age_group)
# The table
t <- table(dt2$country, dt2$Age_group)
t
# The chi-square test
chisq.test(t)
# We look at the expected counts to see if any are lower than 5, in which
# case the chi-square may not work well.
cq <- chisq.test(t)
cq$expected
```
To interpret a significant table you can look at proportions. The Chi-square contribution can also help.
```{r}
# Proportions of age within country - row pct.
# We interpret the association in the table by looking at the proportions.
prop.table(t, 1)
# Here is the chi-square contribution. The very large proportion of 25 to 34
# in India is what contributes the most to the chi-square.
gmodels::CrossTable(t, prop.r = FALSE, prop.c = FALSE, prop.t = FALSE, prop.chisq = TRUE)
```
We can also test the association in the table using Fisher's exact test. If the sample or the table is large, we may need to use simulation, because Fisher's test is quite computationally intensive.
```{r}
# The Fisher's test gives an error here: the table is too big.
# fisher.test(t)
# So we use simulation with 1,000,000 replications, which should be good enough.
fisher.test(t, simulate.p.value = TRUE, B = 1000000)
```
If we want to estimate rather than test, the *DescTools* package can give us confidence intervals for the percentages in the tables.
```{r}
# Percentages of young people in each country (first column of t divided by the
# row totals of t).
DescTools::BinomCI(t[,1], rowSums(t))
# Likewise, the percentage of CA in each age group.
DescTools::BinomCI(t[2,], colSums(t))
```
Using the estimation framework, it is usually a good idea to plot results like these. Let's plot the proportion of youth for each country.
```{r}
# Calculate the proportion of youth per country and transform into a data frame.
plotdata <- DescTools::BinomCI(t[,1], rowSums(t)) %>%
  as.data.frame() %>%
  mutate(Country = rownames(t))
plotdata
ggplot(plotdata, aes(x = Country, y = est)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = lwr.ci, ymax = upr.ci), width = 0.1) +
  ylim(0, 0.7) + # it is always good to start the y axis at zero
  ylab("% < 25 years old") +
  theme_bw()
```
# Means
The tests above are for proportions; now we will look at some tests for means.
## One sample t-test
Let's say we want to test whether the mean self-esteem is different from 25. The *t.test* function can test that, and also gives you a confidence interval for the mean.
```{r}
# One-sample t-test. The confidence interval is quite narrow because of the large sample size.
t.test(dt2$se, mu = 25)
# A smaller sample would give us a wider confidence interval.
t.test(sample_n(dt2, 100)$se, mu = 25)
```
## Independent sample t-test
Used when the goal is to compare the means of two independent samples. Usually independent means two groups of subjects where the subjects are not the same in both groups. For example, the groups could be two genders, or two countries, or two age groups that you may want to compare.
Entering the data for the *t.test* function can look a bit tricky. One way of doing it is by making use of the formula notation in R, which specifies the dependent variable (continuous variable), a tilde (~) and the independent variable (grouping variable). A perhaps easier way to understand it is to just give *t.test* the two vectors of values whose means we want to compare.
We get a p-value (testing framework) and a confidence interval (estimating framework).
```{r}
# Using the dep ~ indep formula notation. Comparing genders.
t.test(se ~ Gender, data = dt2)
# Same test, but now we give the function the two vectors we want to compare.
t.test(dt2$se[dt2$Gender == "Male"],
       dt2$se[dt2$Gender == "Female"])
# Comparing mean self esteem for two countries.
t.test(se ~ country, data = filter(dt2, country == "CA" | country == "GB"))
# Calculating means and SEs by country, to report in a paper.
dt2 %>% group_by(country) %>%
  summarise(mean = mean(se, na.rm = TRUE),
            sd = sd(se, na.rm = TRUE),
            N = n(), # The n() function counts missing values as well.
            N.valid = sum(!is.na(se))) %>% # here we count non-missing values only
  mutate(se = sd/sqrt(N.valid),
         lower.ci.95 = mean - 1.96*se,
         upper.ci.95 = mean + 1.96*se)
```
The traditional t-test for two independent samples assumes equal variance in both samples, and this assumption is easily violated. So by default *t.test* in R does not make that assumption; it adjusts the test so that it works well with unequal variances. For that reason, it is called the Welch t-test.
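The difference between the two versions is easy to see on simulated data with very unequal variances (the groups below are made up just for illustration):

```{r}
set.seed(123)
g1 <- rnorm(30, mean = 25, sd = 2)   # small group, small variance
g2 <- rnorm(100, mean = 26, sd = 8)  # large group, much larger variance
t.test(g1, g2)                     # Welch (default): fractional degrees of freedom
t.test(g1, g2, var.equal = TRUE)   # classical pooled-variance t-test: df = n1 + n2 - 2
```

When the variances really are equal, the two versions give very similar answers, so the Welch default costs little.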
## Mann-Whitney U test
Also called the Wilcoxon rank-sum test, this is a non-parametric test (meaning it does not assume normality).
The independent sample t-test assumes normality and a continuous measure; the MW test does not. Therefore, the MW test is more robust to non-normal data and to other types of data, such as ordinal data. Also, the MW test can be exact or approximate. When exact, it is valid for small sample sizes, while the t-test is only valid for small samples under the assumption of normality. If the data has ties (is not really continuous), then the exact version of the test cannot be calculated.
The MW test does not compare means; instead it tests whether the two samples come from the same population.
In practice, using the MW test is conservative and safer, but even in cases where the distribution is not normal the t-test tends to perform similarly to the MW test.
In the estimation framework, the confidence interval given by the test is for the "difference in location", the median of the differences. This quantity is confusing to interpret; an alternative is to stick to means and the t-test if you want to use the estimation framework, while using the MW test if you want to test for differences.
```{r}
# Test for independent samples (Mann-Whitney U test).
# The exact version cannot be calculated because of ties in the data.
wilcox.test(se ~ Gender, data = dt, conf.int = TRUE, exact = TRUE)
# Using it for an ordinal variable.
wilcox.test(Q1 ~ Gender, data = dt, conf.int = TRUE)
```
## Paired t-test
The paired t-test is used when the means to be compared are collected on the same sample (usually the same patients). The most common example of this situation at CAMH is pre-post designs, where you want to compare the mean at pre and at post (that is, you want to see how the mean changed over time in the same subjects). It can also happen when two raters rate the same patients and you want to compare their mean ratings.
To demonstrate the test we will compare the means of Q1 and Q2, two questions completed by the same subjects, so we need a paired t-test to compare them. Q1 and Q2 are ordinal, non-normal data, so the t-test would not normally be advised, although it can work well.
The test will calculate the difference between Q1 and Q2 for each subject, then test if the mean of the difference is zero. It will also give you a confidence interval for that difference.
```{r}
# We need to specify that we want the paired test.
t.test(dt$Q1, dt$Q2, paired = TRUE)
```
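The description above can be verified directly: the paired t-test is identical to a one-sample t-test on the within-subject differences. A sketch with made-up pre/post scores:

```{r}
set.seed(1)
pre  <- rnorm(20, mean = 10, sd = 2)          # hypothetical baseline scores
post <- pre + rnorm(20, mean = 0.5, sd = 1)   # small average improvement
t.test(pre, post, paired = TRUE)  # paired t-test...
t.test(pre - post, mu = 0)        # ...same t, df and p-value as this one-sample test
```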
## Wilcoxon Sign-Rank test
This is the non-parametric equivalent of the paired t-test, which you could use in general to be safer, particularly with ordinal and non-normal data.
The test returns a pseudo-median and a confidence interval for the difference. These are hard to interpret, so often the mean and its confidence interval are reported instead. For ordinal data like ours, one may just report the proportion of subjects in each level.
```{r}
# Test for paired sample (Wilcoxon signed rank test)
wilcox.test(dt$Q1, dt$Q2, paired = TRUE, conf.int = TRUE)
```
## One-Way ANOVA
ANOVA stands for Analysis of Variance. It is usually applied when we want to compare means across three or more groups; when there are only two groups it reduces to the two independent sample t-test. So we can say it is a generalization of the independent sample t-test to more than two means.
If you get a significant p-value in the ANOVA, it means there is some group difference, that is, evidence that the group means are not all equal.
ANOVA, like the t-test, assumes equal variance. It is better, then, to use the *oneway.test* function, which has an option not to assume equal variance across groups.
```{r}
# Here we have three ways to run an ANOVA.
a1 <- aov(se ~ country, data = dt2)
summary(a1)
a2 <- oneway.test(se ~ country, data = dt2, var.equal = TRUE)
a2
# Here we have a regression model (linear model), which generalizes ANOVA.
a3 <- lm(se ~ country, data = dt2 )
anova(a3)
# Not assuming equal variance in different countries. Preferred.
a4 <- oneway.test(se ~ country, data = dt2, var.equal = FALSE)
a4
```
The significant ANOVA is evidence that mean self-esteem differs across countries. The question then is what the difference is: which country differs from which?
This is a situation where the estimation framework is quite nice, since it avoids the use of multiple p-values. The *emmeans* function in the *emmeans* package can give you means and confidence intervals for each group. The interpretation is made much easier by an errorbar plot.
```{r}
em <- emmeans::emmeans(a1, specs = ~country)
em
ggplot(as.data.frame(em), aes(x = country, y = emmean)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.1) +
  ylim(0, 31) + # Starting the y axis at 0 gives the correct perspective of the difference.
  ylab("Mean Self Esteem and 95% CI") +
  theme_bw()
ggplot(as.data.frame(em), aes(x = country, y = emmean)) +
  geom_point(size = 2) +
  geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.1) +
  # ylim(0, 31) + # Not starting the y axis at zero magnifies differences.
  ylab("Mean Self Esteem and 95% CI") +
  theme_bw()
```
By looking at the graph above we get a good idea of the differences between countries and of how reliable the mean is for each country. We can try to draw conclusions from that.
The alternative is the testing framework, where we test every pair of countries with a t-test. When we use multiple p-values we get into a complicated situation in statistics: the usual interpretation of p-values is no longer valid. For it to remain valid to some extent, you need to apply an adjustment, but there are several different adjustments that can produce quite different results. The most conservative is the Bonferroni adjustment.
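What an adjustment does can be seen directly with the *p.adjust* function; the raw p-values below are hypothetical:

```{r}
# Hypothetical raw p-values from six pairwise comparisons
p <- c(0.001, 0.008, 0.020, 0.040, 0.150, 0.600)
p.adjust(p, method = "bonferroni") # multiplies each by the number of tests (capped at 1)
p.adjust(p, method = "fdr")        # Benjamini-Hochberg, less conservative
```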
Here you have a test comparing all pair of countries.
```{r}
pairwise.t.test(dt2$se, dt2$country, p.adj = "bonf")
```
Because Bonferroni is conservative (it ensures you don't get a significant p-value unless you have good evidence for the difference), you pay a big price for it: you lose power (you may not find anything significant). We have a large sample here, so losing power is less of a concern, as we may have "too much" of it. However, this is quite problematic with small samples. Notice that the estimation framework does not use p-values, so you don't have this problem.
The question is: do we really need to tell which country is different from which, with statistical justification? What will be done with the countries that are different?
Sometimes we don't need to pinpoint which countries differ from which, simply because nobody will do anything with that information. Maybe more studies will be carried out anyway to gather more evidence and better understand the differences. Sometimes action needs to be taken for every pair of countries, whether the difference is significant or not. In both cases an estimation framework would suffice.
That said, if we are still focused on the testing framework and have concerns about power, perhaps because our sample size is not very large, we can use other methods of adjustment for multiple tests. Here is the Tukey HSD method. You can see that the p-values are slightly lower than those adjusted with the Bonferroni method. This method needs the *aov* output.
```{r}
TukeyHSD(a1)
```
The False Discovery Rate is even less conservative and has more power (at the expense of type I error). This option is appealing when we have a very large number of tests.
The *pairwise.t.test* function offers other multiple comparison adjustment methods as well.
```{r}
pairwise.t.test(dt2$se, dt2$country, p.adj = "fdr")
```
## Kruskal Wallis test
The non-parametric equivalent of ANOVA is the Kruskal-Wallis test. It is robust to non-normality and non-continuous data, so it is safer. Like ANOVA, there is also a pairwise version, which reduces to the Mann-Whitney test we saw above.
```{r}
kruskal.test(se ~ country, data = dt2)
pairwise.wilcox.test(dt2$se, dt2$country, p.adjust.method = "bonf" )
# Comparing with t-test. In this case we see that the conclusions in terms of
# statistical significance would be the same.
pairwise.t.test(dt2$se, dt2$country, p.adj = "bonf")
```
# Correlation
## Pearson Correlation
The correlation coefficient is used to measure the **linear association** between two continuous variables. It varies between -1 and 1, with values close to zero indicating lack of association, values close to 1 indicating a strong positive association, and values close to -1 a strong negative association.
We look at the Pearson correlation between Age and Self-Esteem. We can test it and also obtain a confidence interval. Usually the presence of missing values is dealt with by selecting cases that are complete for each pair of variables.
```{r}
# Here we just calculate the correlation between two variables.
cor(dt2$Age, dt2$se, use = "pairwise.complete.obs")
# Here we test them.
cor.test(dt2$Age, dt2$se, use = "pairwise.complete.obs")
# Here we have a correlation matrix between many variables.
cor(select(dt, Q1:Q10), use = "pairwise.complete.obs")
```
Since correlation is a measure of *linear* association, it is always good to check if the association is in fact linear, which can be done visually with a scatterplot.
Here we fit a smoothing loess line as well as a linear model. The correlation is related to the slope of the linear model (it equals the slope when both variables are standardized). We see that the linear model is probably good enough; the loess line departs little from it.
```{r}
ggplot(dt2, aes(x = Age, y = se)) +
  geom_point(color = "gray80") +
  stat_smooth(se = FALSE, color = "black") +
  stat_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
  theme_minimal()
# Points will fall on top of each other. We can use a bubble chart where the
# size of the dots is proportional to the number of points at each location.
sdt <- group_by(dt2, Age, se) %>%
  summarise(N = n()) %>%
  ungroup()
# Now we need to weight the lines by N.
ggplot(sdt, aes(x = Age, y = se)) +
  geom_point(color = "gray80", aes(size = N)) +
  stat_smooth(se = FALSE, color = "black", aes(weight = N)) +
  stat_smooth(method = "lm", se = FALSE, linetype = "dashed",
              color = "black", aes(weight = N)) +
  theme_minimal()
```
## Non-Parametric Correlation
The Pearson correlation assumes that the data are continuous and normally distributed. The Spearman correlation, on the other hand, is based on ranks and does not make assumptions about the distribution of the data. So again, you are usually safer using the Spearman correlation.
```{r}
# Here we just calculate the correlation between two variables.
cor(dt2$Age, dt2$se, use = "pairwise.complete.obs", method = "spearman")
# Here we test them. This can take a long time for large samples.
cor.test(dt2$Age, dt2$se, use = "pairwise.complete.obs", method = "spearman")
# Here we have a correlation matrix between many variables.
cor(select(dt, Q1:Q10), use = "pairwise.complete.obs", method = "spearman")
```
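A sketch with made-up data shows the difference between the two coefficients: for a perfectly monotone but nonlinear relationship, the Spearman correlation is 1 while the Pearson correlation is not.

```{r}
x <- 1:20
y <- exp(x / 3)                 # monotone but strongly nonlinear in x
cor(x, y)                       # Pearson: well below 1
cor(x, y, method = "spearman")  # Spearman: 1, because the ranks agree perfectly
```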
That is it for this workshop. There are certainly a number of other basic tests available in R across different packages, but the ones we went through are the most popular and should address most of your needs. Also, if you are comfortable with the tests above, you should have less difficulty learning new ones.
Thank you!
Marcos