---
title: "Linear Regression"
author: "Marcos Sanches"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
    toc_depth: 4
---
```{r setup, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print=100)
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
```
# Objective
This workshop will introduce linear regression models. The goal is to give the user some understanding of regression models, their interpretation, and the definitions of confounders, interaction and estimated marginal means. We will also look into model diagnostics and graphs for visualizing models.
# A new RStudio project
Here are the steps to create a new RStudio project:
1. Create a folder for each data analysis project.
2. Open Rstudio.
3. Go to File > New Project > Existing Directory
4. Navigate to the folder you created
5. Click Create Project. A “…Rproj” file will be created in the folder. Next time, double-click this file to open your project.
6. Go to File > New File > R Markdown (or R Script) and click Ok.
7. Save the new R Markdown file. Here you will store your R script.
8. Make sure you store all your scripts. Ideally, the script should contain every single step from reading the data to the final results.
# Cleaning R environment
Again we clean the R environment. We can do that because we are confident the script we will create will reproduce all relevant analyses; we don't need to keep any dataset or analysis results other than the original dataset!
```{r}
rm(list = ls())
```
# Reading a SPSS dataset
For today's workshop we will use the SPSS dataset "Headache Wide V2.sav". We read it with the package *haven*.
This data is from a randomized crossover study comparing two analgesic drugs (A and B) and placebo (P) for relief of tension headaches.
Reference: Laird, N.M., Skinner, J. and Kenward, M. (1992). An analysis of two-period crossover designs with carry-over effects. Statistics in Medicine, 11, 1967-1979.
Description:
The study was designed to compare two active drugs and placebo for relief of tension headache. The two analgesic drugs were identical in their active ingredients except that one contained caffeine. The primary comparison of interest was between the two active drugs; the placebo was included for regulatory purposes.
Note that there were three treatments, but only two periods, i.e., each subject received only two of the three treatments in a random order. With three treatments and two periods, there are six possible treatment sequences, AB, BA, AP, PA, BP, PB, where A, B and P denote the two active drugs and placebo. In this study the AB and BA sequences were assigned three times as many subjects as the remaining four because of the interest in the A versus B comparison.
Note: Two headaches were treated within each period, and the response is the average pain relief for both headaches.
Note: We will use this dataset to go over different statistical tests in R; our objective is not to answer the research question of this study.
```{r}
dt <- haven::read_sav("Headache Wide V2.sav")
names(dt)
```
# Load packages
Here we load the packages that we will use in this workshop.
```{r}
library(tidyverse)
library(emmeans)
library(haven)
library(ggthemes)
library(sjstats)
library(sjPlot)
library(kableExtra)
library(sandwich)
library(lmtest)
library(estimatr)
library(splines)
```
As the data came from SPSS, we can use *as_factor* to transform some variables into factors and make interpretation easier. We also give the variables better names. It is important to ensure the variables we will use are in the proper format.
```{r}
dt2 <- dt %>%
mutate(Province = as_factor(prov),
Sequence = as_factor(seq),
Treatment = as_factor(treat.1),
Treatment2 = as_factor(treat.2),
Center = factor(Center)) %>%
rename(Pain.Score = Score1, Pain.Score2 = Score2) %>%
select(-prov, -seq, -treat.1, -treat.2)
```
# Causal Models
This workshop will focus on causal regression models, which are those models where you want to estimate regression coefficients to possibly interpret as effects. These models are carefully built according to the research question and incorporate the information that we have about the real world. They are models in the sense that they model the real world. For this reason such a model does not include all the information in the dataset; the included information is chosen based on subject matter theory.
A predictive regression model is less interested in the effect of particular variables, and more interested in the performance of the full model, as measured by the R-squared or classification accuracy. Usually these models will include most or all available information in the dataset, most of the time with disregard for how all this information interacts in real life.
Within the causal model framework we have well-defined concepts like moderators, mediators and confounders. These concepts don't apply to predictive models.
## Research Question 1
Let's assume for now that our research question is: *Does treatment group (A/B/P) at time 1 affect Pain Relief Scores?* Here we are using only time 1 for the sake of creating a simple model. Since this is a randomized crossover study, looking only at time 1 and ignoring time 2 will still give us information for answering the research question, although with less power.
### Regression Model
If a regression model is properly built and represents the real world well, where the dependent variable is caused by the independent variables, then we can interpret associations as effects. This is particularly true for a randomized trial like the one we have.
Here we have an ANOVA test for the overall effect of treatment. A significant p-value indicates evidence in favor of a treatment effect, that is, that mean Pain Relief differs across treatment groups.
```{r}
m1 <- lm(Pain.Score ~ Treatment, data = dt2)
anova(m1)
```
### Regression Coefficients
Here is the regression model and its coefficients, using the "summary" function. We can think of this model as answering "what is the **effect** of treatment on pain scores?". If the data are not randomized and we are unsure we have the most important controls, we should say "association" instead of "effect".
```{r}
m1 <- lm(Pain.Score ~ Treatment, data = dt2)
summary(m1)
```
Treatment being a **categorical variable** (remember that we defined it as a factor) will have one level as the reference, and the coefficients give the effect of the other levels relative to the reference. Here the reference is Treatment A. The effect of treatment B relative to A is -0.5 (p = 0.21, which means not very strong evidence that this difference is real). That is, the average Pain Score for B is 0.5 points below A. And the Placebo is 2.3 points below A, which is significant (p < 0.0001).
But it would probably be better to compare A and B with Placebo.
```{r}
dt2$Treatment <- relevel(dt2$Treatment,ref = "P")
m1 <- lm(Pain.Score ~ Treatment, data = dt2)
summary(m1)
```
Now we have the difference in means (effect) for A and B relative to Placebo.
**Important** - The thing to keep in mind is that when the independent variable is **categorical**, the regression coefficients represent differences in means of the dependent variable.
**Important 2** - The second thing to keep in mind is that associations can be interpreted as effects if you have a randomized trial (the independent variable is randomized), or if it is not randomized but you are sure you are taking into account the most important controls.
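As a quick sanity check, we can verify this directly with the objects created above: with a single categorical predictor, each coefficient is exactly the difference between that group's raw mean and the reference group's mean (assuming missing values, if any, are handled the same way in both calculations).

```{r}
# Sanity check: with one categorical predictor, each coefficient equals
# the difference between that group's mean and the reference group's mean.
group.means <- tapply(dt2$Pain.Score, dt2$Treatment, mean, na.rm = TRUE)
group.means
coef(m1)
# e.g. coef(m1)["TreatmentA"] should match group.means["A"] - group.means["P"],
# since "P" is now the reference level.
```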
### Estimated Means
Usually, though, when the predictor of interest is categorical it is cleaner to present the means with SE or confidence intervals, rather than regression coefficients.
**Estimated Marginal Means (EMM)** are the means of the outcome for each level of the predictor of interest when we control for other important variables. If there are no other predictors in the model, then the EMMs will be the same as the raw means in the data.
The package "emmeans" helps you get EMMs. Here we get both the EMMs and the differences in EMMs for all pairs of treatments. All this comes from the model, not from the data.
```{r}
em <- emmeans(m1, pairwise ~ Treatment, adj = "Bonferroni")
em
```
Notice that because we adjust the p-values, they are different from what we got with "summary(m1)" above; otherwise they would be the same.
And we can show it in a graph, which is a good and transparent practice (as compared to just reporting p-values). Notice that in the graph we can clearly see how different A, B and placebo are from each other, and how precise the estimates are. Notice also that we start the y-axis from zero to avoid inflating the perception of differences between treatment groups.
```{r}
plotdata <- em$emmeans %>% as.data.frame()
plotdata
ggplot(plotdata, aes(x = Treatment, y = emmean))+
geom_point(size = 2, shape = 1, stroke = 2)+
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.1)+
ylab("EMM +/- 95% CI")+
ylim(0,12)+
theme_base()
```
The idea is that you first look at the ANOVA to see if there is any treatment effect. If so, then you explore model coefficients or EMMs. However, if you are using tests, it is even better to go straight to the test you want without looking at the ANOVA (the fewer p-values, the better). If you can get that with "emmeans", you probably should.
### Formatting output
The function "tab_model" from "sjPlot" package gives you a nice table for the regression model.
```{r}
tab_model(m1)
```
And "kable" from package "kableExtra" can help you format any table.
```{r}
kable(anova(m1),
digits = 2,
caption = "ANOVA Table - Treatment effect on Pain Relief") %>%
kable_styling(full_width = FALSE)
```
## Research Question 2
Let's pretend we don't have a randomized trial, so that treatment groups are not assigned to subjects randomly. Now, we need to worry about *confounders*. You can think of confounders as variables that simultaneously affect the independent and the dependent variable. When that happens, there will be an association between the independent and dependent variables that is solely due to the fact that they have a common cause.
**A classical example** is marital status (single/married) -> mortality rate. We will notice that the average mortality rate is higher among married individuals. Does being married increase one's probability of dying? Well, not really. There is a clear confounder: age. Married people are older, and that is why they are associated with higher mortality rates. We need to compare single and married individuals **of the same age**, that is, we need to control for age if we want to estimate the true effect of being married apart from the age effect. If we control for age, then we find in general that the mortality rate is lower for married individuals, meaning that being married is in fact protective against death.
This is the issue with non-randomized studies: there can always be confounders lurking around, so we need to think hard about what they are and control for them. Demographics are good candidates because they cause things; they are not caused by things. Let's take age. It could be that older people choose different treatment groups as compared to younger people, and that older people have a different perception of pain relief. So, age can affect both, and can be a confounder. It is not reasonable to think treatment group can affect your age, or that your pain relief score can affect your age - the effect will always be in the opposite direction.
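To make the confounding idea concrete, here is a toy simulation (entirely hypothetical numbers, not from our dataset): age increases both the chance of being married and the mortality risk, while marriage itself is slightly protective. The crude and adjusted coefficients for marriage then have opposite signs.

```{r}
# Toy simulation of confounding (hypothetical data, not from our study):
# age drives both marriage and mortality risk; marriage itself is protective.
set.seed(123)
n <- 5000
sim.age <- runif(n, 20, 80)
sim.married <- rbinom(n, 1, plogis((sim.age - 50)/10))  # older -> more likely married
sim.risk <- 0.02*sim.age - 0.3*sim.married + rnorm(n)   # marriage effect is -0.3
coef(lm(sim.risk ~ sim.married))            # crude effect: positive (confounded by age)
coef(lm(sim.risk ~ sim.married + sim.age))  # adjusted effect: negative (protective)
```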
Let's then control for Age, Province and Center, as we think they may be confounders. Now our model is a bit more complex.
```{r}
m2 <- lm(Pain.Score ~ Treatment + age + Province + Center, data = dt2 )
summary(m2)
```
Notice that we are still interested in the treatment effect; the other variables are controls and we are probably not interested in them. Here we have the ANOVA table, where the treatment groups are still highly significant.
**Note** - When we have more than one predictor, we should use the "Anova" function from the "car" package and specify type 3 sums of squares, which is more appropriate for unbalanced data and when you are not interested in the order of the variables in the model. The "anova" function from base R will give you type 1 sums of squares, which does sequential tests for which the variable order matters. In this case, for example, a sequential test would imply that we test Treatment without accounting for any other variable, since Treatment enters the model first.
```{r}
car::Anova(m2, type = 3)
```
And we can still get the treatment effect using "emmeans", but now it is adjusted for Center, age and Province.
```{r}
em <- emmeans(m2, pairwise ~ Treatment, adj = "Bonferroni")
em
```
And the EMM plot. Things don't change much, meaning that Center, age and Province may have some effect, but they don't seem to be important confounders.
```{r}
plotdata <- em$emmeans %>% as.data.frame()
plotdata
ggplot(plotdata, aes(x = Treatment, y = emmean))+
geom_point(size = 2, shape = 1, stroke = 2)+
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.1)+
ylab("EMM +/- 95% CI")+
ylim(0,12)+
theme_base()
```
### Continuous Predictors
In this case, the key variable of interest is treatment group, a categorical variable. But sometimes we are interested in the effect of a **continuous variable**.
**Research Question** - what is the effect of age on pain relief score?
The coefficient of age in the model is an estimate of that effect. It indicates the change in pain score resulting from increasing age by 1 unit.
```{r}
summary(m2)
```
If we increase age by one unit, the pain score changes by -0.003. That is, as age increases, the pain score decreases (negative coefficient), but by a very small and non-significant amount.
That means, if you compare a 20-year-old person with a 21-year-old person, the 21-year-old is predicted to have a pain score on average 0.003 lower. Same if you compare a 43-year-old with a 44-year-old. -0.003 is the change caused by a one-year change in age, no matter from which age to which age.
If we ignore the effect of controls (assume they have little effect), then we can visualize this association in a scatterplot.
```{r}
ggplot(dt2, aes(x = age, y = Pain.Score))+
geom_point(color = "gray75")+
stat_smooth(method = "lm", formula = y~x, se = F)+
theme_fivethirtyeight()
```
We see that, as expected, the pain score goes down with age, but only slightly and far from significantly.
Every continuous predictor coefficient is interpreted the same way: **as we increase the predictor by 1 unit, the dependent variable changes by the [coefficient of the predictor].**
**Important** - We generally don't expect a change of 1 year in age to have much effect on most things. So the regression coefficient for age, measured in years, tends to be small, sometimes very small. To overcome that, you can change the variable's scale to age in 10 or 20 years (by dividing it by 10 or 20). The model will be the same but the coefficients will be bigger (the change in the outcome from changing age by 10 or 20 years).
### Multiple Predictors
A word on the interpretation of regression coefficients with multiple predictors. As we have seen, both categorical and continuous predictors have their coefficients interpreted as the **average** effect on the dependent variable caused by changing the predictor by one unit. For a continuous predictor, one unit is an obvious change, like increasing age by 1 year. For a categorical predictor, one unit is to move from the reference to any of the other groups (all of them are one-unit changes because each category is internally a 0/1 variable).
However, if we have controls in the model, it is important to remember that the effect is equivalent to changing the predictor by one unit while **keeping the other predictors fixed**. Because this effect assumes the other variables are held fixed, it is called a marginal effect.
Let's say we only have treatment group (A/B/P) and sex (male/female). If we do not control for sex, then the coefficients of A and B will simply be the differences from P in the average outcome. However, once we add sex, the coefficient of A will be the average of the coefficient of A for males and the coefficient of A for females.
## Regression Diagnostic
Regression models have assumptions, as does most statistical analysis. Assumptions are important because the validity of inferential results depends on them.
Imagine that I don't know how to swim and there is a swimming pool. I can say "Assuming that this swimming pool is shallow, I can jump into it and I will not drown". The whole conclusion may not be valid if the assumption is incorrect. Or it may still be valid - maybe the pool is not too deep, or someone throws me a life jacket.
Imagine that you want to test if I can swim and there is a swimming pool. You think: "assuming the swimming pool is deep enough, I can throw him into it and if he drowns it is because he cannot swim". But your assumption is incorrect and the swimming pool is shallow. The result is that I don't drown, from which you conclude I can swim. However, your assumption was wrong, so your conclusion may be wrong too; you just don't know.
In statistics it is the same: if the assumptions are violated, we don't know how valid our results are. Well, sometimes we have some idea. **Regression Diagnostic Analysis** is when you try to see if the assumptions are reasonable.
### Normality of Residuals
Notice that the normality assumption is not for the dependent variable, but for the residuals of the model.
This is not a strong assumption and I prefer to assess it visually using a histogram. If your sample size is reasonably large (10 times the number of parameters), then you probably should not worry too much about normality (see [1]).
*[1] Schmidt, Amand F., and Chris Finan. "Linear regression and the normality assumption." Journal of clinical epidemiology 98 (2018): 146-151.*
In our case, we have 19 coefficients + 1 standard error = 20 parameters. A sample of 200 or more should be large enough according to the reference. Our sample is 423. So, let's take a look at the distribution of the model residuals.
```{r}
dt2$res.m2 <- residuals(m2)
ggplot(dt2, aes(x = res.m2 ))+
geom_histogram(bins = 30)+
theme_tufte()
```
The residuals are clearly not normally distributed, but I would say they are probably good enough. You can try transformations to achieve normality, but the price you pay in terms of interpretation and additional work is not worth the gains. Heavily skewed variables may be transformed using log or square root, but if that is not the case you are probably okay.
If you really need, or want, to do it, there are normality tests in R, of which the most popular is probably the Shapiro-Wilk test.
```{r}
shapiro.test(residuals(m2))
```
Notice that the test is highly significant, as one would expect because the residuals don't look very normal. This is probably not consequential for inference, but if you focus on the result of the normality test, you will be forced to do something about it, like a transformation. It may not be easy to find a transformation that works well, and it may make interpretation a big pain. That is too high a cost to pay for little or no gain other than apparently doing "good stats".
### Outliers
Outliers are defined as large residuals, and you can just look at the same histogram we made before to inspect for such large residuals. The "plot" function provides different types of diagnostic plots; plot number 5 contains standardized residuals, which, if larger than 3 or smaller than -3, can be considered outliers.
Here we don't really see outliers, but if we did we would need to inspect the data points to see if there is something wrong with them and whether they should be removed. You can also re-fit the model without the offending point with the high residual and see if much changes.
```{r}
plot(m2,which = 5)
```
Here we look for any offending point; we see there are none.
```{r}
dt2$std.res <- MASS::stdres(m2)
dt2 %>% filter(abs(std.res) > 3)
```
### Leverage
Points with high leverage are points that have atypical values for the predictor variables. Usually, a point with leverage larger than 2 times the average leverage is considered suspicious. We can use the same plot we used above to take a look at the leverages.
Visually, it seems that every point with leverage above 0.08 stands out.
```{r}
plot(m2,which = 5)
```
We can take a look at these points: who they are and the values of the predictors. The function "hatvalues" gives us the leverage. We see that Centers 1, 2 and 3 are rare in BC and 4, 5 and 6 are rare in AB. That may explain why these points are "atypical".
```{r}
dt2$leverage <- hatvalues(m2)
meanhat <- mean(hatvalues(m2))
meanhat
dt2 %>% filter(leverage > 0.08) %>%
select(id,Treatment, age, Province,Center) %>%
as.data.frame()
with(dt2, table(Province,Center))
```
We can rerun the model without them. Notice that the removal of these points affected the p-values of Province and Center meaningfully, but not our key variable Treatment.
```{r}
lm.test <- update(m2, data = filter(dt2, leverage<0.08))
car::Anova(lm.test, type = 3)
car::Anova(m2, type = 3)
```
Here we compare model coefficients. The coefficients for treatment group did not change much, so in the absence of a good external reason to remove the high leverage values, I would stick with the model with the full data.
If they had changed, we should probably be transparent and show both results, while selecting as the primary result the one we think makes more sense (if we don't see anything wrong with the high leverage values, then maybe the full data model makes more sense).
```{r}
tab_model(lm.test,m2,
title = "Comparison of Model Coefficients",
dv.labels = c("Model without large leverage","Model with full data"))
```
### Influence
Influence diagnostics are measures that aim at telling you how influential each data point is on the model as a whole, or on particular coefficients.
Cook's Distance is a measure of the overall influence each point has on the model. Plot number 4 lets you visualize it.
```{r}
plot(m2,which = 4)
```
There is a rule of thumb that a Cook's Distance > 4/(n-p-1) indicates an influential point, where n is the sample size and p the number of predictors, so that n - p - 1 is the residual degrees of freedom of the model. We can take a look at it.
```{r}
dt2$cook <- cooks.distance(m2)
th <- 4/(m2$df.residual)
dt2 %>% filter(cook > th) %>%
select(id,Treatment,age,Province,Center, Pain.Score, cook) %>%
as.data.frame()
```
There are many such points. While you can take a look at them all, it would probably not be reasonable to call them all "influential" and remove them all. You don't want to remove too many points. An alternative is to visualize the histogram of Cook's Distance and decide based on that.
```{r}
ggplot(dt2, aes(x = cook))+
geom_histogram(bins=50)+
theme_bw()
```
Let's say in this case we would remove points with Cook's Distance of 0.02 and higher, as these are the really extreme ones. Let's do it both ways.
```{r}
test1 <- update(m2, data = filter(dt2, cook < th))
test2 <- update(m2, data = filter(dt2, cook < 0.02))
tab_model(m2,test1,test2,
title = "Comparison of models without influencial points",
dv.labels = c("Full Data","Cook < 4/df","Cook < 0.02"))
```
Considering the coefficients of Treatment, the changes don't seem big whichever way we do it, and the difference to Placebo is always significant. We can also compare estimated marginal means for treatment groups (below). In general there is not much evidence of damaging influential points, and we can stick with the full data.
```{r}
emmeans(m2, ~Treatment)
emmeans(test1, ~Treatment)
emmeans(test2, ~Treatment)
```
### Heterogeneity of Variance
An assumption often neglected, but more important than normality, is that the variance of the residuals is the same in all groups and throughout the data. That is because the linear regression model estimates only one variance parameter, which appears as the **residual standard error** when you use the *summary* function on the model.
```{r}
summary(m2)
```
We check this assumption with plot number 3, which is a scatterplot of the model's predicted values by the square root of the standardized residuals. In our case, as we see below, the variability of the points seems to increase a little as the fitted values increase, but not by much. The heterogeneity of variance is not very clear, so it is not concerning in this case.
When we do have heterogeneous variance, we can deal with it by using a transformation, robust standard errors, or by modeling the variance.
```{r}
plot(m2, which = 3)
```
Like normality, the visual inspection of the plot above should be good enough. However, there are also tests, like the Breusch-Pagan test, which tests for heterogeneous variance. These tests are problematic because in large samples they will detect departures from homogeneous variance that are too small to be of concern, and in small samples they will not detect large and concerning violations. But here is the test, which is significant for our data.
```{r}
lmtest::bptest(m2)
```
One way to deal with heterogeneous variance is to use robust standard errors. The package "estimatr" has the function "lm_robust", which does that.
```{r}
r2 <- lm_robust(Pain.Score ~ Treatment + age + Province + Center,
se_type = "HC3",
data = dt2 )
summary(r2)
```
Or we can use "tab_model" from "sjPlot" package to get a nicer table.
```{r}
tab_model(m2, vcov.type = "HC3", show.se = TRUE)
```
### Linearity
If you have continuous predictors, regression models will assume they are linearly related to the dependent variable. This assumption generally holds, but it tends to be very consequential if the actual relationship departs to a large extent from linearity.
You can check this assumption in two ways, and in both cases just a visual inspection will in general be good enough. Let's consider age, which is the only continuous predictor we have.
The first thing you can do is a scatterplot of your continuous predictor (age) and the outcome (in this case the pain score), and overlay both the best straight line, and a non-parametric smoothing line (like a loess model), and compare.
```{r}
ggplot(dt2, aes(x = age, y = Pain.Score))+
geom_point(color = "gray70")+
stat_smooth(linetype = "dashed", color = "black")+
stat_smooth(method = "lm", color = "black")+
theme_gdocs()
```
The lines are not exactly the same, but they are very close and the straight line seems good enough, showing that we don't need to worry about the linearity assumption. You can also look at the confidence bands, which overlap a lot, indicating that the models are not that different.
The alternative would be to use residuals, instead of the actual outcome, to account for other predictors.
```{r}
dt2$residuals <- residuals(m2)
ggplot(dt2, aes(x = age, y = residuals))+
geom_point(color = "gray70")+
stat_smooth(linetype = "dashed", color = "black")+
stat_smooth(method = "lm", color = "black")+
theme_gdocs()
```
Again, we don't see any trend; the residuals look quite random when plotted against the predictor of interest, which is age.
If we had found that a linear model does not fit well, one thing we could do is try transformations of the dependent variable, or of age. I usually avoid transformations of the dependent variable if I can, because they can make interpretation more difficult. Maybe the only exception is when a log transformation is the one suggested, which will be the case if the dependent variable is skewed to the right (not our case) and/or the scatterplots above show that the relationship is exponential, not linear.
If a log transformation of the dependent variable is not appropriate, then a transformation of the independent variable can be tried. Usually, a polynomial transformation will solve the issue. Here we add a quadratic function of age (just as an example, since we don't need it in this case).
```{r}
m2a <- lm(Pain.Score ~ Treatment + age + I(age^2) + Province + Center,
data = dt2 )
summary(m2a)
```
We see that the quadratic component of age has a far from significant coefficient. The significance of this coefficient could also be used as an indicator of whether the linear trend is enough, although, as I said, a visual inspection should be enough and avoids the overuse of p-values.
Finally, if the continuous predictor is there just as a control, it could be a good idea to use a more flexible function from the start, rather than assuming linearity. The idea is that we just want to ensure the predictor is well accounted for, without doing lots of work on how to account for a possible lack of linearity.
In our case, for example, we could fit a cubic spline function of age from the start, rather than a linear function. Here, the "bs" function from the "splines" package will do the job. Splines can have different numbers/positions of knots, and different degrees, but here we stick to the default: a cubic spline with no interior knots (just a cubic polynomial).
```{r}
m2b <- lm(Pain.Score ~ Treatment + bs(age) + Province + Center,
data = dt2 )
summary(m2b)
car::Anova(m2b, type = 3)
```
### Multicollinearity
Multicollinearity usually happens more often in predictive models, in which you may add many variables at the same time. In explanatory models it is rare, since you will (or should) choose variables for the model carefully.
Multicollinearity happens when a variable in the model is predicted by the others perfectly, or almost perfectly. It could happen if you have two variables that are very similar (for example, you measured depression in two ways and you add both into the model as predictors), or when categorical variables have levels that overlap (for example, you add to the model Province and Region, but Ontario is considered both a Province and a Region in your definition of Region). If you are carefully choosing predictors for your model, these things tend not to happen.
When predictors are highly correlated, it means they overlap a lot and it is difficult for the model to tell the effect of one from the effect of the others. The result is that the estimated effect (coefficient) can become very imprecise. This can be a problem, particularly if you are interested in those coefficients.
Indications of multicollinearity can be found in model coefficients that are extremely high or low, or cannot be estimated, or a whole variable being dropped from the model.
The "vif" function in the "car" package stands for Variance Inflation Factor. It estimates how much the variance of each coefficient increases because of the correlation with the other predictors. In models with variables with many levels, vif will output the generalized variance inflation factor (GVIF), and a rule of thumb is that you should start worrying about predictors with this factor above 10. Notice that we should look at the adjusted GVIF, in the last column.
```{r}
car::vif(m2)
```
In our case we don't have much to worry about. Also, if a variable with high VIF is not of interest (meaning it is a control), we are probably okay too. However, a high VIF is also a call for you to rethink your model - whether the variables you have in the model are the correct ones. A remedy for high VIF is to remove the offending variable. You can also conduct a sensitivity analysis, where you fit a model without the offending variable and compare the results.
# Multiple Tests
It is often the case when using regression that we are led to do multiple statistical tests. As we have seen in our presentation about p-values, tests need to be pre-planned, and if we do more than one we need to adjust the p-values, and we need to ensure we have good power to do that.
## Adjustment
There is no consensus on which adjustment is best if you do multiple tests, and this is a big issue because there are many types of adjustments out there. Besides, there is no free lunch, and adjustment always comes at the expense of losing power (that is, you lose the ability to detect existing effects). It goes like this:
We apply adjustments because we want to ensure that when we say an effect is significant, we really have an effect there. The price we pay for that confidence in the significant effects is that we are very strict with what we call "significant" and end up calling lots of real effects "non-significant".
That is, we lose power to detect real effects, unless they are really big.
An alternative would be to not be so strict about what we call "significant", that is, not use adjustment. Without adjustment we are less confident when we say some effect is significant, but at least now we will not miss so many effects.
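The power cost of adjustment can be seen with base R's p.adjust function (this toy example is not from the workshop data):
```{r}
# Toy example: five raw p-values from five pre-planned tests
p_raw <- c(0.001, 0.012, 0.030, 0.045, 0.20)
# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p.adjust(p_raw, method = "bonferroni")
# Holm is uniformly less conservative than Bonferroni
p.adjust(p_raw, method = "holm")
```
Four of the five raw p-values are below 0.05, but only one survives the Bonferroni adjustment: that is the power we give up in exchange for confidence.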
We have seen already how to do pairwise comparison and apply adjustment, using "emmeans" command from the "emmeans" package. Here we do it for treatment group in our second model.
```{r}
emmeans(m2, pairwise ~ Treatment, adjust = "bonferroni")$contrasts
```
Or we can use the Tukey adjustment, which is less conservative than Bonferroni (the adjusted p-values are lower).
```{r}
emmeans(m2, pairwise ~ Treatment, adjust = "tukey")$contrasts
```
Or we can turn the adjustment off.
```{r}
emmeans(m2, pairwise ~ Treatment, adjust = "none")$contrasts
```
Here we apply Bonferroni again, but we limit the number of tests by testing only the pairs that are relevant: each treatment against the control. It is always a good idea to do as few comparisons as possible.
```{r}
emmeans(m2, trt.vs.ctrl ~ Treatment, adjust = "bonferroni")$contrasts
```
## Estimation Framework
And as we have already seen, an alternative is to use the estimation framework. That is, just report means and confidence intervals, and don't focus on p-values. You can usually do that with a graph like an errorbar chart, and it is particularly appealing when you have a factor with many levels.
Imagine, for example, that we are interested in the effect of Center. There is a huge number of pairwise comparisons, and it would cost a lot of power to use something like a Bonferroni adjustment.
You can instead show estimates and confidence intervals.
```{r}
emmeans(m2, ~ Center)
```
Or even better, get them into an errorbar plot. We can then interpret the plot by noticing that three Centers tend to have higher Pain Relief scores than the others, without going through all the pairwise comparisons.
```{r}
plotdata <- emmeans(m2, ~ Center) %>% as.data.frame()
ggplot(plotdata, aes(x=Center, y=emmean))+
geom_point(shape = 1,size = 3, stroke = 2)+
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.1)+
ylab("EMM +/- 95% CI")+
ylim(0,15)+
theme_clean()
```
# Interactions
Interactions in regression models have to do with the study of effect modifiers, or moderators.
For example, the researcher could have some reason to think that the treatments would improve pain, but that this effect would be moderated by Province, because the treatment involves some procedures that vary across Provinces.
**Research Question** - Does Province moderate the treatment effect? Does the effect of treatment depend on which Province it is administered in? Does the treatment effect vary depending on the Province?
These are all the same question, which can be addressed by an interaction between treatment group and Province in the regression model. By using "*" between Province and Treatment we get R to add Treatment, Province and the Treatment by Province interaction.
```{r}
options(max.print = 500)
m3 <- lm(Pain.Score ~ Treatment * Province + age + Center,
data = dt2)
summary(m3)
car::Anova(m3, type = 3)
```
It is a little difficult to study an interaction using the model coefficients alone. One thing I like to do is look at the ANOVA table, and if the interaction is significant, look at the estimated marginal means.
Here are the EMM:
```{r}
emmeans(m3, ~Treatment*Province)
```
But it is better to create a plot.
```{r}
plotdata <- emmeans(m3, ~Treatment*Province) %>% as.data.frame()
ggplot(plotdata, aes(x = Province, y = emmean))+
geom_point(aes(shape = Treatment, color = Treatment), size = 2,
position = position_dodge(width= 0.2))+
geom_errorbar(aes(ymin = lower.CL,ymax = upper.CL, color = Treatment),
width = 0.1, position = position_dodge(width= 0.2))+
theme_wsj()
```
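A common follow-up (a sketch, not shown in the original workshop) is to test the simple effects, that is, the pairwise treatment comparisons within each Province, which emmeans supports via the `|` conditioning operator:
```{r}
# Pairwise treatment comparisons within each level of Province,
# with Bonferroni adjustment applied within each Province
emmeans(m3, pairwise ~ Treatment | Province, adjust = "bonferroni")$contrasts
```
This tells you in which Provinces, if any, the treatment differences actually show up.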
When continuous variables are involved in the interaction, things are a little trickier.
Let's assume the research question has to do with Age as the moderator of the treatment effect, instead of Province. Here is the model.
```{r}
m4 <- lm(Pain.Score ~ Treatment * age + Province + Center, data = dt2)
summary(m4)
car::Anova(m4, type = 3)
```
In this case the age by Treatment interaction is not even close to significant. But if it were, we would need to look at the treatment effect at different ages (to see how age moderates the treatment effect).
```{r}
emmeans(m4, ~Treatment*age, at = list(age = seq(15,80,5)))
plotdata <- emmeans(m4, ~Treatment*age, at = list(age = seq(15,80,5))) %>% as.data.frame()
ggplot(plotdata, aes(x = age, y = emmean))+
geom_point(aes(shape = Treatment, color = Treatment), size = 2,
position = position_dodge(width= 2))+
geom_errorbar(aes(ymin = lower.CL,ymax = upper.CL, color = Treatment),
width = 0.1, position = position_dodge(width= 2))+
theme_wsj()
```
The differences between treatment groups are similar at all ages, which is expected because the interaction p-value was very high, meaning there is no evidence that age influences the treatment effect. And this is what we see in the graph. We also see that the precision of the estimates increases as we get close to the middle of the age range, where we have more data, whereas at the extremes of age the sample is smaller.
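Another way to look at a continuous moderator (a sketch, using the emmeans package already loaded above) is the emtrends function, which estimates the slope of age within each treatment group; under a real interaction these slopes would differ:
```{r}
# Estimated slope of Pain.Score on age within each Treatment group
emtrends(m4, ~ Treatment, var = "age")
# Test whether the age slopes differ between treatment groups
emtrends(m4, pairwise ~ Treatment, var = "age")$contrasts
```
With a non-significant interaction, as here, we expect the estimated slopes to be similar across groups.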
Okay folks, that is it for today!
Thanks for the interest,
Marcos