---
title: "Mixed Models and Correlated Data"
author: "Marcos Sanches"
date: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: kate
    toc_depth: 4
---
```{r setup, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print=100)
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
```
# Objective
As we reach our final workshop, we also arrive at one of the most used models at CAMH - the mixed effect models, which are also called multilevel models in many contexts.
These are models for correlated data, of which longitudinal data is the most common example. We can also think of it as repeated measures within sampling units, such as repeatedly measuring depression in the same subject, or repeatedly measuring a biomarker over many regions in the brain of a single subject.
The key feature of such data is that the observations are not independent. As you collect new data within the same subject, that new data is only partially new, because it tends to be very similar to previous data collected for the same subject. Traditional statistical techniques overlook this and treat every new piece of information as independent and 100% new.
While there are many different situations where this sort of dependence can arise, today we will focus on mixed effect models for continuous, normally distributed data, which are by far the most popular.
We hope that by understanding a bit of these models you will be in a reasonable position to move on to related models that address other situations with correlated data, like models for binary and count data.
# A new RStudio project
Here are the steps to create a new RStudio project:
1. Create a folder for each data analysis project.
2. Open Rstudio.
3. Go to File > New Project > Existing Directory
4. Navigate to the folder you created
5. Click Create Project. A “…Rproj” file will be created in the folder. Next time, you double click this file to open your project.
6. Go to File > New File > R Markdown (or R Script) and click Ok.
7. Save the new R Markdown file. Here you will store your R script.
8. Make sure you save all your scripts. Ideally, the script should contain every step, from reading the data to the final results.
# Cleaning R environment
Again we clean the R environment. We can do that because we are confident the script we create will reproduce all relevant analyses; we don't need to keep any dataset or analysis results other than the original dataset!
```{r}
rm(list = ls())
```
# Reading a SPSS dataset
For today's workshop we will use the SPSS dataset "Cholesterol Wide.sav". We read it with the package *haven*. Here is a short description of the data.
--------------------------------------------
Data on serum cholesterol from the National Cooperative Gallstone Study (NCGS).
Source: Table 2 (page 657) of Wei and Lachin (1984).
Reproduced with permission of the American Statistical Association.
Reference: Wei, L.J. and Lachin, J.M. (1984). Two-Sample Asymptotically Distribution-Free
Tests for Incomplete Multivariate Observations. Journal of the American Statistical Association, 79, 653-661.
Description:
The data are from the National Cooperative Gallstone Study
(NCGS), where one of the major interests was to study the safety
of the drug chenodiol for the treatment of cholesterol gallstones
In this study, patients were randomly assigned to high-dose (750 mg per
day), low-dose (375 mg per day), or placebo. This dataset consists of
a subset of data on patients who had floating gallstones and who were
assigned to the high-dose and placebo groups.
In the NCGS it was suggested that chenodiol would dissolve gallstones
but in doing so might increase levels of serum cholesterol. As a result,
serum cholesterol (mg/dL) was measured at baseline and at 6, 12, 20, and
24 months of follow-up. Many cholesterol measurements are missing because
of missed visits, laboratory specimens were lost or inadequate, or
patient follow-up was terminated.
Variable List:
Treatment Group (1=High dose chenodiol, 2=Placebo), Subject ID,
Response at Baseline, Response at Month 6, Response at Month 12,
Response at Month 20, Response at Month 24.
Note: Response = Serum cholesterol (. denotes missing value).
```{r}
dt <- haven::read_sav("Cholesterol Wide.sav")
names(dt)
```
# Load packages
Here we load the packages that we will use in this workshop.
```{r}
library(tidyverse) # package for data manipulation and plotting
library(emmeans) # estimated marginal means
library(haven) # SPSS dataset tools
library(ggthemes) # themes for ggplot2
library(sjPlot) # tools for summarizing model coefficients
library(kableExtra) # formatting tables in HTML
library(lme4) # most popular package for mixed models
library(lmerTest) # useful package for mixed models
library(nlme) # useful package for mixed models
```
As the data came from SPSS, we can use *as_factor* to transform some variables into factors and make interpretation easier. We also give the variables better names. It is important to ensure the variables we will use are in the proper format.
```{r}
dt2 <- dt %>%
mutate(sex = as_factor(sex),
age_grp = as_factor(age_grp),
treat = as_factor(treat),
miss = as_factor(miss))
```
# Wide and Long dataset
We will usually get the data in **wide** format. That means that each sample unit, usually the subject, is in one line, and repeated measures are in different columns. Our data is in wide format, in this case with one subject per line and multiple columns representing each time point.
```{r}
dt2 %>% select(uniqueid,treat,m0:m24) %>% head()
```
A mixed effect model requires the data to be in **long** format. You can think of long format as the outcome being in a single column; in this case, the cholesterol levels spread across many columns should be stacked into one. Since we have 5 columns corresponding to repeated occasions when cholesterol was measured, in the long format each subject will have 5 lines (or fewer if there are missing values). We will also need a time variable.
The *pivot_longer* function in the *tidyr* package, loaded with *tidyverse*, is one possible way to restructure the data from wide to long. Many people also use the *melt* function from the *reshape2* package.
```{r}
long <- dt2 %>%
pivot_longer(cols = m0:m24) %>%
rename(period = name, chol = value , subject = uniqueid) %>%
mutate(time = as.numeric(substr(period,2,3)),
period = factor(period, levels = c("m0","m6","m12","m20","m24"))) %>%
select(-id,-miss)
head(long,10)
```
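If you ever need to go the other way, from long back to wide, *pivot_wider* reverses the operation. A short sketch, assuming the *long* dataset created above:

```{r}
# Back from long to wide: one column per period (sketch using the long data above)
wide_again <- long %>%
  select(subject, treat, period, chol) %>%
  pivot_wider(names_from = period, values_from = chol)
head(wide_again)
```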
# The Mixed Effect Model
The Mixed Effect Model contains fixed and random effects in its specification and can be further enhanced by adding residual covariance modelling with the *nlme* package.
The definitions of fixed and random effects are not fully agreed upon in the literature, and things may become clearer to you with practice. However, we could briefly say that:
**Fixed Effects** are variables whose effects we care about directly, and which imply "few" effects: categorical variables with few levels such as treatment group, sex, ethnicity, marital status, time, etc.
**Random Effects** are often categorical variables with many levels, like subjects, or categorical variables that were part of the sampling design (like centres or schools), in that a random sample of their levels was selected for the study. You can also think of these as categorical variables that have many levels, not all of which are in our study. For example, there are many study centres, but we selected just a few for our study; many subjects, but for our study we selected 100 of them. If there are too few levels (for example, only 3 study centres) then you may consider using the variable as a fixed effect, as you may start getting convergence issues when modeling it as a random effect.
Again, this definition is a bit rough, and you will get a better sense of it as you gain experience with these models.
In simple longitudinal data like the one we have, things are simple: subjects are random effects and all else can be defined as fixed effects.
## Research question
In order to proceed with some modeling, lets define our research question. Later on we will change it to get familiar with alternatives for these models. First some definitions.
**Treatment main effect** tests whether there is any difference between groups, regardless of time. In a randomized trial, this could be of interest if we remove the baseline: is there a treatment difference when we consider 6 to 24 months? In a randomized trial we know there is no difference at baseline.
**Time main effect** tests if there is a change in outcome over time. This is usually not of interest unless the study has only one arm.
**Treatment by Time Interaction** tests if the change over time is moderated by the treatment group. This is usually the effect of interest in mixed models.
Sometimes, however, researchers are interested in more specific effect, like the treatment difference at the final time point, or change from baseline to the last time point. We will see how to test this using mixed models.
Other variables in the dataset, other than time and treatment, can be used as controls, confound or moderators.
## Random Intercept Model
**Random Intercept** refers to *subjects as random effects*, where each subject has its own intercept estimated. This is similar to adding subject as a categorical variable in a regression model, where you get a coefficient for each subject plus a reference subject. That is a lot of coefficients! Random subject effects instead assume those many coefficients are normally distributed and estimate the mean and standard deviation of that normal distribution, therefore estimating only two parameters. This is a big gain.
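To see the contrast, we can count the parameters when subject is entered as a fixed factor in an ordinary regression. A sketch, assuming the *long* data created above:

```{r}
# Subject as a fixed factor: one coefficient per subject (minus the reference),
# plus the intercept - far more parameters than the random intercept model,
# which summarizes them with a single mean and variance
fixed_subj <- lm(chol ~ factor(subject), data = long)
length(coef(fixed_subj))
```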
Here we use the *lmer* function from *lme4* package to fit the model. If you load *lmerTest* package, then you will use *lmer* function from that package, which is an *enhanced* function. Sometimes, if you encounter some issue with *lmer*, you can try to explicitly use the version from *lme4* package instead (by using *lme4::lmer*) to see if things are better.
So, here is the simplest random intercept model to test the time by treatment interaction in our data. It is good to first fit the model using Maximum Likelihood estimation (REML = FALSE) so that we can properly compare models using likelihood ratio tests and AIC. Then for the final model we can revert to Restricted Maximum Likelihood (REML = TRUE).
```{r}
m1 <- lmer(chol ~ period*treat + (1|subject),
data = long,
REML = FALSE)
```
Notice the specification of the random effect part, *(1|subject)*. As specified, it just means a simple model where subject is the random effect, but it can also be interpreted as subject being the clustering variable, with an intercept (average) fitted for each subject. So, you can think of it this way: you always need an effect variable, then the pipe, then the clustering variable. If the effect is just a "1" then we can say the clustering variable itself is the random effect.
Lets look at the results.
```{r}
summary(m1)
```
In the "Random Effects" part, we get estimates for the variance of the normal distribution fitted to the subject intercepts, as well as for the residuals. You can think of the subject variance as how much subjects differ from each other (Between-Subject Variance, BSV), and the residual variance as how much the outcome varies within individuals (Within-Subject Variance, WSV). Here the outcome varies more within individuals than between individuals, indicating a not very strong correlation between measures within individuals. A measure of this correlation is the Intra-Class Correlation (ICC), calculated as BSV/(BSV+WSV).
```{r}
performance::icc(m1)
796/(1017+796)
```
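Rather than hard-coding the variance estimates, we can also extract them from the fitted model with *VarCorr* and compute the ICC programmatically. A sketch:

```{r}
# Extract variance components from the fitted model
vc <- as.data.frame(VarCorr(m1))
bsv <- vc$vcov[vc$grp == "subject"]   # between-subject variance
wsv <- vc$vcov[vc$grp == "Residual"]  # within-subject (residual) variance
bsv / (bsv + wsv)                     # ICC = BSV / (BSV + WSV)
```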
You can look if this correlation makes sense in the wide data, where we can calculate the within subject correlation by simply running a correlation among time points.
```{r}
cor(select(dt2,m0:m24), use = "pairwise.complete.obs")
```
It makes sense: the ICC is roughly an average of the within-subject correlations.
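We can check this directly by averaging the off-diagonal correlations, which gives a rough approximation to the ICC:

```{r}
# Average of the pairwise correlations among time points (rough ICC check)
cm <- cor(select(dt2, m0:m24), use = "pairwise.complete.obs")
mean(cm[lower.tri(cm)])
```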
Other than that, we have the model coefficients, which are interpreted the same way as in regression models. However, in this model we are more interested in testing the interaction, which can be done with the *anova* function. Once you load the *lmerTest* package, the *anova* function will give you Type III sums of squares and Satterthwaite's degrees of freedom.
```{r}
anova(m1)
```
Here we see a significant interaction, which we can study using estimated marginal means.
### Marginal Means
We can use the *emmeans* function from the *emmeans* package to interpret the interaction effect.
```{r}
# just looking at emmeans for each treatment and time (good for a plot)
em1 <- emmeans(m1, ~treat*period)
em1
# Comparing treatment groups at each time point.
em2 <- emmeans(m1, pairwise~treat|period)
em2
# Comparing time points to baseline for each treatment.
em3 <- emmeans(m1, trt.vs.ctrl~period|treat)
em3
```
Once an interaction is significant, it is nice to show an interaction plot. Interpreting this graph is probably better than conducting many statistical tests to figure out where the difference is.
```{r}
ggplot(as.data.frame(em1), aes(x = period, y=emmean, color = treat))+
geom_point(position = position_dodge(width=0.2), size = 3)+
geom_errorbar(aes(ymin = lower.CL, ymax = upper.CL), width = 0.2,
position = position_dodge(width=0.2))+
ylab("Cholesterol Level +/- 95%CL")+
theme_base()+
theme(
legend.position = "bottom",
legend.title = element_blank()
)
```
### Linear Contrast
Once you look at the estimated marginal means table, you can use it to test contrasts calculated from it. For example, here is our EMM table.
```{r}
em1
```
Let's say we just want to test the difference between treatment and placebo at 24 months. These are the last two lines of the EMM table, so we can specify the difference we want like this.
```{r}
ctt <- contrast(em1, list(`(Treat-Placebo)@24 Mth` = c(0,0, 0,0, 0,0, 0,0, 1,-1)))
ctt
```
Or here we compute the difference in change from baseline to 12 months between the two treatment groups, as well as the change from baseline to 12 months within each group. Note that the coefficient positions follow the row order of the EMM table above (treatment varies fastest within each period).
```{r}
ctt <- contrast(em1,
list(`(12m-BL)@trt-(12m-BL)@pcb` = c(-1,1, 0,0, 1,-1, 0,0, 0,0),
`(12m-BL)@trt` = c(-1,0, 0,0, 1,0, 0,0, 0,0),
`(12m-BL)@pcb` = c(0,-1, 0,0, 0,1, 0,0, 0,0)))
ctt
```
## Improving the model
The random intercept model (or subject as random effect model) is the simplest model we could have. However, it may not account very well for the dependence structure in the data. By specifying additional random effect components, or by modeling the covariance structure, we can see if the model improves over the random-intercept-only model. We will go over some examples.
### Adding a random slope
Since we have 5 time points for each subject, we can try to see whether fitting a straight line for each subject, rather than just an intercept, improves the model. The straight line would be the time effect, with time treated as continuous.
```{r}
m2 <- lmer(chol ~ period*treat + (1 + time|subject),
data = long,
REML = FALSE)
```
The model did not reach convergence. Many times this will just be a warning and not of much importance; however, the *control* option allows us to try to get a model that achieves convergence. Here we get convergence by changing the optimizer.
```{r}
m2 <- lmer(chol ~ period*treat + (1 + time|subject),
control = lmerControl(optimizer = "bobyqa"),
data = long, REML = FALSE)
summary(m2)
```
Now we have two random effects at the subject level: the intercept and the slope, plus a correlation between them. A good way to see if this model is better than the previous one is to look at the AIC or use a likelihood ratio test via *anova*. Lower AIC is better, and the test also shows that the random slope model fits significantly better than the random intercept only model.
Although this model fits better than our previous one, the conclusions are the same (significant interaction).
```{r}
AIC(m1,m2)
anova(m1,m2)
```
### Heterogeneity of Variance
By plotting the model residuals against covariates and the fitted values we may visually inspect for evidence of non-constant variance, which is a violation of the mixed model assumption of homogeneous variance.
Here we see that the variance does not seem to be the same at different levels of the fitted (predicted) values.
```{r}
resdata <- m2@frame
resdata$residuals <- residuals(m2)
resdata$fitted <- fitted(m2)
ggplot(resdata, aes(x = fitted, y = residuals))+
geom_point()+
theme_minimal()
```
Here we see that the variance of the residuals seems to increase with time, more so in the treatment group.
```{r}
ggplot(resdata, aes(x = period, y = residuals))+
geom_boxplot()+
facet_wrap(~treat)+
theme_minimal()
```
Using the *lme* function from the *nlme* package, we are able to model these seemingly heterogeneous variances.
Let's first fit our best model so far with the *lme* function. Notice that the specification of the random effects is a little different here, and we need to specify what to do with missing values.
```{r}
g1 <- lme(chol ~ treat*period,
random = ~1 + time|subject,
method = "ML",
na.action=na.omit,
data = long)
summary(g1)
AIC(m2,g1) # same model as m2
```
Here we add a variance parameter for each period; that is, we allow the variance to differ across periods. The model does not seem to improve very much.
```{r}
g2 <- lme(chol ~ treat*period,
random = ~1 + time|subject,
method = "ML",
weights = varIdent(form = ~1|period),
na.action=na.omit,
data = long)
summary(g2)
AIC(g1,g2)
anova(g1,g2)
```
However, if we allow different variances per treatment then the model improves.
```{r}
g3 <- lme(chol ~ treat*period,
random = ~1 + time|subject,
method = "ML",
weights = varIdent(form = ~1|treat),
na.action=na.omit,
data = long)
summary(g3)
AIC(g1,g3)
anova(g1,g3)
```
Now, a model that allows for different variances at each period within each treatment group seems to be even better.
```{r}
g4 <- lme(chol ~ treat*period,
random = ~1 + time|subject,
method = "ML",
weights = varIdent(form = ~1|treat*period),
na.action=na.omit,
data = long)
summary(g4)
AIC(g3,g4)
anova(g3,g4)
```
Here we try another variance function, one that allows the variance to be a power function of the fitted values. However, the model with a random slope does not converge, so we drop the random slope. This model is not nested in our g4 model above, so we only look at the AIC (the anova function performs a likelihood ratio test, which is only valid for nested models).
```{r}
g5 <- lme(chol ~ treat*period,
random = ~1 |subject,
method = "ML",
weights = varPower(),
na.action=na.omit,
# control = lmeControl(opt = "optim", maxIter = 100),
data = long)
summary(g5)
AIC(g4,g5)
```
### Correlation Structure
It could be that the random effects in the model don't account very well for the dependence structure in the data. As we have noticed, data points within the same subject are likely to be correlated with each other. What a random effect does is model these correlations indirectly, to some extent. Although most of the time the random effects will do a good job, there may be unmodelled correlation left in the data, particularly when the data involve measurements in time or space and we use simple random effects.
Here we add an AR1 correlation structure for each subject. This structure models a situation where data points within the same subject are more correlated when they are closer together in time than when they are further apart. This is reasonable whenever we have repeated data points over time. However, in our case, it does not seem to improve the model.
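To make the AR1 idea concrete: with autoregressive parameter rho, the correlation between two measurements k occasions apart is rho^k. A small illustration with a hypothetical rho of 0.5 (not estimated from our data):

```{r}
# AR1 correlation matrix for 5 equally spaced occasions, rho = 0.5 (illustrative)
rho <- 0.5
ar1 <- rho^abs(outer(1:5, 1:5, "-"))
round(ar1, 3)
```

Notice how the correlation decays geometrically as occasions get further apart.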
```{r}
g4a <- lme(chol ~ treat*period,
random = ~1 + time|subject,
method = "ML",
weights = varIdent(form = ~1|treat*period),
correlation = corAR1(form = ~1|subject),
na.action=na.omit,
data = long)
summary(g4a)
AIC(g4,g4a)
anova(g4, g4a)
```
The AR1 structure adds a single auto-regressive parameter, which was estimated as -0.056 in the model above (very low, and we seem to do better without it). But we could be more flexible with the correlations within subject, and here we fit the most flexible correlation matrix, one that requires the model to estimate every correlation within subject.
With this we estimate many more parameters, and the model does not seem to improve.
```{r}
g4b <- lme(chol ~ treat*period,
random = ~1 + time|subject,
method = "ML",
weights = varIdent(form = ~1|treat*period),
correlation = corSymm(form = ~1|subject),
control = lmeControl(opt = "optim"),
na.action=na.omit,
data = long)
summary(g4b)
AIC(g4,g4b)
anova(g4, g4b)
```
At this point we may stick with our model g4, which allows for different variances depending on period and treatment group, but with no additional correlation modeling.
## Model Diagnostic
Like for linear models, mixed models also imply assumptions that should be checked.
### Linearity
If you have continuous predictors in the model, you should check the linearity of the association between each predictor and the outcome by creating a scatterplot of the outcome against the continuous predictor, or of the residuals against the continuous predictor.
Let's assume a model where we control for age. We don't see much evidence of a lack of linearity for age.
```{r}
g5 <- lme(chol ~ treat*period + age,
random = ~1 + time|subject,
method = "ML",
weights = varIdent(form = ~1|treat*period),
na.action=na.omit,
data = long)
ggplot(long, aes(x = age, y= chol))+
geom_point(color = "gray80")+
stat_smooth(method = "lm", color = "black", se = F)+
stat_smooth(se = F)+
theme_minimal()
```
### Normality
A visual inspection of the histogram of the residuals should give you a good idea about the existence of strong violations of the normality assumption. This is not a critical assumption, because these models tend to be robust to its violation.
```{r}
hist(residuals(g5))
```
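A normal QQ plot of the residuals is another common check; points falling close to the reference line support approximate normality:

```{r}
# QQ plot of the residuals as an additional normality check
qqnorm(residuals(g5))
qqline(residuals(g5))
```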
### Homogeneity of Variance
This is usually checked by plotting the residuals against the fitted values and other predictors. If we find visual evidence that the variance is not constant, we can try to improve the model by modeling the variance of the residuals using the procedures we have seen above. An alternative is to transform the outcome, for example with a logarithmic or square root transformation, at the cost of making the interpretation of results more complex.
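As a sketch of the transformation route, we could refit the g4 specification with a log-transformed outcome (here without the variance weights, on the assumption that the transformation itself may stabilize the variance). Coefficients are then on the log scale, so effects are roughly multiplicative:

```{r}
# Log-transformed outcome (sketch): variance often stabilizes, but
# coefficients are on the log scale and harder to interpret
g4_log <- lme(log(chol) ~ treat*period,
              random = ~1 + time|subject,
              method = "ML",
              na.action = na.omit,
              data = long)
plot(fitted(g4_log), residuals(g4_log))
```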
### Influence
Just like in linear models, it is possible to calculate how much each subject influences the model coefficients, and also how much each data point influences those coefficients.
The package *influence.ME* can calculate influence measures for mixed models fitted with the *lmer* function. Some other packages can calculate influence for mixed models fitted with the *lme* function; however, here we will use the *influence.ME* package. For that we need to go back to our model *m2*, which in terms of influence measures should not be very different from using our final model *g4*.
#### Data Point Level
Here we calculate and plot Cook's Distance, which is a measure of overall influence on the model. We calculate it at the data point level, one value per observation.
```{r}
datamodel <- m2@frame #dataset used in the model, that is removing missing values.
datamodel$cook <- cooks.distance(m2)
boxplot(datamodel$cook)
```
Probably the two values above 0.5 are the most concerning. Here we see that these are values that at baseline look too high or too low (compare with the histogram).
```{r}
datamodel %>% filter(cook>0.5) %>%
select(subject,chol,treat,period)
hist(datamodel$chol)
```
We can rerun the model without these two points and compare. These two points don't affect the general conclusion that the interaction is significant, but they do change the interaction coefficients quite a bit, so it would be interesting to show both results in a report.
```{r}
m2b <- lmer(chol ~ period*treat + (1+time|subject),
control = lmerControl(optimizer = "bobyqa"),
data = filter(datamodel, cook < 0.5) )
tab_model(m2,m2b,
title = "Comparison of models",
dv.labels = c("Full Data","Two data points removed"),
show.se = TRUE)
anova(m2b)
```
#### Subject Level
We can also calculate how influential each subject is. Here we calculate the Cook's Distance for each subject and plot it sorted.
```{r}
infl <- influence.ME::influence(m2, "subject")
plot(infl, which = "cook", sort = TRUE)
```
Ideally, we would be able to see which subjects have the highest values and remove them from the data, but here it is a little difficult. So we use a trick: store the graph, then extract the x and y coordinates.
```{r}
a <- plot(infl, which = "cook", sort = TRUE)
plotdata <- data.frame(subject = a$panel.args[[1]]$y,
cook = a$panel.args[[1]]$x)
tail(plotdata)
```
We see that subject 67 is the one with a very high Cook's distance. We can visualize this subject alongside the others to see if it gives us a clue why.
```{r}
long2 <- merge(long, plotdata)
ggplot(long2, aes(x = time, y = chol))+
geom_line(aes(group = subject), color = "gray75")+ # all subjects
geom_line(data = filter(long2, subject == 67),
aes(group = subject), color = "black")+ # high cook subject
facet_wrap(~treat)+
theme_base()
```
We see that this subject, in the placebo group, has cholesterol values that drop hugely, unlike what happens with the other subjects. This could deserve an investigation of the data.
Now it would be easy to also look at the other subjects with seemingly high Cook's distances. If we choose 0.04 as the threshold then there are an extra 3 subjects, all in the treatment group. Their trajectories vary a lot, but not clearly enough to be concerning.
We could then rerun the model without these subjects to see how much they are affecting it.
```{r}
ggplot(long2, aes(x = time, y = chol))+
geom_line(aes(group = subject), color = "gray75")+ # all subjects
geom_line(data = filter(long2, cook > 0.04),
aes(group = subject), color = "black")+ # high cook subject
facet_wrap(~treat)+
theme_base()
```
### Outliers
Outliers are data points with large standardized residuals; we can think of residuals beyond 3 in absolute value. Here we plot the residuals against the fitted values. We see one or two large residuals that we could investigate by removing them from the model.
```{r}
plot(residuals(m2, type = "pearson", scaled = TRUE)~fitted(m2))
```
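We can also identify those observations directly, assuming the model frame stored in *m2*:

```{r}
# Flag observations with standardized residuals beyond +/- 3
res <- residuals(m2, type = "pearson", scaled = TRUE)
m2@frame[abs(res) > 3, ]
```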
## Other Outcomes
We looked at the time by treatment interaction as outcome of interest. Here we will consider some different outcomes of interest.
### Difference at the end of the trial
If the goal is to test differences at month 24, then a model that does not include the baseline as an outcome, and that controls for baseline, should be preferred. Since the baseline value of the outcome tends to be highly related to the end-of-trial value, we gain power by controlling for baseline. Likewise, we can control for other variables associated with the outcome.
First we create a baseline cholesterol variable to be used as a control. We simply merge our dataset with itself after removing all time points except baseline, and removing all variables except the subject id and the cholesterol level.
```{r}
long <- long %>%
left_join(long %>%
filter(time == 0) %>%
select(subject, chol) %>%
rename(basechol = chol))
```
Now we fit the model without the baseline as outcome, but controlling for baseline. The model now has a boundary problem. In the results we can see that the correlation between the random slope and random intercept is -1.00, which indicates that the issue is probably that this model no longer needs a random slope (remember, we removed the baseline).
```{r}
k1 <- lmer(chol ~period*treat + basechol + age + sex +
(1 + time|subject),
control = lmerControl(optimizer = "bobyqa"),
data = filter(long, time>0))
summary(k1)
```
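lme4 also provides the *isSingular* function to confirm a boundary (singular) fit like the one flagged above; TRUE means the estimated random effects covariance is on the boundary of its parameter space:

```{r}
# Check whether the random effects fit is singular (boundary estimate)
isSingular(k1)
```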
We see that the model without the random slope has better AIC and BIC, and that adding the random slope does not improve the model significantly according to the likelihood ratio test.
```{r}
k2 <- lmer(chol ~period*treat + basechol + age + sex +
(1 |subject),
control = lmerControl(optimizer = "bobyqa"),
data = filter(long, time>0))
anova(k2,k1)
```
We can use the *emmeans* package to test the difference at the end of the trial. We see that the difference is significant at month 24.
```{r}
emmeans(k2, pairwise~treat|period)
```
### Difference at post
Sometimes the specific time point is not relevant, and the researcher is interested in the difference in the outcome at any point in the post period (that is, after baseline). This would be a simple treatment main effect test. In this case we would not be interested in the interaction.
```{r}
k3 <- lmer(chol ~period + treat + basechol + age + sex +
(1 |subject),
control = lmerControl(optimizer = "bobyqa"),
data = filter(long, time>0))
anova(k3)
```
### Difference in slopes
If a straight line is assumed to fit the average trajectory well, there could be interest in testing the difference in slopes. The only change is that we need to replace the variable "period" with "time" in the interaction, since time is the continuous version.
```{r}
k4 <- lmer(chol ~ time * treat + basechol + age + sex +
(1 |subject),
control = lmerControl(optimizer = "bobyqa"),
data = long )
anova(k4)
```
We can plot it.
```{r}
dataplot <- emmeans(k4, ~time*treat, at = list(time = c(0,6,12,20,24))) %>%
as.data.frame()
dataplot
ggplot(dataplot,(aes(x = time, y = emmean)))+
geom_line(aes(group = treat, color = treat, linetype = treat))+
theme_base()
```
The *visreg* package plots partial residuals and may also give a nice plot.
```{r}
visreg::visreg(k4, xvar = "time", by = "treat", overlay = TRUE)
```
We will stop here.
Mixed models are a very broad class of models, and the natural extension from here is the Generalized Linear Mixed Model, which can handle binary and count data, for example, among others. These can be fitted using the *glmer* function from the same *lme4* package.
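As an illustration only (the 240 mg/dL cutoff is a hypothetical example, not part of this workshop), a random intercept logistic model for a dichotomized outcome could look like:

```{r}
# Sketch: logistic mixed model for a hypothetical binary outcome
# (high cholesterol defined here, arbitrarily, as > 240 mg/dL)
long$high_chol <- as.numeric(long$chol > 240)
b1 <- glmer(high_chol ~ period*treat + (1|subject),
            family = binomial,
            data = long)
summary(b1)
```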
One can also see these models in the context of multilevel models, where you may have predictors at different levels and nested random effects. For example, you could have students within classes within schools, with class-level variables (like number of students, size, lighting...) and school-level variables (like average income in the region, size of the school, location of the school). These features make the model more complicated, but it is still fitted as a Mixed Effect Model.
Generalized Estimating Equations (GEE) are another class of models for longitudinal and correlated data, which can be fitted with the package *geepack*.
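A minimal GEE sketch, assuming the *long* data from above (geeglm expects complete cases sorted by the id variable):

```{r}
# Sketch: GEE with an exchangeable working correlation structure
library(geepack)
longc <- long %>% drop_na(chol) %>% arrange(subject, time)
gee1 <- geeglm(chol ~ period*treat,
               id = subject,
               corstr = "exchangeable",
               data = longc)
summary(gee1)
```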
There are several other packages in R that handle random effects.
That is it for this workshop, thank you for the interest!
Marcos