---
title: "Binary and Count outcomes"
author: "Marcos Sanches"
date: "`r Sys.Date()`"
output:
rmdformats::readthedown:
highlight: kate
toc_depth: 4
---
```{r setup, echo=FALSE, cache=FALSE}
library(knitr)
library(rmdformats)
## Global options
options(max.print=100)
opts_chunk$set(echo=TRUE,
cache=TRUE,
prompt=FALSE,
tidy=TRUE,
comment=NA,
message=FALSE,
warning=FALSE)
opts_knit$set(width=75)
```
# Objective
Last week we learned linear regression, which works well with continuous and normally distributed dependent variables. However, many times the dependent variables have other distributions.
Dichotomous outcomes are very common in the research we do at CAMH. For example, we may want to study remission, relapse, suicide attempt, smoking cessation and many other important outcomes. Logistic regression is usually the model of choice for this special type of variable.
Count outcomes are also very common. Examples are number of drinking days, number of visits in a month, number of cigarettes smoked in a day and number of drinks. If the counts are large they may be approximately normally distributed and a linear model may work well, but often they are small, with lots of zeros. For these we need count models such as Poisson and Negative Binomial regression.
In this workshop we will learn the basics of these models.
# A new RStudio project
Here are the steps to create a new RStudio project:
1. Create a folder for each data analysis project.
2. Open RStudio.
3. Go to File > New Project > Existing Directory
4. Navigate to the folder you created
5. Click Create Project. A “…Rproj” file will be created in the folder. Next time, you double click this file to open your project.
6. Go to File > New File > R Markdown (or R Script) and click Ok.
7. Save the new R Markdown file. Here you will store your R script.
8. Make sure you store all your scripts. Ideally, the script should contain every step from reading the data to the final results.
# Cleaning R environment
Again we clean the R environment. We can do that because we are confident the script we will create will reproduce all relevant analyses, so we don't need to keep any dataset or analysis results other than the original dataset!
```{r}
rm(list = ls())
```
# Reading a SPSS dataset
For today's workshop we will use the SPSS dataset "Drinking data.sav". We read it with package *haven*. Here is a short description of the data.
id - Unique identifier
group - Study Group (0 = Placebo, 1 = Vitamin D)
age - Age in years
eth - Ethnicity (1 African American, 2 Asian, 3 Caucasian, 4 Native American, 5 Other)
genotype - (0 AA, 1 Aa, 3 aa)
sex - (0 Male, 1 Female)
remission - Depression Remission (0 No, 1 Yes)
n_drinks - Total number of drinks on 3 last days of the trial
n_cig - Total number of cigarettes on 3 last days of the trial
n_drinking_days - Number of days in the follow up period with at least one drink
ndays - Number of follow up days.
```{r}
dt <- haven::read_sav("Drinking data.sav")
names(dt)
```
# Load packages
Here we load the packages that we will use in this workshop.
```{r}
library(tidyverse)
library(emmeans)
library(haven)
library(ggthemes)
library(sjstats)
library(sjPlot)
library(kableExtra)
library(sandwich)
library(lmtest)
library(estimatr)
library(splines)
```
As the data came from SPSS, we can use *as_factor* to transform some variables into factors and make interpretation easier. We also give the variables better names. It is important to ensure the variables we will use are in the proper format.
```{r}
dt2 <- dt %>%
mutate(gen = as_factor(genotype),
treat = as_factor(group),
rem = as_factor(remission),
Eth = as_factor(eth),
sex = as_factor(sex)) %>%
select(-genotype, -group, -eth)
```
# Logistic Regression
Logistic regression is used when we want to explain variables that are binary. In our dataset one such variable is **rem** which is the remission status of the subject at the end of the trial.
**Research Question** Is there a difference in the proportion of remission between the two treatment groups?
Let's start by assuming that the treatment group was randomized to subjects. If so then all we need is a simple regression model where the predictor of interest is the treatment group and the outcome is remission.
## Univariate Model
As we have only one predictor, this is a very simple logistic regression. There is not much difference between using such a model and a simple crosstable.
Before analyzing the data with our univariate model, let's get an idea of the **binomial distribution**, which is the distribution that characterizes the logistic model (as the normal distribution characterizes the linear model).
### The Binomial Distribution
The function *glm* with family = *binomial* and link = *logit* gives us traditional logistic regression.
**Binomial** - Linear models assume a normal distribution. Logistic regression assumes a *Binomial distribution* with a *logit* link, where the logit is the logarithm of the odds.
The Binomial distribution models the number of successes in a certain number (size) of events, given that each event has a probability *p* of success. For example, flipping a fair coin 10 times can be modeled by a Binomial distribution with size = 10 and p = 0.5. Let's simulate it in R.
```{r}
# 1 experiment
rbinom(n = 1,size = 10,p = 0.5)
# 100 experiments
rbinom(n = 100,size = 10,p = 0.5)
```
The situation where we have a 0/1 (Yes/No) outcome is a Binomial experiment with size = 1, and we want to model the probability p. Each subject is a new experiment.
```{r}
# Simulating one subject in our dataset. Assuming prob remission is 0.2.
rbinom(n = 1,size = 1,p = 0.2)
# Simulating 100 subjects, with probability of remission 0.2
rbinom(n = 100,size = 1,p = 0.2)
# Just checking quickly the proportion of 1.
sum(rbinom(n = 100,size = 1,p = 0.2))/100
```
In the synthetic data above, the probability of remission is 0.20. What are the odds of remission?
Odds = probability/(1 - probability)
= 0.2/(1-0.2)
= 0.25
The log-odds is the natural logarithm of the odds and is the target of the logistic model.
log-odds = ln(0.25) = -1.3863
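These conversions are easy to check in R, where *qlogis* and *plogis* are the built-in logit and inverse-logit functions:
```{r}
p <- 0.2
odds <- p/(1 - p)   # 0.25
log(odds)           # the log-odds, -1.3863
# qlogis() is the logit function, so it returns the same log-odds directly
qlogis(p)
# plogis() is the inverse logit, taking us back from log-odds to probability
plogis(log(odds))
```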
For small probabilities, say lower than 0.05, the odds and the probability are very similar. However, their difference increases as the probability increases.
This is important because traditional logistic models work with log-odds and odds; they don't look directly at probabilities.
Here we generate the same binomial distribution as above, but in a dataset. Then we estimate a logistic regression with intercept only (because the dataset does not have any variable other than remission).
```{r}
set.seed(2222)
data <- data.frame(remission = rbinom(n = 1000,size = 1,p = 0.2) )
head(data,20)
g1 <- glm(remission ~ 1, data = data, family = binomial(link = logit))
summary(g1)
```
We get a log-odds estimate of -1.34310. It is not very different from the -1.3863 we computed above. The difference is due to random fluctuation: we don't have exactly 200 remissions and 800 non-remissions.
```{r}
# Number of cases of remission.
sum(data$remission)
```
If the log-odds is -1.3431 then the odds is exp(-1.3431) = 0.261.
And if odds = p/(1-p) then p = odds/(1+odds) = 0.261/1.261 = 0.207, or 20.7%.
20.7% is the logistic regression estimate for the probability of remission.
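As a side note, these by-hand steps can be collapsed with *plogis*, R's inverse-logit function:
```{r}
# Inverse logit of the intercept gives the estimated probability directly
plogis(coef(g1)[1])
# Equivalent to exp(b)/(1 + exp(b))
exp(coef(g1)[1])/(1 + exp(coef(g1)[1]))
```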
Instead of doing all the calculations by hand, we can feed the logistic model to the *emmeans* function from the *emmeans* package to get probabilities (estimated marginal means for logistic regression). Type = "response" means that we want the result in terms of probabilities, not odds or log-odds.
```{r}
emmeans(g1, ~1, type = "response")
```
The traditional way to interpret logistic regression coefficients is by taking the exponential of the coefficients. That will give you odds ratios (or odds, if the coefficient is the intercept). However, probabilities are much more intuitive, and packages like *emmeans* can help you translate logistic regression results into probabilities.
### A simple table
Back to our research question - is the probability of remission different in Placebo and Vitamin D groups? Logistic regression will test that via log-odds.
But first we take a look at these variables.
```{r}
# Count
table(dt2$treat, dt2$rem, useNA = "always")
# Proportion.
prop.table(table(dt2$treat, dt2$rem),1)
```
### The model
Now we run our model that aims at testing if there is a difference in probability of remission between treatment groups.
```{r}
g2 <-glm(remission ~ treat, data = dt2, family = binomial)
summary(g2)
```
Here, -0.2267 is the difference in the log-odds of remission for Vitamin D **relative to** the reference group, which is Placebo. It is not significant, but let's assume it is, so that we can go ahead and try to interpret this log-odds.
This is difficult to interpret, so we exponentiate it.
```{r}
# We can exponentiate the coefficients of the model
exp(coef(g2))
# or use tab_model, from the sjPlot package
tab_model(g2)
```
Exponentiating the coefficient for the Vitamin D group turns the log odds ratio into an odds ratio. The odds ratio here is 0.80.
It means that odds(Vitamin D)/odds(Placebo) = 0.80. Since the odds ratio is lower than 1, the odds of remission are lower for Vitamin D than for Placebo.
The odds ratio is a measure of effect size: it indicates how much larger (or smaller) the odds of one group are relative to the other. In this case it is 0.80, i.e., 20% lower. The conclusion is that the odds of remission in the Vitamin D group are 0.80 times those in the Placebo group, or 20% lower than the odds of remission in the Placebo group.
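As a sanity check, the same odds ratio can be computed by hand from the 2x2 table, as the ratio of the within-group odds. This sketch assumes the second column of the table is remission = Yes and the second row is Vitamin D; check the printed table to confirm.
```{r}
tab <- table(dt2$treat, dt2$rem)
tab
# Odds of remission within each treatment group
odds <- tab[, 2]/tab[, 1]
# Odds ratio, Vitamin D relative to Placebo - should match exp(coef(g2))
odds[2]/odds[1]
```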
We can use *emmeans* function from *emmeans* package to calculate the log-odds, and what is more intuitive, the probability of remission in each group. Here we have the log-odds of remission in both groups, given by the model.
```{r}
emmeans(g2, ~treat)
```
And here we have the probabilities of remission in both groups.
```{r}
emmeans(g2, ~treat, type = "response")
```
It is always nice to present the results of the logistic regression using an errorbar plot with the estimated probabilities.
```{r}
plotdata <- emmeans(g2, ~treat, type = "response") %>% as.data.frame()
ggplot(plotdata, aes(x = treat, y = prob))+
geom_point(size = 2)+
geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = 0.1) +
ylim(0,0.25)+
ylab("Model estimated probability of remission")+
xlab("Treatment Group")+
ggtitle("Logistic Regression Model Results")+
theme_clean()
```
Since the logistic model has only one independent variable, it will be equivalent to a simple table. Notice that we get the same probabilities of remission.
```{r}
prop.table(table(dt2$rem,dt2$treat),2)
```
The logistic regression p-value will be close to the chi-square p-value. There is another test for tables that is not very popular but equally valid, the likelihood ratio test, which provides the same p-value as the logistic regression.
```{r}
# Chi-square test
chisq.test((table(dt2$rem,dt2$treat)))
# Likelihood Ratio Test
DescTools::GTest(dt2$rem,dt2$treat)
```
## Multivariate Model
Now let's assume the same research question in a scenario where the groups were not randomized and we need to control for age, sex, genotype and ethnicity. Now our model has several predictors: the predictor of interest and the controls.
**Note** - you need to make sure you have properly defined factor variables as factors and continuous variables as numeric before adding them to the model. That is how R knows which variables are categorical and which are continuous.
```{r}
g3 <- glm(remission ~ treat + age + sex + Eth + gen,
family = binomial, data = dt2)
summary(g3)
car::Anova(g3, type = 3)
tab_model(g3, show.se = TRUE,
title = "Logistic Regression Coefficient")
confint(g3)
```
### Interpretation
To interpret model coefficients, use the exponential of the coefficients and interpret them as odds ratios. *tab_model* from the *sjPlot* package gives you the exponentiated coefficients.
**Categorical Variables** - Coefficients are the odds ratios for the category relative to the reference group. Females have odds of remission 2.25 times those of males (or 125% higher than males). Asians have odds of remission 0.19 times those of African Americans (ref.), or 81% lower than African Americans. Aa have odds of remission 1.31 times those of AA, or 31% higher.
**Continuous Variables** - Here the reference is "one unit less". You can think of the odds ratio that results when you increase the continuous variable by one unit. The interpretation goes like this: the odds of remission are multiplied by 0.97 (i.e., decrease by 3%) for each additional year of age, on average. That is, the odds of remission for a 31-year-old are 3% lower than those of a 30-year-old, and the same holds for a 44-year-old relative to a 43-year-old.
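For changes larger than one unit, multiply the coefficient before exponentiating; for example, a 10-year age difference corresponds to an odds ratio of exp(10 × b). A quick sketch:
```{r}
b_age <- coef(g3)["age"]
# Odds ratio for a 1-year increase in age
exp(b_age)
# Odds ratio for a 10-year increase: exponentiate 10 times the coefficient
exp(10 * b_age)
```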
### Probabilities
But how about reporting probabilities instead of exponentiating the coefficients to get odds ratios? That is usually a great idea, and we can use the *emmeans* package and its function.
Here is the same plot as before, with probabilities of remission and their precision. Notice that the probabilities are adjusted for the controls in the model, so they are now different. We call them "marginal probabilities" or "adjusted probabilities".
```{r}
plotdata <- emmeans(g3, ~treat, type = "response") %>%
as.data.frame()
plotdata
ggplot(plotdata, aes(x = treat, y = prob))+
geom_point(size = 2)+
geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = 0.1) +
ylim(0,0.25)+
ylab("Estimated Marginal probability of remission")+
xlab("Treatment Group")+
ggtitle("Logistic Regression Model Results")+
theme_clean()
```
Here is the same graph for ethnicity. This way of presenting results is particularly interesting when we have many groups.
```{r}
plotdata <- emmeans(g3, ~Eth, type = "response") %>%
as.data.frame()
plotdata
ggplot(plotdata, aes(x = Eth, y = prob))+
geom_point(size = 2)+
geom_errorbar(aes(ymin = asymp.LCL, ymax = asymp.UCL), width = 0.1) +
ylim(0,0.80)+
ylab("Estimated Marginal probability of remission")+
xlab("Ethnicities")+
ggtitle("Logistic Regression Model Results")+
theme_clean()
```
## Grouped Logistic Regression
Remember that remission is analogous to *flipping a coin once*. Each subject either achieves remission or does not, and we are interested in factors that affect the probability (or odds, or log-odds) of remission, like treatment, age, genotype...
This is the most common situation, but there is also the situation where each subject is analogous to *flipping the coin multiple times*. If we follow a subject for 10 days and count the number of days with at least one drink, that is analogous to flipping the coin 10 times and counting the number of heads.
Notice that the number of drinking days in this example is limited to 10, and because of this it is not a count outcome, but a binomial one.
A logistic regression will still model the probability of a drinking day, as it modeled the probability of remission.
Here it is.
```{r}
g4 <- glm(cbind(n_drinking_days,ndays - n_drinking_days) ~
treat + sex + age + gen + Eth, data = dt2,
family = binomial)
summary(g4)
```
The interpretation here is the same as in the previous logistic model. For example, the log-odds of a drinking day is 1.06 lower in the Vitamin D group. Since exp(-1.06) = 0.346, the odds of a drinking day in the Vitamin D group are 0.35 times those in the Placebo group (or we can say the odds of a drinking day are 65% lower in the Vitamin D group than in the Placebo group).
But notice the line:
*(Dispersion parameter for binomial family taken to be 1)*
With grouped data like this, the dispersion (or variance) of the binomial distribution may differ from the variance in the data. This means we have to be careful with overdispersion or underdispersion.
If we divide the residual deviance by its degrees of freedom and the result is far from 1, we may have a problem with dispersion. In our case we have 1589/329 = 4.83, indicating overdispersion.
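Rather than reading the numbers off the summary, the deviance/df ratio can be computed directly from the fitted model:
```{r}
# Residual deviance over its degrees of freedom; values far above 1
# suggest overdispersion
deviance(g4)/df.residual(g4)
```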
When this happens, the usual approach is to use the "quasibinomial" distribution, which is not really a distribution, but a modification to the binomial distribution that allows its dispersion to change and adapt to the data.
```{r}
g5 <- glm(cbind(n_drinking_days,ndays - n_drinking_days) ~
treat + sex + age + gen + Eth, data = dt2,
family = quasibinomial)
summary(g5)
tab_model(g4,g5,show.se = TRUE,
title = "Comparison between model with and without dispersion parameter",
dv.labels = c("Binomial Model, no dispersion parameter",
"quasi-binomial model with dispersion parameter"))
```
Notice that now the dispersion parameter is estimated, that is, accounted for. The effect this has on the model is much larger SEs and confidence intervals, and as a consequence, larger p-values too.
It is safer to use the "quasibinomial" model whenever we have this kind of grouped data.
Here we can still use the *emmeans* package to transform the odds into probabilities. In this case the probability of a drinking day is 0.47 in the Vitamin D group and 0.72 in the Placebo group, after controlling for the other variables.
```{r}
emmeans(g5, ~treat, type = "response")
```
# Count data models
Count data are usually defined as outcome variables that cannot be lower than zero and often take low values. Large counts are also quite natural, but they often fit well enough with the linear normal model.
## Poisson Distribution
The most popular count distribution is the Poisson distribution, which we can easily simulate in R. We need to specify the "mean", in the "lambda" parameter.
```{r}
rpois(n = 100, lambda = 4)
```
And here we visualize its distribution. Notice that it is skewed to the right.
```{r}
hist(rpois(n = 1000, lambda = 4))
```
## The Poisson Model
So, whenever we have count data, we can try the Poisson model. It is naturally fit on the log scale.
```{r}
g6 <- glm(n_drinks ~ treat + sex + age + gen + Eth, data = dt2,
family = poisson(link = log))
summary(g6)
```
### Interpretation
The log number of drinks in the Vitamin D group is 0.075 higher than in the Placebo group (but not significantly higher at the 0.05 threshold). As age increases by 1 unit (1 year), the log number of drinks increases by 0.0186.
Since it is hard to understand effects in the log scale, here too we exponentiate the coefficients in order to get a better interpretation.
```{r}
tab_model(g6,title = "Poisson Model")
```
Here we can say that the number of drinks is 1.08 times higher in the Vitamin D group, or 8% higher; 9% lower among females compared to males; and increasing by 2% for every extra year of age.
### Overdispersion
Notice that here too the dispersion parameter is fixed at 1; it is not estimated.
```{r}
summary(g6)
```
The residual deviance divided by its degrees of freedom is much closer to 1, indicating that we don't have much of a problem with dispersion:
414/328 = 1.26
But still it is safer to account for the dispersion parameter, which we can do by using the "quasipoisson" family.
```{r}
g7 <- glm(n_drinks ~ treat + sex + age + gen + Eth, data = dt2,
family = quasipoisson(link = log))
summary(g7)
```
We see that this time the increase in the standard errors of the coefficients is small.
```{r}
tab_model(g6,g7,show.se = TRUE, digits = 3,
title = "Comparison between model with and without dispersion parameter",
dv.labels = c("Poisson Model, no dispersion parameter",
"quasi-poisson model with dispersion parameter"))
```
# The Negative Binomial Model
The negative binomial distribution models the number of times you have to flip a coin until you obtain a certain number of heads. For modeling purposes, though, what matters is simply that it handles count data, just like the Poisson model. Some people prefer it because it does not have an overdispersion problem: the negative binomial distribution includes a dispersion parameter (the Poisson and binomial distributions do not). Like the Poisson, it also operates on the log scale by default.
It can be fit using the *glm.nb* function from the *MASS* package.
```{r}
g8 <- MASS::glm.nb(n_drinks ~ treat + sex + age + gen + Eth, data = dt2 )
summary(g8)
```
Remembering that the negative binomial model is on the log scale, the interpretation of its coefficients is the same as in the Poisson model.
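As with the Poisson model, exponentiating the coefficients gives multiplicative effects on the expected count:
```{r}
# Exponentiated coefficients: rate ratios for the expected count
exp(coef(g8))
# Exponentiated confidence intervals
exp(confint(g8))
```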
# Model assumptions
We will list here a few things to be aware of, when fitting any of these models.
## Linearity
Continuous variables like age are modeled as having a linear association with the log-odds or log-count. As in the linear model, you can check this assumption with a scatterplot of log-count against age and visually inspect whether it looks linear.
Here we see that the linear model does not really depart much from the loess fit, so that a linear fit looks okay.
```{r}
ggplot(dt2, aes(x = age, y = log(n_drinks)))+
geom_point(color = "gray80")+
stat_smooth(method ="lm", se = F)+
stat_smooth(se = F)+
theme_bw()
```
For the log-odds, with a 0/1 outcome, it is a bit more complicated, because we cannot calculate the log-odds for each subject and plot it against age. Remember that odds = p/(1-p), but if p is either 0 or 1, the odds will be either 0 or infinite.
One thing we can do, if we have a reasonable sample size, is to split the sample into age groups and, for each group, calculate p and the log-odds. Then we can plot them to see if the relationship looks linear.
Here is an example.
```{r}
# We first create age groups based on quantiles, so that each group has a similar sample size.
# include.lowest = TRUE keeps the minimum age from becoming NA.
dt2$cage <- cut(dt2$age, breaks = quantile(dt2$age, probs = seq(0, 1, 0.1)),
include.lowest = TRUE)
# For each age group we calculate the probability, then the odds.
agg <- dt2 %>%
group_by(cage) %>%
summarise(p = mean(remission),
age = median(age), #we will use the median age in each group for the plot
N = n()) %>%
ungroup() %>%
mutate(logodds = log(p/(1-p))) %>%
filter (p>0,p<1) # if p=0 or 1, the log odds cannot be calculated.
# Take a look at the aggregated data.
agg
# Now we can plot.
ggplot(agg, aes(x = age, y = logodds))+
geom_point(size = 3, color = "gray60")+
stat_smooth(method ="lm", se = F)+
stat_smooth(method ="lm", se = F, formula = y ~ x + I(x^2))+
# stat_smooth(se = F)+ # does not work well with few points.
xlab("Grouped age")+
ylab("Log-Odds of Remission")+
theme_bw()
```
This is an interesting situation because there is some evidence of a non-linear relationship. We can try adding a quadratic age term to the model and comparing it with the model with only the linear term.
```{r}
g3a <- glm(remission ~ treat + age + I(age^2) + sex + Eth + gen,
family = binomial, data = dt2)
summary(g3a)
AIC(g3,g3a)
anova(g3, g3a, test = "LRT")
```
We see that neither does the AIC decrease, nor is the Likelihood Ratio test significant, meaning that adding the quadratic term does not seem to improve the model.
We could also try cubic splines. Again, they do not seem to improve the model. However, if age is a control variable, it is not a bad idea to use a model more flexible than the linear one even if the non-linearity is not significant (non-significance does not mean non-linearity does not exist; it means we don't have strong evidence for it, but it could exist. We can make sure we account for it, if it exists, by using a spline model, for example).
```{r}
g3b <- glm(remission ~ treat + bs(age) + sex + Eth + gen,
family = binomial, data = dt2)
summary(g3b)
AIC(g3,g3b)
anova(g3, g3b, test = "LRT")
```
## Outliers and Influential points
Outliers are points with large residuals, while influential points are those that affect the model more than expected. If a point is an outlier but does not affect the model, that is not a big deal. However, influential points are always concerning to the extent that they influence the model.
We can follow the same lines as what we did for the linear model, but here I will just take a look at Cook's Distance, a measure of influence. This can be done for count or dichotomous data models.
Here we take a look at our logistic regression.
```{r}
dt2$cook <- cooks.distance(g3)
th <- 4/(g3$df.residual) # Rule of Thumb for threshold for Cook's Distance
th
ggplot(dt2, aes(x = cook))+
geom_histogram(bins = 50)+
theme_minimal()
```
In a visual inspection, points with Cook's Distance above roughly 0.03 might be flagged as concerning. The rule of thumb gives a lower threshold of 0.012.
As you can see, the rule-of-thumb threshold flags many points as problematic. But if there are many such points, it is hard to call them atypical, and removing them all would itself be a concern. For this reason I tend to prefer the visual inspection.
We could now refit the model without the influential points and see if our conclusions change. We could do the same for leverage and residuals.
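Here is a sketch of such a sensitivity check, refitting the logistic model without the points flagged by the rule-of-thumb threshold (this assumes no factor level disappears when those rows are dropped; treat it as a robustness check, not the main analysis):
```{r}
# Refit the model excluding observations above the Cook's Distance threshold
g3_sens <- glm(remission ~ treat + age + sex + Eth + gen,
family = binomial, data = filter(dt2, cook <= th))
# Compare coefficients side by side
cbind(full = coef(g3), no_influential = coef(g3_sens))
```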
## Zero cells
One issue we often find with logistic regression is that of zero cells, also called complete separation. This can happen, for example, if one category of a categorical variable has only outcome = 0 or only outcome = 1. For example, we have a small sample of Ethnicity = Native American, of size 17. If all 17 Native Americans had remission = 0, or all had remission = 1, we would probably have issues with our logistic model.
To avoid this issue we can group categories of variables where the sample size is small.
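A quick way to screen for zero cells before fitting is to cross-tabulate each categorical predictor with the outcome; any zero in these tables signals a potential separation problem:
```{r}
# Look for empty cells in predictor-by-outcome tables
table(dt2$Eth, dt2$remission)
table(dt2$gen, dt2$remission)
table(dt2$sex, dt2$remission)
```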
## Overdispersion
This is a very important issue that we often have with Poisson regression and with logistic regression on aggregated data. You need to keep an eye on it, because overdispersion can easily make you declare an effect significant when it is not (it causes the model to underestimate the standard errors, i.e., overestimate the precision of the coefficients).
## Effects for continuous variables
All these models are non-linear; they are either on the log scale (count) or the log-odds scale (dichotomous). Because of this, the effects of continuous variables are better presented in graphs, as looking only at model coefficients may not give a full picture of the effect over the range of the continuous predictor.
### Predicted Value - logistic model
One way to look at the effect is by using predicted values. For example, let's take the age effect from our logistic model.
Here we run the model again.
```{r}
g3 <- glm(remission ~ treat + age + sex + Eth + gen,
family = binomial, data = dt2)
summary(g3)
```
One way to visualize the effect of age is to look at the probability of remission predicted by the model at each age. We can do that by building a new dataset where only age varies and the other predictors are held constant, so that we see what happens as age varies. We can use the function *expand.grid* for that.
Notice that only age varies. All the other variables are fixed at a value, which does not matter much (you can try different values and see); for categorical variables it is common to choose the group with the largest sample size, and other continuous variables are usually fixed at their means.
```{r}
# data set for predicted value
pdata = expand.grid(treat = "Vitamin D",
age = seq(from=18,to=80,by=1),
sex = "Female",
Eth = "Caucasian",
gen = "Aa")
pdata
```
Basically, we are assuming a set of subjects that are equal on every variable except age. Now we use the model to calculate the predicted probability of remission for each of these subjects.
```{r}
pdata$prob <- predict(g3, newdata = pdata, type = "response")
pdata
```
We can see that as age increases, the probability of remission decreases. But this is better seen in a plot.
```{r}
ggplot(pdata, aes(x = age, y = prob))+
geom_line()+
ylim(0,1)+
ylab("Predicted Probability of Remission")+
theme_minimal()
```
### Predicted value - quasipoisson model
We can do exactly the same for the Poisson and quasipoisson models. First we create a dataset where only the variable of interest, age, varies.
```{r}
# data set for predicted value
pdata = expand.grid(treat = "Vitamin D",
age = seq(from=18,to=80,by=1),
sex = "Female",
Eth = "Caucasian",
gen = "Aa")
pdata
```
Then we predict the number of drinks.
```{r}
pdata$pdrink <- predict(g7, newdata = pdata, type = "response")
pdata
```
Then we plot it. We can see that as age increases, the number of drinks also increases, but not linearly.
```{r}
ggplot(pdata, aes(x = age, y = pdrink))+
geom_line()+
ylim(0,10)+
ylab("Predicted Number of drinks")+
theme_minimal()
```
### The "visreg" package
The "visreg" package has an interesting feature that allows you to show the data points in the form of partial residuals. Here we see the same graph as above, but with the points.
```{r}
visreg::visreg(g7,xvar="age", scale = "response", partial = TRUE,
xlab = "Age",
ylab = "Number of Drinks")
```
### The emmeans package
Likewise, we can use "emmeans" package to get marginal means at different age values.
```{r}
dataplot <- emmeans(g7, ~ age, at = list(age = seq(10,80,5)), type = "response") %>%
as.data.frame()
dataplot
```
Then plot it. It is the same as above.
```{r}
ggplot(dataplot, aes(x = age, y = rate))+
geom_line()+
ylim(0,10)+
xlab("Age in Years")+
ylab("Estimated Marginal Means - Number of drinks")+
theme_minimal()
```
# Interactions
Interactions in regression models have to do with the study of effect modifiers, or moderators.
For example, the researcher could have some reason to think that the treatment would improve remission but that this effect would be moderated by sex, because there are theoretical reasons to believe so (meaning that we don't want to go around testing all possible interactions without a reasonable justification).
**Research Question** - Does sex moderate the treatment effect on remission? Does the effect of treatment depend on the sex of the person to whom it is administered? Does the treatment effect vary depending on sex?
These are all the same question, which can be addressed by an interaction between treatment group and sex in the regression model. By using a "*" between treat and sex we get R to add the treatment and sex main effects plus the treatment-by-sex interaction.
```{r}
options(max.print = 500)
g3i <- glm(remission ~ treat*sex + age + Eth + gen,
family = binomial, data = dt2)
summary(g3i)
car::Anova(g3i, type = 3)
```
Here we don't have much evidence of such an interaction (it is not significant at 0.05). But let's pretend it was significant; the next step would then be to try to understand why. How does the treatment effect vary between sexes?
It is a little difficult to study the interaction using model coefficients. It is much better to use estimated marginal means, or in this case, probabilities of remission.
Here are the EMM:
```{r}
emmeans(g3i, ~treat*sex, type = "response")
```
We get the estimated remission probability for both treatments within each sex, and it is not difficult to see what the difference is. For males, the probability of remission is lower for Vitamin D; for females, it is higher. It is as if the treatment works only for females. But keep in mind that this is just an exercise: the interaction was not significant.
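We can also formally test the treatment contrast within each sex by applying *pairs* to the emmeans object; on the response scale these contrasts are reported as odds ratios:
```{r}
# Treatment contrast within each sex, reported as odds ratios
pairs(emmeans(g3i, ~treat | sex), type = "response")
```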
It is even better to visualize it in a plot.
```{r}
plotdata <- emmeans(g3i, ~treat*sex, type = "response") %>% as.data.frame()
plotdata
ggplot(plotdata, aes(x = sex, y = prob))+
geom_point(aes(shape = treat, color = treat), size = 2,
position = position_dodge(width= 0.2))+
geom_errorbar(aes(ymin = asymp.LCL,ymax = asymp.UCL, color = treat),
width = 0.1, position = position_dodge(width= 0.2))+
ylab("Estimated marginal probability of remission")+
theme_minimal()
```
We can easily see now that even though the probability of remission is lower in the Vitamin D group for males and higher for females, this difference is small compared to the much wider confidence interval bars.
When continuous variables are involved in the interaction, it is a little bit trickier.
Let's assume the research question has to do with age as the moderator of the treatment effect, instead of sex. Here is the model.
```{r}
g3j <- glm(remission ~ treat*age + sex + Eth + gen,
family = binomial, data = dt2)
summary(g3j)
car::Anova(g3j, type = 3)
```
In this case the age-by-treatment interaction is significant at the 0.05 level. The next question is then: how does age moderate the treatment effect?
Using emmeans, we can calculate the probability of remission in both treatments, at different ages, according to the model.
```{r}
emmeans(g3j, ~treat*age, at = list(age = seq(15,80,5)), type = "response")
plotdata <- emmeans(g3j, ~treat*age, at = list(age = seq(15,80,5)), type = "response") %>%
as.data.frame()
plotdata
ggplot(plotdata, aes(x = age, y = prob))+
geom_point(aes(shape = treat, color = treat), size = 2,
position = position_dodge(width= 2))+
geom_errorbar(aes(ymin = asymp.LCL,ymax = asymp.UCL, color = treat),
width = 0.1, position = position_dodge(width= 2))+
theme_minimal()
```
The interpretation of the graph would be that at younger ages the treatment does not seem to work; in fact, Placebo seems to do better. However, as individuals get older, Vitamin D starts to outperform Placebo in the probability of remission.
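Another way to look at this interaction is *emtrends* from the emmeans package, which estimates the age slope (on the log-odds scale) separately within each treatment group and lets us test whether the slopes differ:
```{r}
# Age slope (log-odds scale) within each treatment group
emtrends(g3j, ~treat, var = "age")
# Test whether the two slopes differ
pairs(emtrends(g3j, ~treat, var = "age"))
```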
Okay folks, that is it for today!
Thanks for the interest,
Marcos