We have a mailing list! This is an effort to create a CAMH community that is interested in all things Statistics, Data, Evidence and Research in general.
Currently our mailing list is used to send a weekly update, where we include everything we find that may be of interest to researchers at CAMH. We have a Workshop section, where we announce upcoming workshops that we offer, among other events. We have a Resources section, where we add things we stumble upon on the internet (papers, blogs, videos) that are related to statistics or research and may be of interest to you. We have a Statistical Education section, where we talk briefly about a new topic each week, usually related to the type of biostatistics that researchers at CAMH use. And we have a Software section, where we try to keep you informed about software, maybe a nice script or a nice R package, for example.
You can subscribe to our mailing list by going to this website!
Below we have a list of the topics that appeared in Stats News.
054 (28/April/2020) – Centering predictors in Regression Models
This week we got a couple of questions about centering variables in regression models. This subject comes up all the time, so we will comment on it briefly.
First, let's define centering. It is usually defined as the transformation of subtracting the mean from the original variable. Say you have the variable Age. You first calculate the average age of your sample; say it is 35 years. Then, for each subject, you subtract 35 from their age. So, a subject who is 25 years old becomes -10 and a subject who is 50 becomes 15.
What happens is that the average of the transformed Age will now be 0, while its standard deviation will not change.
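As a quick illustration (the numbers here are made up), this is what centering looks like in a small Python sketch:

```python
import numpy as np

# Hypothetical ages for a small sample; their mean is 35
age = np.array([25.0, 30.0, 35.0, 40.0, 45.0])

# Centering: subtract the sample mean from each value
age_centered = age - age.mean()  # -10, -5, 0, 5, 10

print(age_centered.mean())                        # 0.0: centered mean is zero
print(np.isclose(age.std(), age_centered.std()))  # True: SD is unchanged
```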
In general, centering does not change the model's inferential results: you will get the same p-values as with the non-centered versions of the variables. However, if you have continuous variables involved in interactions, centering can make them easier to interpret; in particular, the main effects of variables involved in interactions become interpretable when the variables are centered.
Imagine that you have Age and BMI, and you centered age at its mean of 35 and BMI at its mean of 28. And you have a linear regression model with only these two variables predicting depression score (DS).
If the model does not involve an interaction, then the model intercept is interpretable as the average DS for subjects with Age and BMI at 0 (that is, the average Age of 35 and BMI of 28). If you had not centered Age and BMI, the intercept would not be meaningful, because no subject has Age = 0 and BMI = 0 in the non-centered versions of the variables.
If the model has an Age * BMI interaction, then the interaction coefficient is how much the Age effect changes when the BMI value moves from 0 (that is, 28) to 1 (that is, 29). The way I think of it (this may be helpful to some) is that if you set BMI = 0, the Age effect is just the Age main effect coefficient, because the Age * BMI term becomes Age * 0 and vanishes. If you set BMI = 1, the Age * BMI term becomes Age * 1, so the interaction coefficient acts as another Age coefficient (you also still have the BMI main effect coefficient, of course). So, for BMI = 1, the Age effect is the sum of the Age main effect coefficient and the Age * BMI coefficient. You can then see that the Age * BMI coefficient is just how much the Age slope increases when you move from BMI = 0 to BMI = 1.
The above paragraph plays out basically the same way if we focus on the BMI effect instead of Age: the Age * BMI coefficient is also how much the BMI slope changes when we move from Age = 0 (35) to Age = 1 (36). If Age and BMI were not centered, these interpretations would not be meaningful.
So, in short, if both Age and BMI are centered, the main effect of Age is the slope of Age for subjects at centered BMI = 0 (that is, at the average BMI of 28), and the Age * BMI coefficient is how much that slope changes when you go to centered BMI = 1 (that is, BMI = 29). You can easily get a sense of how much the slope of Age (the effect of Age) changes as BMI changes (how BMI moderates the effect of Age), and of what the effect of Age is for an average-BMI individual (the Age main effect coefficient).
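To see these points in action, here is a small simulation in Python (the effect sizes and data are entirely made up). It checks two things stated above: centering does not change the interaction coefficient, and the centered Age main effect equals the uncentered Age coefficient evaluated at the average BMI:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(35, 10, n)  # hypothetical Age values
bmi = rng.normal(28, 4, n)   # hypothetical BMI values
# Hypothetical depression score with an Age * BMI interaction plus noise
ds = 2.0 + 0.3 * age + 0.5 * bmi + 0.02 * age * bmi + rng.normal(0, 1, n)

def fit(a, b, y):
    # Ordinary least squares: intercept, two main effects, interaction
    X = np.column_stack([np.ones_like(a), a, b, a * b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, a, b, a*b]

raw = fit(age, bmi, ds)
cen = fit(age - age.mean(), bmi - bmi.mean(), ds)

# The interaction coefficient is the same either way...
print(np.isclose(raw[3], cen[3]))  # True
# ...and the centered Age main effect is the Age slope at average BMI
print(np.isclose(cen[1], raw[1] + raw[3] * bmi.mean()))  # True
```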
An important point is that centering does not have to be at the average; it can be at any value that is relevant. For example, I could center Age at 50 instead of at its average of 35 if I am more interested in the effects of variables at 50 years of age rather than at the average age.
Most of the time the models I see at CAMH have no interaction, and so there is not much point in centering predictors. That said, centering is hardly harmful: it may take you some extra time, and you have to remember that you are working with centered BMI (so that BMI = 25 will probably not exist in the new scale), but those don't tend to be consequential things.
Finally, I want to discuss one contentious point: the claim that centering helps with multicollinearity. That is the case only if you have interactions in the model but are still interested in the main effects. The coefficients of the interaction, and their p-values, will not change whether you center the variables involved or not, but the coefficients of the main effects will. In models with interactions you will, in general, get more precise estimates of the main effects if you center your variables. However, most of the time you are really interested in the interaction, not the main effects, and then centering will not make any difference. Again, centering your variables will not harm you. A paper I like that talks about centering and multicollinearity is this one. As evidence of the contentiousness of the subject, someone criticized the paper and the authors published this short follow-up. These papers are nice if you want to get a little deeper into the subject.
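As a small illustration of the multicollinearity point (with made-up numbers), the Python sketch below shows that a raw predictor is strongly correlated with its own interaction term, while the centered versions are nearly uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.normal(35, 10, 1000)  # hypothetical Age
bmi = rng.normal(28, 4, 1000)   # hypothetical BMI, independent of Age

# Correlation between the raw predictor and the raw interaction term
r_raw = np.corrcoef(age, age * bmi)[0, 1]

# Same thing after centering both variables
ac, bc = age - age.mean(), bmi - bmi.mean()
r_cen = np.corrcoef(ac, ac * bc)[0, 1]

print(round(r_raw, 2))  # high: near-collinearity with the interaction
print(round(r_cen, 2))  # near zero after centering
```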
We intend to follow up soon with a companion to this text, where we want to talk a little bit about changing the scale of variables (as when you divide by the standard deviation and standardize, rather than center, the variable).
053 (21/April/2020) – Time Series Analysis
Today we will give a brief introduction to Time Series Analysis. It is not very common among the things we do at CAMH, and so we find that many folks are unaware of this statistical methodology, and sometimes mistake it for techniques related to more usual types of longitudinal data analysis. Here we will explain a little bit of it, and in the next section we provide a nice resource. A nice text on Time Series Analysis in R can be found here.
Longitudinal analysis usually refers to techniques that handle data correlated in time, but with few time points. Its goal is pretty much never projection into the future (forecasting), as we have little statistical information for that with few data points in time. Instead, the goals of longitudinal analyses are most often to understand change in outcomes between specific time points. This can be done in different ways: by comparing outcome changes or trajectories between groups, like intervention and control; by simply estimating changes over time, like from the beginning to the end of a trial; by gathering evidence for effects of moderators on outcome changes or trajectories; by studying patterns of trajectories; etc.
Although the term "trajectories" is often used in the context of longitudinal data analysis, these are not long sequences of data points: a trajectory may consist of just two time points, and we rarely see trajectories with more than 4 or 5. Data are also usually collected across many sampling units, typically human subjects (that is, you have many trajectories in a single data set, one for each subject).
Sometimes, though, you may face a sequence of data in time composed of many time points, say, 30 or more. That is when you think about Time Series Analysis. Because repeated data collection for the same subject used to be expensive, time series data are traditionally not collected at the subject level. For example, you can have the daily number of Emergency Department visits as a time series. Or monthly sales of a certain medication. Or the weekly number of traffic crashes in Toronto. Or the number of people entering a subway station every 10 minutes.
You can see that once you have such a series of data points, the type of analysis you can do needs to be very different from the usual regression-based longitudinal data analysis. Even though regression models are so flexible that you can still use them for time series analysis, specific models have been developed for this kind of data, namely Time Series Analysis techniques. The usual main goal is also different, as it relates to forecasting the future rather than to specific data points and specific changes, unless you are interested in studying interventions at specific time points and how they affect the series. For example, one could be interested in studying the effect of cannabis legalization on a time series of monthly cannabis consumption.
Probably the most popular time series models are the ARIMA models. They tend to work well in many situations. The idea is to build an equation that models the data point at the current time t (the dependent variable) using the previous data points at times t-1, t-2, t-3, … as predictors, as well as the previous error terms e(t-1), e(t-2), e(t-3), …. Since data points lined up in a sequence tend to be highly correlated, so will the error terms be, and this correlation means that the errors (residuals) may carry useful information about the next data point. You can see that in such models each point is used twice: first as the outcome, and then as a predictor for future data points.
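To give a concrete feel for the autoregressive part of the idea, here is a minimal Python sketch (simulated data, and only an AR(1) term, without the moving-average or differencing parts of a full ARIMA model): we generate a series where each point depends on the previous one, then recover that dependence by regressing each point on its predecessor:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
phi = 0.7  # true autoregressive coefficient

# Simulate an AR(1) series: y[t] = 0.7 * y[t-1] + random error
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# Regress each point on the previous point (OLS slope through the origin)
y_lag, y_now = y[:-1], y[1:]
phi_hat = (y_lag @ y_now) / (y_lag @ y_lag)

print(round(phi_hat, 2))  # close to the true value of 0.7
```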
ARIMA models can be extended to include seasonal effects as well as intervention effects. Imagine that you have daily visits to the Emergency Department at CAMH as your time series data. A time series model may show that the number of visits tomorrow is correlated with the number of visits today, possibly yesterday, possibly 7 days earlier (a weekly seasonal effect), and maybe even 365 days earlier (a yearly seasonal effect – this will be the case if you notice seasonal patterns in the data, like more visits in the winter or summer). On a given day you may add a reception desk to help visitors, and once you have a reasonable amount of daily data after such an intervention you can test it, that is, see whether adding the desk changed the series in any way. Besides the series of daily visits, maybe you also have daily average temperatures or rainfall, and there are models that can be used to test those effects, including lagged effects (for example, today's temperature may affect tomorrow's number of visits).
Even though there are models that deal with many time series simultaneously, that has not been the usual case. However, with data collection becoming easier and more affordable, such situations are becoming more mainstream. As an example, think of wearable devices that give you data every minute, for 200 subjects, over a period of 10 days. Data like this can be explored in different ways, but we will have to go outside the traditional ARIMA model. One possibility is the so-called Dynamic Structural Equation Models. But we may also get very insightful results from simple descriptive analyses of trajectories, by studying changes, slopes, peaks, visualizations, etc. When you have many subjects, the goal can be extended from forecasting to comparing groups of subjects, and to some sort of personalized medicine.
Okay, we did not intend to get too deep into this, but just to give you a sense of what kind of data can be handled with these models, as well as the types of research goals. So, don't hesitate to reach out if you think you can use time series but are not sure how.
052 (14/April/2020) – Measurement Error
The journal Statistics in Medicine has published two papers in its Tutorial in Biostatistics section that bring a comprehensive treatment of measurement error. The first paper introduces different types of measurement error and the second paper looks at advanced methods to deal with it. We will talk a little bit about the topic, but by no means could we cover the full content of the papers in any reasonable way, because they are advanced and long. Instead, we will just make you aware of some points we think are most important for our research.
Measurement error is what the name implies: when you collect your data, you are unable to measure things precisely, for various reasons. The extent to which your measure is imprecise will affect your statistical analysis, and the effect is always negative, since measurement error is a source of uncertainty: the less of it, the better.
Maybe the most intuitive consequence is that you lose power. Let's say that you want to measure subjects' ages, but for some reason (like subjects not being comfortable telling you their age) you don't get the ages precisely right. This will make it harder for you to detect effects related to age, that is, it will decrease power.
However, I would say that losing only power is the best-case scenario. If subjects tend to understate their age, for example, then what you get is not just an age variable that is imprecise in random ways, but an age variable that is biased downwards, and the consequences can go beyond losing power to outright biased estimates and conclusions from the statistical analysis. That can be the case even if your measurement error in age is purely random (some subjects understate and others overstate their age).
The classical example of a problem caused by measurement error is when you collect an imprecise age variable and use it as a regression predictor. One of the rarely mentioned regression assumptions is that the predictors are measured without error, with error present only in the outcome. Measurement error in age will cause the coefficient of age in a linear model to be biased downwards, and to the extent that age is associated with other predictors in the model, it will also affect their coefficients. This is problematic: even if you can measure your predictor without bias, coefficients of linear models will be biased, and it is hard to say what happens if the model is not linear.
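Here is a small Python simulation (with made-up numbers) of that downward bias, often called attenuation: with classical random error in the predictor, the estimated slope shrinks roughly by the reliability factor, Var(true) / (Var(true) + Var(error)):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
true_age = rng.normal(35, 10, n)                 # Var(true) = 100
ds = 1.0 + 0.5 * true_age + rng.normal(0, 2, n)  # true slope is 0.5

# Classical measurement error: we observe age plus independent random noise
noisy_age = true_age + rng.normal(0, 10, n)      # Var(error) = 100

def slope(x, y):
    # Simple-regression slope: Cov(x, y) / Var(x)
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

s_true = slope(true_age, ds)
s_noisy = slope(noisy_age, ds)

# Reliability = 100 / (100 + 100) = 0.5, so the noisy slope should
# be attenuated toward roughly 0.5 * 0.5 = 0.25
print(round(s_true, 2))   # close to the true 0.5
print(round(s_noisy, 2))  # close to 0.25
```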
If you want to adjust for the measurement error in age (or in a regression predictor in general), you will need to know something about that measurement error. When measurement error does not exist, the reliability is 1, that is, your measure has perfect reliability. So, knowing the reliability of a predictor is one way to know something about the measurement error. If you have such a reliability estimate, one way to account for it in regression models is to fit the model in the context of structural equation models, where you create a latent age variable which has observed age as its only indicator and whose variance is adjusted by the known reliability of age.
Usually there is little we can do about measurement error, and we end up assuming it does not exist, even if sometimes we don't realize we are making this assumption. But it is an important one, particularly in cases where your measure has poor reliability. It is not uncommon for us to work with data that we know is not reliable.
The two papers go deeper into the theory of measurement error, defining different types of errors and covering different methods of adjusting for it, as well as software. And just to mention here: when you measure age in whole years instead of measuring exact age, you get a different type of measurement error, called Berkson error, and in that case linear model coefficients are not biased.
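As a small illustration of the contrast (made-up numbers again): under Berkson error the true value scatters around the recorded one, rather than the recorded value scattering around the truth, and the slope is not attenuated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000

# Berkson error: the recorded value is fixed and the true value
# scatters around it (as with age recorded in whole years)
obs = rng.normal(35, 10, n)          # what we record
true = obs + rng.normal(0, 5, n)     # truth scatters around the record
y = 1.0 + 0.5 * true + rng.normal(0, 2, n)  # true slope is 0.5

# Simple-regression slope of y on the recorded value
slope_hat = np.cov(obs, y)[0, 1] / np.var(obs, ddof=1)

print(round(slope_hat, 2))  # close to the true 0.5: no attenuation
```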