You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 11 Next »

We have a mailing list! This is an effort to create a CAMH community that is interested in all things Statistics, Data, Evidence and Research in general. To subscribe to our mailing list, click here or contact Marcos. And feel free to mention the mainling list to your coleagues. 

Currently our mailing list is used to broadcast our weekly update (Stats News), where we include information that may be of interest to researchers at CAMH, and that is related to data or statistics. That includes education posts, papers and resources, Software tips and scripts.

We have a Statistical Education section where we talk briefly about a new topic each week, usually related to the type of biostatistics that researchers at CAMH use. 

Stats News Education Posts


057 (19/May/2020) – Change in Scale and Variable Standardization

  to Top ↑

Two weeks ago we talked about the role of variable centering in Regression Models. Today we continue that discussion by taking a look at the role of variable standardization and changes in scales in regression models. When we talked about centering and now we are focusing on regression predictors (independent variables) and not outcomes (dependent variables).

By changing the scale we mean to multiply or divide regression predictors by some value. This should not change the inferences about statistical significance of the independent variable.

A nice example of a change in scale is, for example, if you divide a variable Age in Years by 10. What happens is that now you have Age in 10 years, not in years. As a result, the expected effect of age will be 10 times larger in a linear model. This can be helpful because the effect of a single year change in age is often very small, and it is kind of awkward to report. Like, the coefficient of age was 0.0001 (SE = 0.000022). No, don’t report that, instead re-scale age! You will be able to get rid of one zero if you report the coefficient of Age in 10 years. And often it just makes sense that we don’t expect things to have relevant effect over 1 year and so 10 years become more reasonable anyway.

If instead of age we have income in dollars, then besides very small coefficients you may even run into convergence problem for some models and software. Say, what would be the effect of increasing 1 dollar in one’s annual income in terms of the probability of developing dementia? Such small increase in income does not even make sense, but it is what you are estimating if you use income in dollars. The coefficient of income will necessarily be tiny and can even cause computational problems. In this case it is better to measure income in $10,000, for example, so that your coefficient will be the effect of increasing the income by $10,000. You do that by just dividing income by 10,000. Or you could divide it by 20,000 if you think that is more reasonable.

The above are just changes in the scale of the variable, it is like you are using a different unit. The two examples above is what I prefer to do when working with well-known variables, which are variables that people often understand easily.

Sometimes we work with more abstract variables, and we don’t really have a rescaling that we would think of as reasonable. Maybe we are looking at some concentration of some marker, or amplitude of some wave, for example. It is not clear that dividing by 10, 20 or 500 is what makes more sense. In these cases, we may prefer to divide by the estimated standard deviation of the variable. If you divide income by 10,000 you interpret its coefficient as the effect of increasing $10,000 in income. If you divide by its standard deviation, you interpret its coefficient as the effect of increasing one standard deviation in income.

For ease of understanding, lets go back to age. Imagine that we are looking at low income households (between, say, $10,000 and $30,000 – I am making this example up, but bear with me) and we want to look at the effect of income on probability of being victimized. In this case it is reasonable to think that the standard deviation will be around $5,000, which is probably a reasonable change for this population where $10,000 may be unrealistic.

In my view, regardless of the variable we are talking about, a change of 1 standard deviation usually can be though as representing a meaningful, reasonable change, not too small or too large. But that is a rough interpretation, one you can thing in broad terms; you should not use it if you have a better definition of what would be a reasonable change, maybe a clinical definition. Also in broad terms you can think that for a variable in general, most of its values are in between its mean +/- 2 Standard Deviation, so they vary by 4 standard deviation and a change of 1 standard deviation is a sort of a change of around 20 to 25% in the possible range of the variable, which tends to a reasonable change.

Well, you need to be aware that statisticians will come up with these things if you don’t tell them what is a reasonable change for which you want to estimate the effect size. It is like you estimate the effect of depression as measured by some depression score, on some outcome. If we just use the depression score we will find the effect of increasing one point, which may be a to small, meaningless increase. So, you can define a priori that we will look at increases of 6 points because that is what is considered a meaningful change in the literature. It will make the results more interpretable from the practical point of view.

The other reason to rescale variables using their standard deviation is so that coefficients become comparable. Say, you may have both age and income as predictors in a model, and you get a coefficient for age and another for income, but then you want to compare them to have an idea if the effect of age is bigger than the effect of income. You cannot just compare coefficients because the effect of age is relative to a change of 1 year and the effect of income is relative to a change of, say, $1000 in income. So, the coefficients represent the effect of totally different things and cannot be compared. However, if you divide age and income by their standard deviation before adding them to the model, then both coefficients of age and income will be measured in terms of the same change of 1 standard deviation in both variables. This is considered more comparable, if you really need to compare them numerically.

I am saying this because I am usually against such comparisons, I tend to think that as a researcher you need to understand the effect of age and income, then think about both and do the comparison relative to all you know. For example, even if the income effect is quite small it is usually more important because income is something we can modify, age is not. Modifiable or not is an external thing that you don’t solve by dividing by the standard deviation, and if anything, such division will divert your attention from other things that may matter more, like whether they are modifiable.

If besides dividing by the standard deviation you also subtract the mean (center), we call it standardization, or z-score. In this case you also get the benefits of centering. Standardization will not make your variable more normal, which seems to be a common thought out there. Usually you don’t need to worry about normality of regression predictors, only of outcomes, and it is the normality of the model residuals, not the outcome itself, that is assumed by the model..

I will stop there. I will repeat here that we are talking about predictors, not regression outcomes. Another thing you can do with predictors are transformations, like the logarithmic transformation. These can be done to get rid of outliers, to make the model multiplicative, or because it is somehow relevant for interpretation, and also to address non-linearities which is a very important thing and we may talk about it sometime...


056 (12/May/2020) – The meanings of Statistical Significance in clinical trials    to Top ↑

Today we will comment a paper that came out in JAMA this past week. We encourage you to take a look at the paper because they have some nice examples of real trials that help getting the point across.
I though this was a nice paper to comment on because it makes cleat that the interpretation of a significant or not significant result may vary widely. That is just another way to say that being significant or not means very little by itself.

The key point that is emphasized in the paper is the need to take into account the cost related considerations always involved when stating that something is significant or not (which is interpreted as we are making a decision about the effectiveness of the intervention). Not only monetary cost is implied here, but all sort of costs from which important ones are possibility of harm to patients and suffering, as well as use of resources in general (time, equipment, bed that could be used by other patients…).

I think we can summarise it by saying that statistical significance offer you a certain amount of evidence, but the actual amount of evidence you want for a given treatment depends on a diversity of costs. We should demand more evidence from treatments that we know cause harm, or tie down lots of resources, because we don’t want to risk the incurrence of a high cost in exchange for no benefit.

In that sense a non-significant result may be enough evidence for us to change practice, if the treatment we are looking at is already known to cause some harm. We may then stop using such treatment. On the other hand, a non-significant result may still be inconclusive, and we may want to not disregard the treatment if it is something very cheap for which we see no unintended consequences, particularly if the trial design is not very good.

There are two important points that deserves further discussion in the paragraph above. The first is the considerations about using something like a treatment for which statistical evidences has not been shown. It sometimes has to be done because we don’t have strong statistical evidence. These considerations are things like ethics and the right of the patient to being informed. We need to consider all of that, the patient must be informed about the evidence to the extent of possible and we need to be careful that the use of such treatment is not interpreted as the treatment being backed by statistical evidence.

The other point is that the meaning of a significant result also depends on the trial design. And here we can get ourselves into a long discussion which is not the goal today. So, we will stick to a summary, which is to say that the statistical status in terms of the significance of a result may mean very little if the data quality is not good, if the power is low, if the trail was not properly conducted, if the measures are not the best ones, if the pre-registration does not exist or was not followed, if the statistical analysis was improperly conducted, etc. Yes, there are lots of things you must look at before you interpret that p-value.

Besides the cost and the technical aspects of trials, another important aspect is the amount of external evidence, particularly if the threshold used for declaring significant is 0.05, which is a very low bar. You can interpret 0.05 as the probability that our effect is some sort of random noise. The thing is that 0.05 is still a high probability for things that are consequential, a risk we don’t want to take. As an example, you can think that the probability of a fatality given Covid infection has been put at around 0.005 (0.5%). That is 10 times lower than our beloved 0.05 (5%), but for us, since the cost is too high (loss of life), 0.5% is not a low risk and so we will not take that chance by avoiding infection as much as we can.

So, yes, 0.05 is too high for consequential decisions and we will want more evidence which many times will come from external sources. This external evidence can be anything, ultimately, particularly for those into Bayesian Statistics, but I would say that the best external evidence is usually replication with independent teams and experiments. However, I can also see situations where the researcher is very confident about the treatment working, based on experience, related literature, animal experiments, theory and anecdotes. Sometimes you see a significant result and you say “well, no big deal, not surprising at all that this has a positive effect…”. You say that because you have some sort of external evidence, and sometimes that is very important. Other times it will be clear that the significant result is weird, and you will be skeptical. That is lack of external evidence for you. Obviously, you can have some distorted, biased view of things, which has to be factored in too.

So, again, maybe you should take a look at the paper and the examples, but the take home message in my view is that you must always consider a lot more than whether something is significant or not, and making decisions solely based on the significance is going to be associated with poor science practice and poor understanding or statistics


055 (05/May/2020) – Models are part art, part science       to Top ↑

Last week we pointed you to some sources that looked at models for forecasting the spread of Covid, and then it occurred to me that we have an interesting example here of the fact that the modeling endeavour is not really objective. Otherwise, why would we have so many models? So, I just wanted to explore a little bit this point today as I think it is relevant for how we build models.

The first point is that there is no ready recipe out there for how to model your data. Although we will find the more traditional models to be frequently used in a sort of step-by-step way of model building, you not only don’t need to cling to any specific model as if it is the right model for your data, but also it may be counterproductive to do so in the sense that it impairs your ability to do good, creative modeling.
In a more statistical-centric view we may say that there are bad models, and some models are clearly bad, but it is usually unclear what the best model is. And what happens is that experience and even creativity will play some role in the model selection and in our ability to find a good model.

As I said, there are not recipes as far as I know, and that is why I think we still don’t have algorithms doing a good job in predicting the spread of Covid, or algorithms selecting models in general, for that matter. It is not that simple. In my view you will need a substantial amount of experience and knowledge of the different alternatives and of the subject matter in order to be able to build good models. The experience relates to the ability of visualizing the important aspects of the data, model and research question, all at once, and putting that together in a reasonable model.

The subjectivity in model building is also in the assumptions that we have to make, as is also well exemplified in the many Covid related models out there. For example, if a given model demands a specific epidemic spreading parameter, different researchers may reach different values for the parameters depending on their personal view of which data is best used for that end. Some folks may think that the available data is low quality to the extent that their personal guess may be more accurate, taking things to the extreme of subjectivity.

Another major source of assumptions and subjectivity in these models comes up when we must estimate or build models for the effect of physical distancing. Any mathematical definition of physical distancing will end up simplifying things to the extent that one must make decisions as to which simplifications are less harmful to the precision of the model.

Two weeks ago we talked about ARIMA models here in this section. These can be very objective models, and they are the kind of model that computers don’t have much trouble fitting to the data themselves, without human help. However, I have seen zero ARIMA model for the Covid trend! It is important to say that all models we have seen are actually modeling a series of cases, so we could use ARIMA, why not?

Well, you probably could try to use ARIMA models to fit the seasonal flu series, because it is long enough by now, and more importantly, stable. But in the case of Covid, the lack of information is so much that objective and simple models like ARIMA would have no chance to do a good job. Experienced researchers don’t even try it. There is also the fact that the ARIMA models are not traditional part of the epidemiological toolbox, so epidemiologists may not think of it because of that, but there are a lot more folks out there trying models than epidemiologists.

Those of us who are deep into math may not be very happy when we talk about adding subjectivities to our models. There are still folks against Bayesian statistics because it includes the subjectivity of the priors. But I guess we must recognize that the world out there is messy and our minds are able to add information to models that are not in the data. Information that comes in the choice of model and assumptions and our brains. And so the computer depend on us. For the time being.


054 (28/April/2020) – Centering predictors in Regression Models        to Top ↑

This week we got a couple of questions about centering variables in regression models. This is a subject that is always around so we will comment on it briefly.

First, lets define centering. It is usually defined as the transformation that implies subtracting the mean from the original variable. Let’s say that you have the variable Age. You first calculate the average age of your sample, say it is 35 years old. Then for each subject, you subtract 35 from their age. So, a subject who is 25 years old will become -10 and a subject who is 50 will become 15.

What happens is that now the average of the transformed Age will be 0. The Standard Deviation of the transformed Age will not change.

This should not change the model inferential results, in general, in that you will get the same p-values as compared with the non-centering version of the variables. However, if you have continuous variables involved in interactions, they may become easier to interpret, and in particular the main effect of variables involved in interactions will be interpretable when variables are centered.

Imagine that you have Age and BMI, and you centered age at its mean of 35 and BMI at its mean of 28. And you have a linear regression model with only these two variables predicting depression score (DS).

If the model does not involve interaction, then the model intercept is interpretable as the average DS for subjects with Age and BMI at 0 (that is, average Age of 35 and BMI of 28). If you had not centered Age and BMI, the intercept would not be meaningful because there is no subject with Age = 0 and BMI = 0 in the non-centered version of the variable.

If the model has an Age * BMI interaction, then the coefficient of the interaction is how much the Age effect changes when the BMI value changes from 0 (that is, 28) to 1 (that is, 29). The way I think of it (this may be helpful to some) is that if you make BMI = 0, then the Age main effect coefficient is just the Age coefficient because the Age * BMI interaction become Age * 0 and vanishes. If you make BMI = 1, then now you have Age main effect coefficient (you also have the BMI main effect coefficient, of course), and the Age * BMI is Age * 1, that is, the interaction coefficient became another Age coefficient. So, for BMI = 1 the Age coefficient is the sum of the Age main effect coefficient and the Age * BMI coefficient. You can then see that the Age * BMI coefficient is just how much the Age slope increases when you move from BMI = 0 to BMI = 1.

The above paragraph plays out basically the same way if we focus on BMI effect instead of Age, that is, the Age * BMI coefficient is how much the BMI slope changes when we move from Age = 0 (35) to Age = 1 (36). If Age and BMI were not centered, this interpretation would not be possible.

So, in short, if both Age and BMI are centered, the main effect of Age is the slope of Age for subjects at centered BMI = 0 (that is, its average BMI of 28) and the Age * BMI coefficient is how much that slope changes when you go to centered BMI = 1 (that is, BMI = 29). You can easily get a sense of how much the slope of Age (the effect of Age) changes as we change BMI (how BMI moderates the effect of Age), and what is the effect of Age for an average BMI individual (coefficient of main effect Age).

An important point to make is that centering does not have to be at the average, it can be at any value that is relevant. For example, I could center Age at 50 instead of at its average 35, if I am more interested on effects of variables at 50 years old rather than the average Age.

Most of the time the models I see at CAMH have no interaction and so there is not much point to centering predictors. That said, centering is hardly harmful, I mean, it may take you some extra time to do, plus you have to remember that you are working with centered BMI (so that BMI = 25 will probably not exist in the new scale), but those don’t tend to be consequential things.

Finally, I want to discuss one thing which has been contentious, which is that centering helps with multicollinearity. That is the case only if you have interactions in the model, but you are still interested in the main effects. The coefficients of the interaction, and p-values, will not change whether you center the variables involved or not, but the coefficients of the main effects will. You will have in general more precise estimates of main effects in models with interactions, if you center your variables. However, most of the time you are really interested in the interaction, not main effects, and so, centering will not make any difference. Again I want to mention that if you center your variables, that will not harm you. A paper I like that talks about centering and multicollinearity is this one. As evidence of the contentiousness of the subject, someone criticized the paper and the authors published this short follow-up. These papers are nice if you want to get a little deeper into the subject.

We intend to soon follow up with a complement to this text where we want to talk a little bit about changing the scale of the variables (like when you divide by the standard deviation and standardize, rather than center, the variable).


053 (21/April/2020) – Time Series Analysis        to Top ↑

Today we will have a brief introduction to Time Series Analysis, which is not very common among the things we do at CAMH, and that being so we find that many folks are unaware of this statistical methodology, and sometimes mistake it for other techniques related to more usual types of longitudinal data analysis. Here we will explain a little bit of it, and in the next section we provide a nice resource. A nice R related text on Time Series Analysis in R can be found here.

Longitudinal Analysis is often said of the types of techniques that handle data correlated in time, but with few time points. Its goal is pretty much never projection into the future (forecast), as we have little statistical information for that with few data points in time. Instead, the goals of longitudinal analyses are most often to understand change in outcomes between specific time points. This can be done in different ways: by comparing outcome changes or trajectories between groups like intervention and control group; by just estimating changes in time, like from beginning to end of the trial; by gathering evidence for effect of moderators on outcome changes or trajectories; by studying patterns of trajectories; etc.

Although the term “trajectories” is often used in the context of longitudinal data analysis, these are not long sequences of data points, it can be just a two-time points trajectory, and we hardly see more than 4 or 5 time points trajectories. It is also often across many sampling units, which are usually human subjects (that is, you have many trajectories in a single data set, one for each subject).

Sometimes, though, you may face a sequence of data in time that is composed of many time points, say, 30 or more. That is when you think about Time Series Analysis. Because the cost of data collection repeatedly for the same subject used to be expensive, time series data is traditionally not collected at the subject level. For example, you can have daily number of Emergency Department visits as a time series. Or monthly sales of a certain medication. Or weekly number of traffic crashes in Toronto. Or the number of folks entering a subway station at each 10 minutes.

You can see that once you have a series of data points, the type of analysis you can do will need to be very different than the usual regression-based longitudinal data analysis. Even though regression models are so amazing that you can still use them for time-series analysis, specific models have been developed for this kind of data, namely, Time Series Analyses techniques. The usual main goal you have will also be different as it relates to future forecasts, not to specific data points and specific changes, except if you are interested in studying interventions at specific time points and how they affect the series. For example, one could be interested on studying the effect of cannabis legalization on a time series of monthly cannabis consumption.

Probably the most popular time series models are the ARIMA models. They tend to work well in many situations. The idea is to build in an equation a model for the current time t data point (dependent variable) that uses the previous time t-1, t-2, t-3… data points as predictors, as well as previous error terms e(t-1), e(t-2), e(t-3), … . Since data points that are lined up in a sequence tend to be highly correlated, so will be the error terms, and this correlation means that the error (residuals) may carry useful information about the next data point. You can see that in such models a point can be seen as being used twice, as the outcome and then as a predictor for the data points in the future.

ARIMA models can be extended to include seasonal effects as well as intervention effect. Imagine that you have daily visits to Emergence Department at CAMH as your time series data. A Time Series model will show you that the number of visits tomorrow is correlated to the number of visits today, possibly yesterday, possibly 7 days earlier (a weekly seasonal effect), and even maybe 365 days earlier (a yearly seasonal effect – this will be the case if you notice seasonal patterns in the data, like more visits in the winter or summer). In a given day you may add a reception desk to help visitors, and once you have a reasonable amount of daily data after such intervention you can test it, that is, see if adding the desk changed the series in any way. Besides the series of daily visit, maybe you also have daily average temperatures or rainfall, and so there are models that can be used to test those effects, including with lagged effect (for example, the temperature today may have an effect on the number of visits tomorrow).

Even though there are models that deals with many time series simultaneously, that has not been the usual case. However, with data collection becoming easier and more affordable, such situation is becoming more mainstream. As an example, you can think of wearables devices, and you getting the data from those devices at every minute, for 200 subjects, over the period of 10 days. Data like this can be explored in different ways, but we will have to go outside of the traditional ARIMA model. One possibility is the so called Dynamic Structural Equation Models. But we may also have very insightful results with simple descriptive analyses of trajectories, like by studying the changes, slopes, peaks, visualizing, etc. When you have many subjects, the goal can be extended from forecasting to comparing groups of subjects, and to some sort of personalized medicine.

Okay, we did not intend to get too deep into this, but just give you a sense of what kind of data can be handled with these models, as well as types of research goals. So, don’t hesitate reaching out if you think you can use time series but are not sure how.


052 (14/April/2020) – Measurement Error        to Top ↑

The journal Statistics in Medicine has published two papers in its Tutorial in Biostatistics section that brings a comprehensive treatment of measurement error. The first paper introduces different types of measurement error and the second paper looks at advanced methods to deal with it. We will talk a little bit about it, but by no means we would be able to cover the full content of the papers in any way that could be called reasonable because it is advanced and long. Instead we will just get you aware of some points we think are more important for our research.

Measurement error is what the name implies: when you collect your data, you are unable to measure things precisely for different reasons. The extent to which your measure is not precise will affect the statistical analysis that you will do, effect that will always be negative since measurement error is a source of uncertainty: the lower the better.

Maybe the most intuitive thing that happens is that you lose power. Let’s say that you want to measure subject’s age, but for some reason like the subjects not being comfortable telling you their age, you don’t get the age precisely correct. This will make it harder for you to detect effects related to age, that is, it will decrease power.

However, I would say that losing only power is the best-case scenario. If subjects tend to understate their age, for example, meaning that what you get is not only an age variable that is not precise in random ways, but an age variable that is biased downwards, then the consequences can go beyond losing power to straight biased estimates and conclusions from statistical analysis. That can be the case even if your measurement error in age is just random (some subjects understate and other overstate their age).

The classical example of a problem caused by measurement error is when you collect age variable that is not precise and use it as a regression predictor. One of the little mentioned regression assumptions is the assumption that the predictors are measured without error, and you only have error in the outcome. The measurement error present in age will cause the coefficients of age in a linear model to be biased downwards, and to the extent that age is associated to other predictors in the model it will also affect the coefficient of the other predictors. This is problematic, even if you can measure your predictor without bias, coefficients of linear models will be biased, and it is hard to say what happens if the model is not linear.

If you want to adjust for the measurement error in age (or in a regression predictor in general), you will need to know something about that measurement error. When measurement error does not exist the reliability is 1, that is, your measure has perfect reliability. So, knowing the reliability of a predictor is one way to know something about the measurement error. If you have such reliability, one way to account for it in regression models is to fit it in the context of structural equation models, where you create a latent variable age which ahs observed age as its only indicator and which will have its variance adjusted by the known reliability of age.

Usually there is little we can do about measurement error and we end up assuming it does not exist, even if sometimes we don’t realize that we are making this assumption. But it is an important one, particularly revealing in cases where your measure has poor reliability. It is not uncommon for us to work with data that we know is not reliable.

In the two papers, they go deeper in the theory of measurement error, including by defining different types of errors, different methods of adjusting for it and software. And just to mention here, when you measure age in years instead of measuring the exact age, you get a different type of measurement error called Berkson’s error, and in this case linear model coefficients are not biased.





  • No labels