
Co-linearity

Co-linearity of regression predictors (often ‘multicollinearity’) is one of those topics where a lot of regression textbooks are Unhelpful, in part because they don’t think about the reasons for fitting a model. Their idea is, often, that you should examine diagnostics that tell you whether there is multicollinearity, and use these diagnostics to remove variables from your model.

Let us suppose we have two predictor variables X and Z and an outcome Y, and that the models E[Y|X=x] = β_0 + β_X x and E[Y|X=x,Z=z] = γ_0 + γ_X x + γ_Z z are (approximately) correctly specified. We can write down relationships between the βs and γs. For example, if δ_ZX is the regression coefficient of Z on X, then β_X = γ_X + γ_Z δ_ZX. Since we will typically not have β_X = γ_X, if we're interested in the coefficient of X we'd better hope that domain knowledge tells us whether we're interested in γ_X or β_X. Let us assume it does.
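To make the algebra concrete, here is a small simulation sketch in plain numpy (the data-generating values 2, 3, and 0.8 are invented for illustration) checking that the short-model coefficient comes out as γ_X + γ_Z δ_ZX:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(scale=0.6, size=n)        # Z correlated with X; delta_ZX = 0.8
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)   # gamma_X = 2, gamma_Z = 3

def ols(cols, response):
    """Least-squares coefficients, with an intercept column prepended."""
    A = np.column_stack([np.ones(len(response)), *cols])
    return np.linalg.lstsq(A, response, rcond=None)[0]

gamma = ols([x, z], y)   # long model: Y on X and Z
beta = ols([x], y)       # short model: Y on X alone
delta = ols([x], z)      # regression of Z on X

# beta_X should be close to gamma_X + gamma_Z * delta_ZX
print(beta[1], gamma[1] + gamma[2] * delta[1])
```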

The Fisher information about β_X is the variance of X divided by the residual variance of Y; you learn more about β_X when X varies a lot and when there isn't much residual noise. The Fisher information about γ_X, however, is the conditional variance of X given Z, divided by the residual variance of Y: in order to learn about γ_X you need X to vary separately from Z.
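A rough Monte Carlo illustration of this (again with invented parameter values): the spread of the γ_X estimates tracks the conditional variance of X given Z, while the spread of the β_X estimates tracks the marginal variance of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, rho = 200, 2000, 0.9
beta_hats, gamma_hats = [], []

for _ in range(reps):
    x = rng.normal(size=n)                                   # Var(X) = 1
    z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # Var(Z) = 1, corr(X, Z) = rho
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)               # residual sd 1 in the long model
    ones = np.ones(n)
    gamma_hats.append(np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)[0][1])
    beta_hats.append(np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)[0][1])

# Long model: sd(gamma_X_hat) ~ sigma_long / sqrt(n * Var(X|Z)), with Var(X|Z) = 1 - rho^2.
# Short model: sd(beta_X_hat) ~ sigma_short / sqrt(n * Var(X)), where sigma_short also
# absorbs the part of Z that X does not explain.
print("empirical sd of gamma_X estimates:", np.std(gamma_hats))
print("theory                           :", 1 / np.sqrt(n * (1 - rho**2)))
print("empirical sd of beta_X estimates :", np.std(beta_hats))
print("theory                           :", np.sqrt((1 + 9 * (1 - rho**2)) / n))
```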

If the correlation between X and Z is approximately zero, then β_X ≈ γ_X and the information about γ_X is about the same as, or greater than, the information about β_X. That's the designed-experiment setting.

If the correlation is high, the information about γ_X may be much smaller than the information about β_X, and the two coefficients may be quite different. In that setting, it matters a lot which model you decide to use. Collinearity diagnostics aren't much help, because they don't tell you whether you're interested in β_X or γ_X. What they tell you is that if you're interested in γ_X, it sucks to be you, because the data don't provide much information about γ_X. You don't need collinearity diagnostics to tell you that; you can just look at an uncertainty interval. However tempting it might be, you can't just react to a poor estimate of γ_X by pretending you always wanted to estimate β_X instead.
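For what it's worth, here is what "just look at an uncertainty interval" looks like in practice; this sketch assumes statsmodels is available and uses made-up numbers, with the classic VIF diagnostic computed alongside for comparison.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, rho = 200, 0.98
x = rng.normal(size=n)
z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
y = 2.0 * x + 3.0 * z + rng.normal(size=n)

long_fit = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
short_fit = sm.OLS(y, sm.add_constant(x)).fit()

vif = 1 / (1 - np.corrcoef(x, z)[0, 1] ** 2)            # variance inflation factor for X
print("VIF for X:", vif)
print("95% CI for gamma_X:", long_fit.conf_int()[1])    # wide: the data say little about gamma_X
print("95% CI for beta_X :", short_fit.conf_int()[1])   # narrow, but it estimates a different thing
```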

Prediction

The other possibility is that you're not interested in the coefficients, just in prediction. When you want to do prediction, the strategy is to fit a whole bunch of models, probably with some sort of complexity penalty, and estimate the prediction error of each one. If the same correlation structure will be present when the models are used, you don't have a problem. You might be tempted to use the correlation to reduce the set of predictors, but as Charles Dickens shows us, this requires domain knowledge:

Annual income twenty pounds, annual expenditure nineteen nineteen and six, result happiness. Annual income twenty pounds, annual expenditure twenty pounds ought and six, result misery. – Mr Micawber, in David Copperfield

Even though income and expenditure are highly correlated, neither one separately will do anywhere near as well as the pair in predicting who ends up happy and who ends up in debtors' prison.
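A toy version of the Micawber example, using scikit-learn only for convenience (all the numbers are invented): happiness depends on income minus expenditure, the two predictors are highly correlated, and either one alone predicts far worse than the pair.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 10_000
income = rng.normal(20, 2, size=n)
expenditure = income + rng.normal(0, 0.5, size=n)   # tracks income closely
happy = (income > expenditure).astype(int)          # result happiness; otherwise, result misery

both = np.column_stack([income, expenditure])
for name, features in [("income only", income[:, None]),
                       ("expenditure only", expenditure[:, None]),
                       ("both", both)]:
    acc = LogisticRegression(max_iter=1000).fit(features, happy).score(features, happy)
    print(f"{name}: accuracy {acc:.2f}")
```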

If the same correlation structure will not be present when the model is used, we’re back in the causal inference setting: it sucks to be you. It will likely matter whether you use just X or just Z or both for prediction, and the data can’t be relied on to advise you.
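One last sketch of that point, under the same invented setup: a model trained where X and Z are highly correlated, using X alone, predicts fine in training but badly once the correlation is gone, while the model with both predictors is unaffected because the outcome genuinely depends on both.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(n, rho):
    x = rng.normal(size=n)
    z = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 2.0 * x + 3.0 * z + rng.normal(scale=0.5, size=n)
    return x, z, y

def fit(cols, y):
    A = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def rmse(coef, cols, y):
    A = np.column_stack([np.ones(len(y)), *cols])
    return np.sqrt(np.mean((y - A @ coef) ** 2))

x, z, y = make_data(5000, rho=0.95)     # training: X and Z highly correlated
x2, z2, y2 = make_data(5000, rho=0.0)   # deployment: correlation structure gone

coef_x = fit([x], y)
coef_xz = fit([x, z], y)
print("X only : train RMSE", rmse(coef_x, [x], y), " shifted RMSE", rmse(coef_x, [x2], y2))
print("X and Z: train RMSE", rmse(coef_xz, [x, z], y), " shifted RMSE", rmse(coef_xz, [x2, z2], y2))
```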