
Collinearity four more times

Quartets of datasets are trendy now, so here’s one for collinearity diagnostics. Each of these four datasets has two candidate predictors \(x\) and \(z\) and an outcome \(y\), with very strong correlation (over 0.95) between \(x\) and \(z\). I will use the car package for variance inflation factors.

set.seed(2024-11-21)
library(car)
## Loading required package: carData

Redundancy

In the first example, \(Y\perp X\mid Z\), so for prediction it is fine to drop \(X\), and if you do, the coefficients will also estimate the true effects.

d1<-data.frame(x=rnorm(1000))
d1$z<-d1$x+rnorm(1000,s=.2)
d1$y<-d1$z+rnorm(1000,s=.7)

m1a<-lm(y~x+z,data=d1)
summary(m1a)
## 
## Call:
## lm(formula = y ~ x + z, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.72380 -0.48821 -0.04224  0.45949  2.36091 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01373    0.02237  -0.614    0.540    
## x            0.12971    0.11555   1.123    0.262    
## z            0.84855    0.11321   7.495 1.46e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7065 on 997 degrees of freedom
## Multiple R-squared:  0.6534,	Adjusted R-squared:  0.6527 
## F-statistic: 939.6 on 2 and 997 DF,  p-value: < 2.2e-16
vif(m1a)
##        x        z 
## 25.41849 25.41849
m1b<-lm(y~z,data=d1)
summary(m1b) 
## 
## Call:
## lm(formula = y ~ z, data = d1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74387 -0.50931 -0.04643  0.44706  2.35572 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01510    0.02234  -0.676    0.499    
## z            0.97312    0.02246  43.330   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7065 on 998 degrees of freedom
## Multiple R-squared:  0.6529,	Adjusted R-squared:  0.6526 
## F-statistic:  1877 on 1 and 998 DF,  p-value: < 2.2e-16
anova(m1b,m1a)
## Analysis of Variance Table
## 
## Model 1: y ~ z
## Model 2: y ~ x + z
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    998 498.21                           
## 2    997 497.58  1   0.62894 1.2602 0.2619
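Since the point is about prediction, one extra check (my addition, not part of the quartet) is to compare out-of-sample error for the two models on fresh data generated the same way; the name newdata is just illustrative, and the two mean squared errors should come out nearly identical.

newdata<-data.frame(x=rnorm(1000))
newdata$z<-newdata$x+rnorm(1000,s=.2)
newdata$y<-newdata$z+rnorm(1000,s=.7)
# out-of-sample mean squared prediction error, with and without x
mean((newdata$y-predict(m1a,newdata))^2)
mean((newdata$y-predict(m1b,newdata))^2)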

Combination

The second example is the Mr Micawber correlation structure: \(Y\) depends on \(X-Z\), and in the eponymous example \(X\) is expenditure and \(Z\) is income. Each predictor on its own is only weakly associated with \(Y\), but the combination carries essentially all the signal. Dropping either predictor is a bad idea.1

d2<-data.frame(x=rnorm(1000))
d2$z<-d2$x+rnorm(1000,s=.2)
d2$y<-d2$x-d2$z+rnorm(1000,s=1.4)

m2a<-lm(y~x+z,data=d2)
summary(m2a)
## 
## Call:
## lm(formula = y ~ x + z, data = d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0452 -1.0617 -0.0624  0.9366  4.0579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02221    0.04400   0.505    0.614    
## x            0.88973    0.22178   4.012 6.48e-05 ***
## z           -0.99435    0.21848  -4.551 5.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.389 on 997 degrees of freedom
## Multiple R-squared:  0.02586,	Adjusted R-squared:  0.0239 
## F-statistic: 13.23 on 2 and 997 DF,  p-value: 2.13e-06
vif(m2a)
##       x       z 
## 27.6268 27.6268
m2b<-lm(y~z,data=d2)
summary(m2b) 
## 
## Call:
## lm(formula = y ~ z, data = d2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9515 -1.0261 -0.0484  0.9315  4.1074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.01638    0.04430   0.370  0.71166   
## z           -0.13385    0.04188  -3.196  0.00144 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.4 on 998 degrees of freedom
## Multiple R-squared:  0.01013,	Adjusted R-squared:  0.00914 
## F-statistic: 10.21 on 1 and 998 DF,  p-value: 0.001437
anova(m2b,m2a)
## Analysis of Variance Table
## 
## Model 1: y ~ z
## Model 2: y ~ x + z
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    998 1955.4                                  
## 2    997 1924.3  1    31.064 16.094 6.477e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
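To make the structure explicit, here is a sketch (my addition) that fits the difference \(X-Z\) as a single constructed predictor; I() just protects the arithmetic inside the formula, and the coefficient should come out near the true value of 1.

m2c<-lm(y~I(x-z),data=d2)
summary(m2c)$coefficients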

Confounding

The third example is straightforward confounding. \(X\) is a confounder for the effect of \(Z\) on \(Y\). The model with just \(Z\) is extremely misleading: \(Z\) has no effect on \(Y\) at all, yet its coefficient looks large and highly significant.

d3<-data.frame(x=rnorm(100))
d3$z<-d3$x+rnorm(100,s=.2)
d3$y<-d3$x+rnorm(100)


m3a<-lm(y~x+z,data=d3)
summary(m3a)
## 
## Call:
## lm(formula = y ~ x + z, data = d3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.56402 -0.52576  0.01137  0.61253  3.02420 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.01776    0.09518   0.187 0.852371    
## x            1.80065    0.48551   3.709 0.000347 ***
## z           -0.63875    0.47192  -1.354 0.179037    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9252 on 97 degrees of freedom
## Multiple R-squared:  0.6086,	Adjusted R-squared:  0.6005 
## F-statistic: 75.41 on 2 and 97 DF,  p-value: < 2.2e-16
vif(m3a)
##        x        z 
## 26.27358 26.27358
m3b<-lm(y~z,data=d3)
summary(m3b) 
## 
## Call:
## lm(formula = y ~ z, data = d3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7154 -0.6226 -0.0672  0.6294  3.1791 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.06316    0.09850  -0.641    0.523    
## z            1.07784    0.09788  11.012   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9835 on 98 degrees of freedom
## Multiple R-squared:  0.5531,	Adjusted R-squared:  0.5485 
## F-statistic: 121.3 on 1 and 98 DF,  p-value: < 2.2e-16
anova(m3b,m3a)
## Analysis of Variance Table
## 
## Model 1: y ~ z
## Model 2: y ~ x + z
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     98 94.799                                  
## 2     97 83.026  1    11.773 13.755 0.0003471 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
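In this example it is \(Z\), not \(X\), that can safely be dropped. As a sketch (my addition), the model with \(X\) alone should recover the true coefficient of about 1, because \(X\) is the actual cause.

m3c<-lm(y~x,data=d3)
summary(m3c)$coefficients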

Measurement noise

Finally, a setting where a latent variable affects both \(X\) and \(Z\), and also \(Y\). The best model here would use \((X+Z)/2\) as a single predictor, since the two predictors carry no additional information beyond their average, but using the two variables individually is still better than using just one of them.

latent<-rnorm(100,s=0.7)
noise<-rnorm(100,s=0.7)
d4<-data.frame(x=latent+noise+rnorm(100,s=0.15),
		z=latent+noise+rnorm(100,s=0.15),
		y=latent+rnorm(100,s=1))


m4a<-lm(y~x+z,data=d4)
summary(m4a)
## 
## Call:
## lm(formula = y ~ x + z, data = d4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3877 -0.6169  0.0419  0.6557  2.2683 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.07951    0.11253   0.707    0.482
## x           -0.50744    0.60089  -0.844    0.400
## z            1.00203    0.62074   1.614    0.110
## 
## Residual standard error: 1.119 on 97 degrees of freedom
## Multiple R-squared:  0.1452,	Adjusted R-squared:  0.1276 
## F-statistic: 8.237 on 2 and 97 DF,  p-value: 0.0004964
vif(m4a)
##       x       z 
## 25.4797 25.4797
m4b<-lm(y~z,data=d4)
summary(m4b) 
## 
## Call:
## lm(formula = y ~ z, data = d4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3204 -0.6567  0.0560  0.6260  2.3681 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.07293    0.11209   0.651 0.516796    
## z            0.48822    0.12279   3.976 0.000134 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.117 on 98 degrees of freedom
## Multiple R-squared:  0.1389,	Adjusted R-squared:  0.1301 
## F-statistic: 15.81 on 1 and 98 DF,  p-value: 0.0001342
anova(m4b,m4a)
## Analysis of Variance Table
## 
## Model 1: y ~ z
## Model 2: y ~ x + z
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     98 122.28                           
## 2     97 121.39  1   0.89246 0.7131 0.4005
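As a sketch of the single-predictor model described above (my addition, not part of the quartet), the average of the two measurements goes into the formula via I():

m4c<-lm(y~I((x+z)/2),data=d4)
summary(m4c)$coefficients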

Summary

If you want a predictive model, there’s no need to drop correlated variables, because you will already be evaluating models on their predictive accuracy. You do want to drop variables with redundant information, if you can, but that isn’t the same thing as correlation (as example #2 shows). On the other hand, if you want a causal model, you can’t rely on correlations to choose it.

A high VIF tells you that there’s less information about the conditional coefficients \(\beta_{Z|X}\) and \(\beta_{X|Z}\) than about the single-variable coefficients \(\beta_X\) and \(\beta_Z\). It doesn’t tell you which set of coefficients you are interested in.
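With just two predictors the VIF is \(1/(1-r^2)\), where \(r\) is the correlation between them, so the values above can be reproduced by hand; this check is my addition, using the first dataset.

r<-with(d1,cor(x,z))
1/(1-r^2)  # should reproduce vif(m1a)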


  1. This is also the classic scenario where forward selection for predictive models does poorly.