
Two-stage least squares

Attention Conservation Notice: someone asked about this on social media, and I probably should have it written down for the advanced regression course, but you probably don’t need to know it. Also, if you’re an econometrician then I hate your notation, and so you will probably hate mine.

In two-stage least squares you have an outcome $Y$ and want to fit a causal linear regression $E[Y|\mathrm{do}(X=x)]=\theta x$, but have problems with, say, confounding. If you just did the fit naively you would get an estimate $\hat\beta$ that is biased for the causal $\theta$ and instead estimates the conditional expectation $E[Y|X=x]=\beta x$ in the current population.

You have some instrumental variables $Z$ that have all the right properties to be instrumental variables, in particular that the effect of $Z$ on $Y$ goes entirely through $X$.

Let $\gamma$ be the coefficient for regressing $Y$ on $Z$ and $\delta$ be the coefficient for regressing $X$ on $Z$. The effect of $Z$ on $Y$ is $\gamma$, and since this is all mediated by $X$, the effect of $X$ on $Y$ is $\gamma/\delta$, which can be estimated by $\hat\theta=\hat\gamma/\hat\delta$. Alternatively, we can take the fitted values $\hat X=\hat\delta Z$ and regress $Y$ on $\hat X$ to give $\hat\theta$ directly. These give the same coefficient for $X$ (there’s a simulated check after the list):

  • The ratio: $\hat\theta=\hat\gamma/\hat\delta=\dfrac{\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{var}}[Z]}{\widehat{\mathrm{cov}}[Z,X]/\widehat{\mathrm{var}}[Z]}$, where the hats are the usual empirical estimates; this simplifies to $\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{cov}}[Z,X]$.
  • The two-stage: $\hat\theta=(\hat X^T\hat X)^{-1}\hat X^TY$. Noting that $\hat X=P_ZX$, where $P_Z$ is the projection on to the space spanned by $Z$, this is $\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY=(X^TP_ZX)^{-1}X^TP_ZY$, since $P_Z$ is symmetric and idempotent. The denominator is $X^TZ(Z^TZ)^{-1}Z^TX$ and the numerator is $X^TZ(Z^TZ)^{-1}Z^TY$, and again everything except the last bit cancels.
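As a quick numerical check, here’s a minimal simulation sketch (the data and variable names are made up for illustration, not from any package):

set.seed(42)
n <- 1000
z <- rnorm(n)                     # instrument
u <- rnorm(n)                     # unobserved confounder
x <- z + u + rnorm(n)             # exposure, confounded with the outcome
y <- 2*x + u + rnorm(n)           # true causal effect theta = 2

gammahat <- coef(lm(y ~ z))["z"]  # regress Y on Z
deltahat <- coef(lm(x ~ z))["z"]  # regress X on Z
gammahat/deltahat                 # the ratio estimator

hatx <- fitted(lm(x ~ z))         # stage 1
coef(lm(y ~ hatx))["hatx"]        # the two-stage estimator: same number

Both print the same number, near the true 2, while the naive coef(lm(y ~ x)) is biased upwards by the confounder.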

More generally, you can have multiple $X$s and at least that many $Z$s and do the two-stage thing. Again $\hat X=Z\hat\delta=P_ZX$, and the estimator is $\hat\theta=(\hat X^T\hat X)^{-1}\hat X^TY$.
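Here’s a sketch of the matrix version with two $X$s and three $Z$s (again simulated data of my own), checking the lm-based two-stage fit against the explicit projection formula:

set.seed(1)
n <- 500
Z <- matrix(rnorm(3*n), n, 3)                   # three instruments
u <- rnorm(n)                                   # shared confounder
X <- Z[, 1:2] + u + matrix(rnorm(2*n), n, 2)    # two exposures
colnames(X) <- c("x1", "x2")
Y <- drop(X %*% c(1, -1) + u + rnorm(n))        # true theta = (1, -1)

hatX <- fitted(lm(X ~ Z))                       # stage 1, column by column
coef(lm(Y ~ hatX))                              # stage 2

W <- cbind(1, Z)                                # instruments plus intercept
PZ <- W %*% solve(crossprod(W), t(W))           # projection onto span(1, Z)
Xc <- cbind(1, X)
drop(solve(t(Xc) %*% PZ %*% Xc, t(Xc) %*% PZ %*% Y))  # same coefficients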

Now for the key question. What is the (asymptotic?) variance of $\hat\theta$? Is it just what you’d get from the second-stage model? No, but nearly.

We could write $\hat\theta$ in terms of the inputs, $\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY$, and then just use bilinearity of variances: $\mathrm{var}[\hat\theta]=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^T\,\mathrm{var}[Y]\,P_ZX(X^TP_Z^TP_ZX)^{-1}$.

This looks very much like it will simplify to just the ordinary regression variance! There’s one trap: $\mathrm{var}[Y]$. Our model says $Y=X\theta+\epsilon$, for (say) iid Normal $\epsilon$. It doesn’t say $Y=\hat X\theta+\epsilon$.

If we replace $\mathrm{var}[Y]$ by the estimator $\hat\sigma^2I$, where $I$ is the identity, we can’t use $\hat\sigma^2_{\mathrm{wrong}}=\frac{1}{n-\mathrm{df}}\sum_i(Y_i-\hat X_i\hat\theta)^2$ but we can use $\hat\sigma^2=\frac{1}{n-\mathrm{df}}\sum_i(Y_i-X_i\hat\theta)^2$ to get a consistent estimator of the variance, for some suitable degree-of-freedom correction. The sandwich variance estimator then simplifies in the usual way as the middle and right-hand chunks cancel each other out to give $\mathrm{var}[\hat\theta]=\hat\sigma^2(X^TP_Z^TX)^{-1}=\hat\sigma^2(\hat X^T\hat X)^{-1}$.
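A quick simulated check of this, in the same toy setup as before (my notation, a sketch rather than anything from a package):

set.seed(7)
n <- 1000
z <- rnorm(n); u <- rnorm(n)
x <- z + u + rnorm(n)
y <- 2*x + u + rnorm(n)       # residual from the causal model is u + noise

hatx <- fitted(lm(x ~ z))
stage2 <- lm(y ~ hatx)
theta <- coef(stage2)

summary(stage2)$sigma         # the wrong sigma: residuals use hatx
sigmahat <- sqrt(sum((y - cbind(1, x) %*% theta)^2)/stage2$df.residual)
sigmahat                      # the right sigma: residuals use x (about sqrt(2) here)
sigmahat^2 * summary(stage2)$cov.unscaled   # var[thetahat]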

Now let’s do an example from AER::ivreg.

suppressMessages(library(AER))
data("CigarettesSW", package = "AER")
CigarettesSW <- transform(CigarettesSW,
  rprice  = price/cpi,
  rincome = income/population/cpi,
  tdiff   = (taxs - tax)/cpi
)
fm2 <- ivreg(log(packs) ~ log(rprice) | tdiff, data = CigarettesSW, subset = year == "1995")

Stage 1 and stage 2

# stage 1: regress the exposure on the instrument
stage1 <- lm(log(rprice) ~ tdiff, data = CigarettesSW, subset = year == "1995")
hatX <- fitted(stage1)
# stage 2: regress the outcome on the stage-1 fitted values
Y <- with(subset(CigarettesSW, year == "1995"), log(packs))
stage2 <- lm(Y ~ hatX)

The unscaled covariance matrices are the same

summary(stage2)$cov.unscaled
##             (Intercept)       hatX
## (Intercept)    63.26849 -13.227909
## hatX          -13.22791   2.766546
fm2$cov.unscaled
##             (Intercept) log(rprice)
## (Intercept)    63.26849  -13.227909
## log(rprice)   -13.22791    2.766546

The residual standard deviation in stage2 is not correct

sd(resid(stage2))
## [1] 0.2240255
fm2$sigma
## [1] 0.1903539

But we can compute the right one using $Y-\hat\theta X$ instead of $Y-\hat\theta\hat X$

df <- stage2$df.residual
X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
sqrt(sum((Y - cbind(1, X) %*% coef(stage2))^2)/df)
## [1] 0.1903539

The other example on the help page has univariate $X$, bivariate $Z$, and a univariate adjustment variable $A$, which we handle by pasting $A$ on to the right of both $Z$ and $X$:

fm <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + tdiff + I(tax/cpi),
  data = CigarettesSW, subset = year == "1995")

Doing it by hand:

X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
Z <- with(subset(CigarettesSW, year == "1995"), cbind(tdiff, I(tax/cpi)))
A <- with(subset(CigarettesSW, year == "1995"), log(rincome))

# stage 1: regress both X and A on the instruments plus A
stage1 <- lm(cbind(X, A) ~ Z + A)
hatX <- fitted(stage1)
# stage 2: regress Y on the fitted values plus A
stage2 <- lm(Y ~ hatX + A)

Compare the coefficients and unscaled covariance matrices. The $A$ column of hatX is just $A$ itself, so the separate A term in stage2 is aliased and shows up as NA:

coef(fm)
##  (Intercept)  log(rprice) log(rincome) 
##    9.8949555   -1.2774241    0.2804048
coef(stage2)
## (Intercept)       hatXX       hatXA           A 
##   9.8949555  -1.2774241   0.2804048          NA
fm$cov.unscaled
##              (Intercept) log(rprice) log(rincome)
## (Intercept)   31.7527079  -6.7990694    0.2898522
## log(rprice)   -6.7990694   1.9629850   -0.9648723
## log(rincome)   0.2898522  -0.9648723    1.6127420
summary(stage2)$cov.unscaled
##             (Intercept)      hatXX      hatXA
## (Intercept)  31.7527079 -6.7990694  0.2898522
## hatXX        -6.7990694  1.9629850 -0.9648723
## hatXA         0.2898522 -0.9648723  1.6127420

and again we can’t use the residual standard error from stage2, but we can fix it

fm$sigma
## [1] 0.187856
summary(stage2)$sigma
## [1] 0.2025322
df <- stage2$df.residual
theta <- na.omit(coef(stage2))
sqrt(sum((Y - cbind(1, X, A) %*% theta)^2)/df)
## [1] 0.187856
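Finally, putting the pieces together: the standard errors ivreg reports should just be $\hat\sigma$ times the square roots of the diagonal of the unscaled covariance matrix. Assuming coef(summary(fm)) has the usual "Std. Error" column, the check is:

sqrt(diag(fm$cov.unscaled)) * fm$sigma
coef(summary(fm))[, "Std. Error"]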