
Two-stage least squares

Attention Conservation Notice: someone asked about this on social media, and I probably should have it written down for the advanced regression course, but you probably don’t need to know it. Also, if you’re an econometrician then I hate your notation, and so you will probably hate mine.

In two-stage least squares you have an outcome $Y$ and want to fit a causal linear regression $E[Y|\mathrm{do}(X=x)]=\theta x$, but have problems with, say, confounding. If you just did the fit naively you would get an estimate $\hat\beta$ that is biased for the causal $\theta$ and instead estimates the conditional expectation $E[Y|X=x]=\beta x$ in the current population.

You have some instrumental variables $Z$ that have all the right properties to be instrumental variables, in particular that the effect of $Z$ on $Y$ goes entirely through $X$.

Let $\gamma$ be the coefficient for regressing $Y$ on $Z$ and $\delta$ be the coefficient for regressing $X$ on $Z$. The effect of $Z$ on $Y$ is $\gamma$, and since this is all mediated by $X$, the effect of $X$ on $Y$ is $\gamma/\delta$, which can be estimated by $\hat\theta=\hat\gamma/\hat\delta$. Alternatively, we can take the fitted values $\hat X=\hat\delta Z$ and regress $Y$ on $\hat X$ to give $\hat\theta$ directly. These give the same coefficient for $X$ (there’s a simulated check after the list):

  • The ratio: $\hat\theta=\hat\gamma/\hat\delta=\dfrac{\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{var}}[Z]}{\widehat{\mathrm{cov}}[Z,X]/\widehat{\mathrm{var}}[Z]}$, where the hats are the usual empirical estimates; this simplifies to $\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{cov}}[Z,X]$.
  • The two-stage: $\hat\theta=(\hat X^T\hat X)^{-1}\hat X^TY$. Noting that $\hat X=P_ZX$, where $P_Z$ is the projection on to the space spanned by $Z$, this is $\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY=(X^TP_ZX)^{-1}X^TP_ZY$, since $P_Z$ is symmetric and idempotent. The denominator is $X^TZ(Z^TZ)^{-1}Z^TX$ and the numerator is $X^TZ(Z^TZ)^{-1}Z^TY$, and again everything except the last bit cancels.
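As a quick numerical check, here’s a minimal simulation sketch (the data and variable names are made up for illustration, not from any package):

set.seed(42)
n <- 1000
z <- rnorm(n)                     # instrument
u <- rnorm(n)                     # unobserved confounder
x <- z + u + rnorm(n)             # exposure, confounded with the outcome
y <- 2*x + u + rnorm(n)           # true causal effect theta = 2

gammahat <- coef(lm(y ~ z))["z"]  # regress Y on Z
deltahat <- coef(lm(x ~ z))["z"]  # regress X on Z
gammahat/deltahat                 # the ratio estimator

hatx <- fitted(lm(x ~ z))         # stage 1
coef(lm(y ~ hatx))["hatx"]        # the two-stage estimator: same number

Both print the same number, near the true 2, while the naive coef(lm(y ~ x)) is biased upwards by the confounder.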

More generally, you can have multiple $X$s and at least that many $Z$s and do the two-stage thing. Again $\hat X=Z\hat\delta=P_ZX$, and the estimator is $\hat\theta=(\hat X^T\hat X)^{-1}\hat X^TY$.
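Here’s a sketch of the matrix version with two $X$s and three $Z$s (again simulated data of my own), checking the lm-based two-stage fit against the explicit projection formula:

set.seed(1)
n <- 500
Z <- matrix(rnorm(3*n), n, 3)                   # three instruments
u <- rnorm(n)                                   # shared confounder
X <- Z[, 1:2] + u + matrix(rnorm(2*n), n, 2)    # two exposures
colnames(X) <- c("x1", "x2")
Y <- drop(X %*% c(1, -1) + u + rnorm(n))        # true theta = (1, -1)

hatX <- fitted(lm(X ~ Z))                       # stage 1, column by column
coef(lm(Y ~ hatX))                              # stage 2

W <- cbind(1, Z)                                # instruments plus intercept
PZ <- W %*% solve(crossprod(W), t(W))           # projection onto span(1, Z)
Xc <- cbind(1, X)
drop(solve(t(Xc) %*% PZ %*% Xc, t(Xc) %*% PZ %*% Y))  # same coefficients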

Now for the key question. What is the (asymptotic?) variance of $\hat\theta$? Is it just what you’d get from the second-stage model? No, but nearly.

We could write $\hat\theta$ in terms of the inputs, $\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY$, and then just use bilinearity of variances: $\mathrm{var}[\hat\theta]=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^T\,\mathrm{var}[Y]\,P_ZX(X^TP_Z^TP_ZX)^{-1}$.

This looks very much like it will simplify to just the ordinary regression variance! There’s one trap: $\mathrm{var}[Y]$. Our model says $Y=X\theta+\epsilon$, for (say) iid Normal $\epsilon$. It doesn’t say $Y=\hat X\theta+\epsilon$.

If we replace $\mathrm{var}[Y]$ by the estimator $\hat\sigma^2I$, where $I$ is the identity, we can’t use $\hat\sigma^2_{\mathrm{wrong}}=\frac{1}{n-\mathrm{df}}\sum_i(Y_i-\hat X_i\hat\theta)^2$ but we can use $\hat\sigma^2=\frac{1}{n-\mathrm{df}}\sum_i(Y_i-X_i\hat\theta)^2$ to get a consistent estimator of the variance, for some suitable degree-of-freedom correction. The sandwich variance estimator then simplifies in the usual way as the middle and right-hand chunks cancel each other out to give $\mathrm{var}[\hat\theta]=\hat\sigma^2(X^TP_Z^TX)^{-1}=\hat\sigma^2(\hat X^T\hat X)^{-1}$.
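A quick simulated check of this, in the same toy setup as before (my notation, a sketch rather than anything from a package):

set.seed(7)
n <- 1000
z <- rnorm(n); u <- rnorm(n)
x <- z + u + rnorm(n)
y <- 2*x + u + rnorm(n)       # residual from the causal model is u + noise

hatx <- fitted(lm(x ~ z))
stage2 <- lm(y ~ hatx)
theta <- coef(stage2)

summary(stage2)$sigma         # the wrong sigma: residuals use hatx
sigmahat <- sqrt(sum((y - cbind(1, x) %*% theta)^2)/stage2$df.residual)
sigmahat                      # the right sigma: residuals use x (about sqrt(2) here)
sigmahat^2 * summary(stage2)$cov.unscaled   # var[thetahat]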

Now let’s do an example from AER::ivreg.

suppressMessages(library(AER))
data("CigarettesSW", package = "AER")
CigarettesSW <- transform(CigarettesSW,
  rprice  = price/cpi,
  rincome = income/population/cpi,
  tdiff   = (taxs - tax)/cpi
)
fm2 <- ivreg(log(packs) ~ log(rprice) | tdiff, data = CigarettesSW, subset = year == "1995")

Stage 1 and stage 2

# stage 1: regress the exposure on the instrument
stage1 <- lm(log(rprice) ~ tdiff, data = CigarettesSW, subset = year == "1995")
hatX <- fitted(stage1)
# stage 2: regress the outcome on the stage-1 fitted values
Y <- with(subset(CigarettesSW, year == "1995"), log(packs))
stage2 <- lm(Y ~ hatX)

The unscaled covariance matrices are the same

summary(stage2)$cov.unscaled
##             (Intercept)       hatX
## (Intercept)    63.26849 -13.227909
## hatX          -13.22791   2.766546
fm2$cov.unscaled
##             (Intercept) log(rprice)
## (Intercept)    63.26849  -13.227909
## log(rprice)   -13.22791    2.766546

The residual standard deviation in stage2 is not correct

sd(resid(stage2))
## [1] 0.2240255
fm2$sigma
## [1] 0.1903539

But we can compute the right one using $Y-\hat\theta X$ instead of $Y-\hat\theta\hat X$

df <- stage2$df.residual
X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
sqrt(sum((Y - cbind(1, X) %*% coef(stage2))^2)/df)
## [1] 0.1903539

The other example on the help page has univariate $X$, bivariate $Z$, and a univariate adjustment variable $A$, which we handle by pasting $A$ on to the right of both $Z$ and $X$:

fm <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + tdiff + I(tax/cpi),
  data = CigarettesSW, subset = year == "1995")

Doing it by hand:

X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
Z <- with(subset(CigarettesSW, year == "1995"), cbind(tdiff, I(tax/cpi)))
A <- with(subset(CigarettesSW, year == "1995"), log(rincome))

# stage 1: regress both X and A on the instruments plus A
stage1 <- lm(cbind(X, A) ~ Z + A)
hatX <- fitted(stage1)
# stage 2: regress Y on the fitted values plus A
stage2 <- lm(Y ~ hatX + A)

Compare the coefficients and unscaled covariance matrices. The $A$ column of hatX is just $A$ itself, so the separate A term in stage2 is aliased and shows up as NA:

coef(fm)
##  (Intercept)  log(rprice) log(rincome) 
##    9.8949555   -1.2774241    0.2804048
coef(stage2)
## (Intercept)       hatXX       hatXA           A 
##   9.8949555  -1.2774241   0.2804048          NA
fm$cov.unscaled
##              (Intercept) log(rprice) log(rincome)
## (Intercept)   31.7527079  -6.7990694    0.2898522
## log(rprice)   -6.7990694   1.9629850   -0.9648723
## log(rincome)   0.2898522  -0.9648723    1.6127420
summary(stage2)$cov.unscaled
##             (Intercept)      hatXX      hatXA
## (Intercept)  31.7527079 -6.7990694  0.2898522
## hatXX        -6.7990694  1.9629850 -0.9648723
## hatXA         0.2898522 -0.9648723  1.6127420

and again we can’t use the residual standard error from stage2, but we can fix it

fm$sigma
## [1] 0.187856
summary(stage2)$sigma
## [1] 0.2025322
df <- stage2$df.residual
theta <- na.omit(coef(stage2))
sqrt(sum((Y - cbind(1, X, A) %*% theta)^2)/df)
## [1] 0.187856
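Finally, putting the pieces together: the standard errors ivreg reports should just be $\hat\sigma$ times the square roots of the diagonal of the unscaled covariance matrix. Assuming coef(summary(fm)) has the usual "Std. Error" column, the check is:

sqrt(diag(fm$cov.unscaled)) * fm$sigma
coef(summary(fm))[, "Std. Error"]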