Attention Conservation Notice: someone asked about this on social media, and I probably should have it written down for the advanced regression course, but you probably don’t need to know it. Also, if you’re an econometrician then I hate your notation and so you will probably hate mine.
In two-stage least squares you have an outcome $Y$ and want to fit a causal linear regression $E[Y]=X\beta$, but have problems with, say, confounding. If you just did the fit naively you would get an estimate that is biased for the causal $\beta$ and instead estimates the conditional expectation in the current population.
You have some instrumental variables $Z$ that have all the right properties to be instrumental variables, in particular that the effect of $Z$ on $Y$ goes entirely through $X$.
Let $\gamma$ be the coefficients for regressing $Y$ on $Z$ and $\alpha$ be the coefficients for regressing $X$ on $Z$. The effect of $Z$ on $Y$ is $\gamma$, and since this is all mediated by $X$ the effect of $X$ on $Y$ is $\beta=\gamma/\alpha$, which can be estimated by $\hat\gamma/\hat\alpha$. Alternatively, we can take the fitted values $\hat X=Z\hat\alpha$ and regress $Y$ on $\hat X$ to give $\hat\beta$ directly. These give the same coefficients for $X$ (there’s a little simulation checking this after the two derivations):
- The ratio: $\hat\beta=\hat\gamma/\hat\alpha=\dfrac{\widehat{\mathrm{cov}}(Z,Y)/\widehat{\mathrm{var}}(Z)}{\widehat{\mathrm{cov}}(Z,X)/\widehat{\mathrm{var}}(Z)}$, where the hats are the usual empirical estimates, which simplifies to $\widehat{\mathrm{cov}}(Z,Y)/\widehat{\mathrm{cov}}(Z,X)$
- The two-stage: Noting that $\hat X=Z(Z^TZ)^{-1}Z^TX=P_ZX$, where $P_Z$ is the projection on to the space spanned by $Z$, this is $\hat\beta=(\hat X^T\hat X)^{-1}\hat X^TY$. The denominator is $\hat X^T\hat X=X^TP_ZX$ and the numerator is $\hat X^TY=X^TP_ZY$, and again everything except the last bit cancels, leaving $\widehat{\mathrm{cov}}(Z,Y)/\widehat{\mathrm{cov}}(Z,X)$.
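As a quick sanity check that the two routes agree, here’s a toy simulation (mine, not from the AER examples below) with a single instrument, a single exposure, and an unmeasured confounder:
set.seed(2023)
n <- 10000
z <- rnorm(n)                  # instrument
u <- rnorm(n)                  # unmeasured confounder
x <- z + u + rnorm(n)          # exposure
y <- 2*x + 3*u + rnorm(n)      # true causal coefficient is 2
cov(z, y)/cov(z, x)            # ratio estimator
hatx <- fitted(lm(x ~ z))
coef(lm(y ~ hatx))["hatx"]     # two-stage estimator: identical
coef(lm(y ~ x))["x"]           # naive fit: biased (about 3 here)
The two instrumental-variable estimates agree exactly, and the naive regression does not recover the causal coefficient.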
More generally, you can have multiple $X$s and at least that many $Z$s and do the two-stage thing. Again $\hat X=P_ZX$, and the estimator is $\hat\beta=(X^TP_ZX)^{-1}X^TP_ZY$.
Now for the key question. What is the (asymptotic?) variance of $\hat\beta$? Is it just what you’d get from the second-stage model? No, but nearly.
We could write $\hat\beta$ in terms of the inputs and then just use bilinearity of variances: $\mathrm{var}[\hat\beta]=(X^TP_ZX)^{-1}X^TP_Z\,\mathrm{var}[Y]\,P_ZX(X^TP_ZX)^{-1}$
This looks very much like it will simplify to just the ordinary regression variance! There’s one trap: estimating $\mathrm{var}[Y]=\sigma^2I$. Our model says $Y=X\beta+\epsilon$, for (say) iid Normal $\epsilon$. It doesn’t say $Y=\hat X\beta+\epsilon$.
If we replace $\mathrm{var}[Y]$ by the estimator $\hat\sigma^2I$, where $I$ is the identity, we can’t use $\hat\sigma^2\propto\|Y-\hat X\hat\beta\|^2$ but we can use $\hat\sigma^2\propto\|Y-X\hat\beta\|^2$ to get a consistent estimator of the variance, for some suitable degrees-of-freedom correction. The sandwich variance estimator then simplifies in the usual way as the middle and right-hand chunks cancel each other out to give $\widehat{\mathrm{var}}[\hat\beta]=\hat\sigma^2(X^TP_ZX)^{-1}$
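Written out directly in matrix form (a sketch, reusing z, x, y, and n from the toy simulation above, with intercept columns included):
Zm <- cbind(1, z)                                         # instrument design matrix
Xm <- cbind(1, x)                                         # exposure design matrix
hatXm <- Zm %*% solve(crossprod(Zm), crossprod(Zm, Xm))   # P_Z X
beta <- solve(crossprod(hatXm, Xm), crossprod(hatXm, y))  # (X'P_Z X)^{-1} X'P_Z y
sigma2 <- sum((y - Xm %*% beta)^2)/(n - ncol(Xm))         # residuals from X, not hat X
V <- sigma2 * solve(crossprod(Xm, hatXm))                 # sigma^2 (X'P_Z X)^{-1}
sqrt(diag(V))                                             # standard errors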
Now let’s do an example from AER::ivreg.
suppressMessages(library(AER))
data("CigarettesSW", package = "AER")
CigarettesSW <- transform(CigarettesSW,
rprice = price/cpi,
rincome = income/population/cpi,
tdiff = (taxs - tax)/cpi
)
fm2 <- ivreg(log(packs) ~ log(rprice) | tdiff, data = CigarettesSW, subset = year == "1995")
Stage 1:
stage1 <- lm(log(rprice) ~ tdiff, data = CigarettesSW, subset = year == "1995")
hatX <- fitted(stage1)
Y <- with(subset(CigarettesSW, year == "1995"), log(packs))
stage2 <- lm(Y ~ hatX)
The unscaled covariance matrices are the same:
summary(stage2)$cov.unscaled
## (Intercept) hatX
## (Intercept) 63.26849 -13.227909
## hatX -13.22791 2.766546
fm2$cov.unscaled
## (Intercept) log(rprice)
## (Intercept) 63.26849 -13.227909
## log(rprice) -13.22791 2.766546
The residual variance in stage2 is not correct:
sd(resid(stage2))
## [1] 0.2240255
fm2$sigma
## [1] 0.1903539
But we can compute the right one using $X$ instead of $\hat X$:
df <- stage2$df.residual
X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
sqrt(sum((Y - cbind(1, X) %*% coef(stage2))^2)/df)
## [1] 0.1903539
The other example on the help page has univariate $X$, bivariate $Z$, and a univariate adjustment variable $A$, which we handle by pasting $A$ on to the right of both $X$ and $Z$:
fm <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + tdiff + I(tax/cpi),
data = CigarettesSW, subset = year == "1995")
Doing the two stages by hand, we get:
X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
Z <- with(subset(CigarettesSW, year == "1995"), cbind(tdiff, I(tax/cpi)))
A <- with(subset(CigarettesSW, year == "1995"), log(rincome))
stage1 <- lm(cbind(X, A) ~ Z + A)
hatX <- fitted(stage1)
stage2 <- lm(Y ~ hatX + A)
Compare the coefficients and unscaled covariance matrices:
coef(fm)
## (Intercept) log(rprice) log(rincome)
## 9.8949555 -1.2774241 0.2804048
coef(stage2)
## (Intercept) hatXX hatXA A
## 9.8949555 -1.2774241 0.2804048 NA
fm$cov.unscaled
## (Intercept) log(rprice) log(rincome)
## (Intercept) 31.7527079 -6.7990694 0.2898522
## log(rprice) -6.7990694 1.9629850 -0.9648723
## log(rincome) 0.2898522 -0.9648723 1.6127420
summary(stage2)$cov.unscaled
## (Intercept) hatXX hatXA
## (Intercept) 31.7527079 -6.7990694 0.2898522
## hatXX -6.7990694 1.9629850 -0.9648723
## hatXA 0.2898522 -0.9648723 1.6127420
and again we can’t use the residual variance from stage2, but we can fix it:
fm$sigma
## [1] 0.187856
summary(stage2)$sigma
## [1] 0.2025322
df <- stage2$df.residual
theta <- na.omit(coef(stage2))
sqrt(sum((Y - cbind(1, X, A) %*% theta)^2)/df)
## [1] 0.187856
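As a final check (my addition, on the assumption that ivreg’s vcov is $\hat\sigma^2$ times the unscaled covariance matrix), the corrected $\hat\sigma$ should reproduce the standard errors that ivreg itself reports:
sigma.fixed <- sqrt(sum((Y - cbind(1, X, A) %*% theta)^2)/df)
sqrt(diag(sigma.fixed^2 * fm$cov.unscaled))  # hand-computed standard errors
sqrt(diag(vcov(fm)))                         # should match what ivreg reports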