Attention Conservation Notice: someone asked about this on social media, and I probably should have it written down for the advanced regression course, but you probably don’t need to know it. Also, if you’re an econometrician then I hate your notation and so you will probably hate mine.
In two-stage least squares you have an outcome \(Y\) and want to fit a causal linear regression \(E[Y|do(X=x)]=\theta x\), but have problems with, say, confounding. If you just did the fit naively you would get an estimate \(\hat\beta\) that is biased for the causal \(\theta\) and instead estimates the conditional expectation \(E[Y|X=x]=\beta x\) in the current population.
You have some instrumental variables \(Z\) with all the right properties to be instruments; in particular, the effect of \(Z\) on \(Y\) goes entirely through \(X\).
Let \(\gamma\) be the coefficient for regressing \(Y\) on \(Z\) and \(\delta\) be the coefficient for regressing \(X\) on \(Z\). The effect of \(Z\) on \(Y\) is \(\gamma\), and since this is all mediated by \(X\) the effect of \(X\) on \(Y\) is \(\gamma/\delta\), which can be estimated by \(\hat\theta=\hat\gamma/\hat\delta\). Alternatively, we can take the fitted values \(\hat X=\hat \delta Z\) and regress \(Y\) on \(\hat X\) to give \(\hat\theta\) directly. These give the same coefficient for \(X\):
- The ratio: \[\hat\theta=\frac{\hat\gamma}{\hat\delta}= \frac{\widehat{\mathrm{var}}[Z]^{-1}\widehat{\mathrm{cov}}[Z,Y]}{\widehat{\mathrm{var}}[Z]^{-1}\widehat{\mathrm{cov}}[Z,X]},\] where the hats are the usual empirical estimates; the \(\widehat{\mathrm{var}}[Z]\) factors cancel, leaving \(\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{cov}}[Z,X]\)
- The two-stage: \[\hat\theta = (\hat X^T\hat X)^{-1}\hat X^TY.\] Noting that \(\hat X=P_ZX\), where \(P_Z\) is the (symmetric, idempotent) projection on to the space spanned by \(Z\), this is \[\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY =(X^TP_ZX)^{-1}X^TP_ZY.\] The denominator is \(X^TZ(Z^TZ)^{-1}Z^TX\) and the numerator is \(X^TZ(Z^TZ)^{-1}Z^TY\), and again everything except the last bit cancels.
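As a sanity check (not part of the original derivation), here is a minimal simulation sketch; the effect sizes, seed, and variable names are all made up for illustration. The naive fit is biased, while the ratio and two-stage estimators agree exactly and land near the true \(\theta\):

set.seed(42)                  # arbitrary
n <- 1000
U <- rnorm(n)                 # unobserved confounder
Z <- rnorm(n)                 # instrument: affects Y only through X
X <- 0.5*Z + U + rnorm(n)
theta <- 2                    # true causal effect (made up)
Y <- theta*X + U + rnorm(n)
coef(lm(Y ~ X))["X"]          # naive fit: biased for theta
cov(Z, Y)/cov(Z, X)           # the ratio estimator
hatX <- fitted(lm(X ~ Z))     # stage 1
coef(lm(Y ~ hatX))["hatX"]    # stage 2: identical to the ratio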
More generally, you can have multiple \(X\)s and at least that many \(Z\)s and do the two-stage thing. Again \(\hat X=Z\hat \delta= P_ZX\), and the estimator is \[\hat\theta = (\hat X^T\hat X)^{-1}\hat X^TY.\]
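In matrix form (continuing the simulation sketch above, with an intercept column pasted on to both \(X\) and \(Z\)), the same computation done by hand should reproduce the slope from the sketch:

Zmat <- cbind(1, Z)                                # instrument matrix
Xmat <- cbind(1, X)                                # regressor matrix
PZ <- Zmat %*% solve(crossprod(Zmat)) %*% t(Zmat)  # projection onto span(Z)
hatXmat <- PZ %*% Xmat                             # first-stage fitted values
solve(crossprod(hatXmat), crossprod(hatXmat, Y))   # (hatX'hatX)^{-1} hatX'Y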
Now for the key question. What is the (asymptotic?) variance of \(\hat\theta\)? Is it just what you’d get from the second-stage model? No, but nearly.
We could write \(\hat\theta\) in terms of the inputs \[\hat\theta = ( X^TP_Z^TP_Z X)^{-1} X^TP_Z^TY,\] and then just use bilinearity of variances: \[\mathrm{var}[\hat\theta]= ( X^TP_Z^TP_Z X)^{-1} X^TP_Z^T\mathrm{var}[Y] P_ZX( X^TP_Z^TP_Z X)^{-1}.\]
This looks very much like it will simplify to just the ordinary regression variance! There’s one trap: \(\mathrm{var}[Y]\). Our model says \(Y=X\theta+\epsilon\), for (say) iid Normal \(\epsilon\). It doesn’t say \(Y=\hat X\theta+\epsilon\).
If we replace \(\mathrm{var}[Y]\) by the estimator \(\hat\sigma^2 I\), where \(I\) is the identity, we can’t use \[\hat\sigma^2_{\text{wrong}}=\frac{1}{n-\text{df}} \sum_i (Y_i-\hat X_i\hat\theta)^2\] but we can use \[\hat\sigma^2=\frac{1}{n-\text{df}} \sum_i (Y_i-X_i\hat\theta)^2\] to get a consistent estimator of the variance, for some suitable degrees-of-freedom correction. The sandwich variance estimator then simplifies in the usual way: since \(P_Z^TP_Z=P_Z\), the middle and right-hand chunks cancel to give \[\mathrm{var}[\hat\theta]= \hat\sigma^2 ( X^TP_Z X)^{-1} =\hat\sigma^2 ( \hat X^T\hat X)^{-1}.\]
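Continuing the simulation sketch, here are the wrong and right versions of \(\hat\sigma\), and the resulting corrected standard errors; this is just an illustration of the formula above under the made-up setup, not a general-purpose implementation:

stage2sim <- lm(Y ~ hatX)                 # hatX from the sketch above
thetahat <- coef(stage2sim)
dfres <- stage2sim$df.residual
sqrt(sum((Y - cbind(1, hatX) %*% thetahat)^2)/dfres)           # wrong: residuals off hat X
sigmahat <- sqrt(sum((Y - cbind(1, X) %*% thetahat)^2)/dfres)  # right: residuals off X
sqrt(diag(sigmahat^2 * summary(stage2sim)$cov.unscaled))       # sigma^2 (hatX'hatX)^{-1}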
Now let’s do an example from AER::ivreg.
suppressMessages(library(AER))
data("CigarettesSW", package = "AER")
CigarettesSW <- transform(CigarettesSW,
rprice = price/cpi,
rincome = income/population/cpi,
tdiff = (taxs - tax)/cpi
)
fm2 <- ivreg(log(packs) ~ log(rprice) | tdiff, data = CigarettesSW, subset = year == "1995")
Stage 1:
stage1 <-lm(log(rprice)~tdiff, data = CigarettesSW, subset = year == "1995")
hatX <- fitted(stage1)
Y<-with(subset(CigarettesSW, year == "1995"), log(packs))
stage2<-lm(Y~hatX)
The unscaled covariance matrices are the same:
summary(stage2)$cov.unscaled
## (Intercept) hatX
## (Intercept) 63.26849 -13.227909
## hatX -13.22791 2.766546
fm2$cov.unscaled
## (Intercept) log(rprice)
## (Intercept) 63.26849 -13.227909
## log(rprice) -13.22791 2.766546
The residual standard deviation in stage2 is not correct:
sd(resid(stage2))
## [1] 0.2240255
fm2$sigma
## [1] 0.1903539
But we can compute the right one using \(Y-X\hat\theta\) instead of \(Y-\hat X\hat\theta\):
df<-stage2$df.residual
X<-with(subset(CigarettesSW, year == "1995"), log(rprice))
sqrt(sum((Y-cbind(1,X)%*%coef(stage2))^2)/df)
## [1] 0.1903539
The other example on the help page has univariate \(X\), bivariate \(Z\), and a univariate adjustment variable \(A\), which we handle by pasting \(A\) on to the right of both \(Z\) and \(X\):
fm <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + tdiff + I(tax/cpi),
data = CigarettesSW, subset = year == "1995")
Doing this by hand:
X<- with(subset(CigarettesSW, year == "1995"), log(rprice))
Z<- with(subset(CigarettesSW, year == "1995"), cbind(tdiff,I(tax/cpi)))
A<-with(subset(CigarettesSW, year == "1995"), log(rincome))
stage1<-lm(cbind(X,A)~Z+A)
hatX<-fitted(stage1)
stage2<-lm(Y~hatX+A)
Compare the coefficients and unscaled covariance matrices (the NA below is the pasted-on copy of \(A\): the stage-1 fitted value of \(A\) is exactly \(A\), so the extra column is collinear and gets dropped):
coef(fm)
## (Intercept) log(rprice) log(rincome)
## 9.8949555 -1.2774241 0.2804048
coef(stage2)
## (Intercept) hatXX hatXA A
## 9.8949555 -1.2774241 0.2804048 NA
fm$cov.unscaled
## (Intercept) log(rprice) log(rincome)
## (Intercept) 31.7527079 -6.7990694 0.2898522
## log(rprice) -6.7990694 1.9629850 -0.9648723
## log(rincome) 0.2898522 -0.9648723 1.6127420
summary(stage2)$cov.unscaled
## (Intercept) hatXX hatXA
## (Intercept) 31.7527079 -6.7990694 0.2898522
## hatXX -6.7990694 1.9629850 -0.9648723
## hatXA 0.2898522 -0.9648723 1.6127420
And again we can’t use the residual standard deviation from stage2, but we can fix it:
fm$sigma
## [1] 0.187856
summary(stage2)$sigma
## [1] 0.2025322
df<-stage2$df.residual
theta<-na.omit(coef(stage2))
sqrt(sum( (Y-cbind(1,X,A)%*%theta)^2)/df)
## [1] 0.187856