Attention Conservation Notice: someone asked about this on social media, and I probably should have it written down for the advanced regression course, but you probably don’t need to know it. Also, if you’re an econometrician then I hate your notation and so you will probably hate mine.
In two-stage least squares you have an outcome $Y$ and want to fit a causal linear regression $E[Y\mid do(X=x)]=\theta x$, but have problems with, say, confounding. If you just did the fit naively you would get an estimate $\hat\beta$ that is biased for the causal $\theta$ and instead estimates the conditional expectation $E[Y\mid X=x]=\beta x$ in the current population.
You have some instrumental variables Z that have all the right properties to be instrumental variables, in particular that the effect of Z on Y goes entirely through X.
Let $\gamma$ be the coefficients for regressing $Y$ on $Z$ and $\delta$ be the coefficients for regressing $X$ on $Z$. The effect of $Z$ on $Y$ is $\gamma$, and since this is all mediated by $X$ the effect of $X$ on $Y$ is $\gamma/\delta$, which can be estimated by $\hat\theta=\hat\gamma/\hat\delta$. Alternatively, we can take the fitted values $\hat X=\hat\delta Z$ and regress $Y$ on $\hat X$ to give $\hat\theta$ directly. These give the same coefficient for $X$:
- The ratio: $\hat\theta=\dfrac{\hat\gamma}{\hat\delta}=\dfrac{\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{var}}[Z]}{\widehat{\mathrm{cov}}[Z,X]/\widehat{\mathrm{var}}[Z]}$, where the hats are the usual empirical estimates, which simplifies to $\widehat{\mathrm{cov}}[Z,Y]/\widehat{\mathrm{cov}}[Z,X]$.
- The two-stage: $\hat\theta=(\hat X^T\hat X)^{-1}\hat X^TY$. Noting that $\hat X=P_ZX$, where $P_Z$ is the projection on to the space spanned by $Z$, this is $\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY=(X^TP_ZX)^{-1}X^TP_Z^TY$, since $P_Z$ is symmetric and idempotent. The denominator is $X^TZ(Z^TZ)^{-1}Z^TX$ and the numerator is $X^TZ(Z^TZ)^{-1}Z^TY$, and again everything except the last bit cancels.
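A toy simulation (my own, not from any package example; all variable names here are made up) makes the equivalence concrete: the naive regression is biased, while the ratio and two-stage estimators agree exactly, not just approximately.

```r
# Simulated sketch: one instrument z, unmeasured confounder u, true theta = 2.
set.seed(42)
n <- 10000
z <- rnorm(n)                      # instrument
u <- rnorm(n)                      # unmeasured confounder
x <- 0.8 * z + u + rnorm(n)        # exposure, confounded by u
y <- 2 * x + u + rnorm(n)          # outcome

naive <- coef(lm(y ~ x))["x"]      # estimates the conditional slope beta, not theta
ratio <- cov(z, y) / cov(z, x)     # the ratio estimator
hatx  <- fitted(lm(x ~ z))         # first stage
twostage <- coef(lm(y ~ hatx))["hatx"]

all.equal(unname(twostage), ratio) # identical up to rounding
```

The naive coefficient lands well above 2 because of the confounder, while both instrumental-variable estimators recover something close to the causal $\theta$.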
More generally, you can have multiple $X$s and at least that many $Z$s and do the two-stage thing. Again $\hat X=\hat\delta Z=P_ZX$, and the estimator is $\hat\theta=(\hat X^T\hat X)^{-1}\hat X^TY$.
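The over-identified case works with the same two lines of lm code; here is a hedged toy sketch (again my own simulated data, not from the post) with two instruments for one exposure.

```r
# Simulated sketch: two instruments for one exposure, true theta = 3.
set.seed(1)
n <- 10000
Z <- cbind(z1 = rnorm(n), z2 = rnorm(n))     # two instruments
u <- rnorm(n)                                # unmeasured confounder
x <- drop(Z %*% c(1, 0.5)) + u + rnorm(n)    # exposure
y <- 3 * x + u + rnorm(n)                    # outcome

hatx  <- fitted(lm(x ~ Z))                   # first stage: hatX = P_Z X
theta <- coef(lm(y ~ hatx))["hatx"]          # should be close to 3
```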
Now for the key question. What is the (asymptotic?) variance of ˆθ? Is it just what you’d get from the second-stage model? No, but nearly.
We could write $\hat\theta$ in terms of the inputs, $\hat\theta=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^TY$, and then just use bilinearity of variances: $\mathrm{var}[\hat\theta]=(X^TP_Z^TP_ZX)^{-1}X^TP_Z^T\,\mathrm{var}[Y]\,P_ZX(X^TP_Z^TP_ZX)^{-1}$.
This looks very much like it will simplify to just the ordinary regression variance! There’s one trap: $\mathrm{var}[Y]$. Our model says $Y=X\theta+\epsilon$, for (say) iid Normal $\epsilon$. It doesn’t say $Y=\hat X\theta+\epsilon$.
If we replace $\mathrm{var}[Y]$ by the estimator $\hat\sigma^2I$, where $I$ is the identity, we can’t use $\hat\sigma^2_{\mathrm{wrong}}=\frac{1}{n-\mathrm{df}}\sum_i(Y_i-\hat X_i\hat\theta)^2$ but we can use $\hat\sigma^2=\frac{1}{n-\mathrm{df}}\sum_i(Y_i-X_i\hat\theta)^2$ to get a consistent estimator of the variance, for a suitable degrees-of-freedom correction. The sandwich variance estimator then simplifies in the usual way, as the middle and right-hand chunks cancel each other out, to give $\mathrm{var}[\hat\theta]=\hat\sigma^2(X^TP_ZX)^{-1}=\hat\sigma^2(\hat X^T\hat X)^{-1}$.
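The gap between the two residual variances is easy to see numerically; a simulated sketch (my own toy data, not the cigarette example below) where the model error has variance 2:

```r
# Simulated sketch: wrong vs corrected sigma^2 in two-stage least squares.
set.seed(7)
n <- 10000
z <- rnorm(n)
u <- rnorm(n)
x <- z + u + rnorm(n)
y <- 2 * x + u + rnorm(n)          # model error u + noise has variance 2

stage1 <- lm(x ~ z)
stage2 <- lm(y ~ fitted(stage1))
theta  <- coef(stage2)
df     <- stage2$df.residual

sigma2_wrong <- sum(resid(stage2)^2) / df                # uses Y - hatX theta
sigma2_right <- sum((y - cbind(1, x) %*% theta)^2) / df  # uses Y - X theta
c(wrong = sigma2_wrong, right = sigma2_right)            # right one is near 2
```

The wrong version is badly inflated because its residuals contain $\hat\theta(X-\hat X)$, which is not part of the model error.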
Now let’s do an example from AER::ivreg.
suppressMessages(library(AER))
data("CigarettesSW", package = "AER")
CigarettesSW <- transform(CigarettesSW,
rprice = price/cpi,
rincome = income/population/cpi,
tdiff = (taxs - tax)/cpi
)
fm2 <- ivreg(log(packs) ~ log(rprice) | tdiff, data = CigarettesSW, subset = year == "1995")
Stage 1
stage1 <- lm(log(rprice) ~ tdiff, data = CigarettesSW, subset = year == "1995")
hatX <- fitted(stage1)
Y <- with(subset(CigarettesSW, year == "1995"), log(packs))
stage2 <- lm(Y ~ hatX)
The unscaled covariance matrices are the same
summary(stage2)$cov.unscaled
## (Intercept) hatX
## (Intercept) 63.26849 -13.227909
## hatX -13.22791 2.766546
fm2$cov.unscaled
## (Intercept) log(rprice)
## (Intercept) 63.26849 -13.227909
## log(rprice) -13.22791 2.766546
The residual variance in stage2 is not correct:
sd(resid(stage2))
## [1] 0.2240255
fm2$sigma
## [1] 0.1903539
But we can compute the right one using $Y-\hat\theta X$ instead of $Y-\hat\theta\hat X$:
df <- stage2$df.residual
X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
sqrt(sum((Y - cbind(1, X) %*% coef(stage2))^2)/df)
## [1] 0.1903539
The other example on the help page has univariate X, bivariate Z, and a univariate adjustment variable A, which we handle by pasting A on to the right of both Z and X:
fm <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + tdiff + I(tax/cpi),
data = CigarettesSW, subset = year == "1995")
We get
X <- with(subset(CigarettesSW, year == "1995"), log(rprice))
Z <- with(subset(CigarettesSW, year == "1995"), cbind(tdiff, I(tax/cpi)))
A <- with(subset(CigarettesSW, year == "1995"), log(rincome))
stage1 <- lm(cbind(X, A) ~ Z + A)
hatX <- fitted(stage1)
stage2 <- lm(Y ~ hatX + A)
Compare the coefficients and unscaled covariance matrices:
coef(fm)
## (Intercept) log(rprice) log(rincome)
## 9.8949555 -1.2774241 0.2804048
coef(stage2)
## (Intercept) hatXX hatXA A
## 9.8949555 -1.2774241 0.2804048 NA
fm$cov.unscaled
## (Intercept) log(rprice) log(rincome)
## (Intercept) 31.7527079 -6.7990694 0.2898522
## log(rprice) -6.7990694 1.9629850 -0.9648723
## log(rincome) 0.2898522 -0.9648723 1.6127420
summary(stage2)$cov.unscaled
## (Intercept) hatXX hatXA
## (Intercept) 31.7527079 -6.7990694 0.2898522
## hatXX -6.7990694 1.9629850 -0.9648723
## hatXA 0.2898522 -0.9648723 1.6127420
Again we can’t use the residual variance from stage2, but we can fix it:
fm$sigma
## [1] 0.187856
summary(stage2)$sigma
## [1] 0.2025322
df <- stage2$df.residual
theta <- na.omit(coef(stage2))
sqrt(sum((Y - cbind(1, X, A) %*% theta)^2)/df)
## [1] 0.187856