In sandwich variance estimators, the middle of the sandwich[^1] is the variance of the estimating function. If we have independent observations and estimate $\theta$ by solving $\sum_{i=1}^n U_i(\hat\theta)=0$, we want $\mathrm{var}\left[\sum_{i=1}^n U_i(\theta_0)\right]$, which we estimate by $\sum_{i=1}^n U_i(\hat\theta)U_i(\hat\theta)^T$. There are two issues here you might miss.
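To make this concrete, here's a minimal sketch (a toy example of my own, not anything from a package): the simplest M-estimator, the sample mean, where $U_i(\theta)=X_i-\theta$, so the whole sandwich is easy to write down.

```python
# Minimal sketch: the full sandwich for the sample mean, where the
# estimating function is U_i(theta) = x_i - theta.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)

theta_hat = x.mean()          # solves sum_i U_i(theta) = 0
U = x - theta_hat             # U_i evaluated at theta-hat

A = len(x)                    # -sum_i dU_i/dtheta = n, the "bread"
B = np.sum(U**2)              # sum_i U_i^2, the middle of the sandwich
sandwich_var = B / A**2       # A^{-1} B A^{-1}

# agrees with the usual formula sum_i (x_i - xbar)^2 / n^2
print(sandwich_var, x.var(ddof=0) / len(x))
```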
First, some people (younger me, for example) worry that $\sum_i U_i(\hat\theta)$ is always zero, by construction, so that surely $\mathrm{var}\left[\sum_i U_i(\hat\theta)\right]$ will also be zero. This is a nice simple mistake of confusing the random variable with its value in a particular data set. In our data, we might have $\hat\theta=42.69$, so that $\sum_i U_i(42.69)=0$. In other data sets sampled from the same distribution, $\sum_i U_i(42.69)$ will have some other value; it should be small, but it won't be zero. When we write $\mathrm{var}\left[\sum_i U_i(\hat\theta)\right]$ the notation $\hat\theta$ doesn't mean the estimator as a random variable, it means the value that the estimator took in this sample. We want
$$\mathrm{var}\left[\sum_i U_i(\theta)\right]\Bigg|_{\theta=42.69}.$$
In other data sets sampled from the same distribution we won't have $\hat\theta=42.69$, so $\sum_i U_i(42.69)$ is not zero, and $\mathrm{var}\left[\sum_i U_i(\theta)\right]\big|_{\theta=42.69}$ is a reasonable estimator of $\mathrm{var}\left[\sum_i U_i(\theta_0)\right]$ if $42.69$ is close to $\theta_0$, which we know happens with high probability.
And, in addition, $\mathrm{var}\left[\sum_i U_i(\theta_0)\right]=\sum_i \mathrm{var}\left[U_i(\theta_0)\right]$ (assuming independence) is well estimated by $\sum_i U_i(\hat\theta)U_i(\hat\theta)^T$, which we can evaluate without assuming a particular parametric model to compute the variances. We can prove this without a great deal of subtlety: just use Chebyshev's inequality on the iid sum.
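Both points are easy to see in a quick simulation; the Exponential data and the sample-mean $U_i$ are just my stand-ins for a generic estimating function.

```python
# Sketch: freeze theta at the value it took in one sample, then evaluate
# the estimating function U_i(theta) = x_i - theta on fresh samples.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
theta_obs = rng.exponential(size=n).mean()   # the "42.69" of our data set

for _ in range(3):
    x_new = rng.exponential(size=n)
    # not zero, but only O_p(sqrt(n)), small relative to n
    print(np.sum(x_new - theta_obs))

# and the plug-in meat estimates var[sum_i U_i(theta_0)] = n * var(X) = n
x_new = rng.exponential(size=n)
print(np.sum((x_new - theta_obs) ** 2), n)   # close to n for Exponential(1)
```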
Things get more interesting when we have dependence. We need to include some crossproduct terms in the variance calculation, representing non-zero covariances of $U_i$ and $U_j$. It's still true[^2] that
$$E\left[\sum_{i,j} U_i(\theta_0)U_j(\theta_0)^T\right]=\mathrm{var}\left[\sum_i U_i(\theta_0)\right]$$
but it's no longer true that
$$\sum_{i,j} U_i(\hat\theta)U_j(\hat\theta)^T\approx \mathrm{var}\left[\sum_i U_i(\theta_0)\right].$$
The left-hand side of this is identically zero, just like it wasn't for the iid case. The left-hand side evaluates to
$$\sum_{i,j} U_i(\hat\theta)U_j(\hat\theta)^T=\left(\sum_i U_i(\hat\theta)\right)\left(\sum_j U_j(\hat\theta)\right)^T=0.$$
That's actually a bit surprising. The bias[^3][^4] from using $\hat\theta$ instead of $\theta_0$ has grown from asymptotically negligible to the whole estimator.[^5]
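You can watch the collapse happen numerically (my toy mean example again):

```python
# Sketch: with ALL crossproducts included, the "meat" collapses to
# (sum_i U_i(theta_hat)) (sum_j U_j(theta_hat)), which is zero by the
# estimating equation that defined theta_hat.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
U = x - x.mean()                  # U_i(theta_hat) for the sample mean

print(np.sum(np.outer(U, U)))     # zero, up to floating-point rounding
```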
The difference between $\hat\theta$ and $\theta_0$ is $O_p(n^{-1/2})$, so the difference between $U_i(\hat\theta)U_i(\hat\theta)^T$ and $U_i(\theta_0)U_i(\theta_0)^T$ is also $O_p(n^{-1/2})$, giving a bias of size $O_p(n^{1/2})$ in a variance matrix that's of order $n$. So in the iid case the bias is asymptotically negligible. In the completely correlated case we have $n^2$ copies of the bias, so it's only $O_p(n^{3/2})$, no longer negligible next to a variance of order $n$. That doesn't prove there's a problem, since $O_p$ terms only give upper bounds, but we already knew there was a problem and it makes sense.
This counting argument suggests that we need $o(n^{3/2})$ terms in the sum to get a sandwich estimator that works. Or, given that we'd like a little margin to get a rate of convergence, maybe $O(n^{3/2-\epsilon})$ terms. That's potentially a problem – $n^{3/2}$ is not $n^2$[^6] – but in some settings we know that some of the terms are exactly zero and we can leave them out.
In longitudinal data with $m$ observations on each of $n$ units, we have $N=nm$ data points, but all pairs of observations on two distinct units will have zero covariance. We can replace $U_i(\hat\theta)U_j(\hat\theta)^T$ for those pairs by just 0. The number of pairs we have left is $nm^2$, so if we have $n\to\infty$, and if $m\to\infty$ and is bounded by any power of $n$, we have a consistent sandwich estimator. In crossed clustering with $n$ and $m$ distinct groups of two types, again we need $n,m\to\infty$ and bounded above and below by some power of each other. In both these cases the proofs are again just brute-force counting applications of Chebyshev's inequality and a suitable Taylor series expansion.
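Here's a sketch of the longitudinal fix, still with the sample mean and a made-up exchangeable within-unit correlation structure, so the 'truth' is easy to compute:

```python
# Sketch: cluster-robust "meat" keeps only within-unit crossproducts,
# i.e. sum_g (sum_{i in g} U_i)(sum_{i in g} U_i)^T.
import numpy as np

rng = np.random.default_rng(2)
n_units, m = 200, 5
b = rng.normal(size=n_units)                     # shared unit effect
x = b[:, None] + rng.normal(size=(n_units, m))   # x[g, t]: N = n*m points

U = x - x.mean()                                 # U_i(theta_hat), mean example
meat = np.sum(U.sum(axis=1) ** 2)                # within-unit pairs only

# truth: var[sum_i U_i] = n * var(unit row sum) = n * (m + m^2) here
print(meat, n_units * (m + m**2))
```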
For time series and spatial data it's a bit more tricky, since we don't ever have exact independence of $U_i$ and $U_j$. Here we need to drop the terms where $\mathrm{cov}\left[U_i,U_j\right]$ is small enough, and we need to choose the threshold for 'small enough' to get stricter with increasing $n$ at a rate that lets us use Chebyshev's inequality on the non-zero terms we keep but still bound the bias from dropping terms.[^7][^8]
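For a time series this means keeping only the pairs within some lag window that grows slowly with $n$. A minimal sketch with plain truncation (in practice you'd more likely use smooth Bartlett-type weights, as in Newey–West estimators), on made-up MA(1) data where the truth is known:

```python
# Sketch: truncated "meat" for a time series, keeping pairs with |i - j| <= L.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
e = rng.normal(size=n + 1)
x = 1.0 + e[1:] + 0.9 * e[:-1]        # MA(1): correlation dies after lag 1

U = x - x.mean()                       # U_i(theta_hat), mean example again
L = int(n ** (1 / 3))                  # lag window grows with n, but slowly

meat = np.sum(U ** 2)
for lag in range(1, L + 1):
    meat += 2 * np.sum(U[lag:] * U[:-lag])

# truth: var[sum_i U_i] = n*(1 + 0.9^2) + 2*(n - 1)*0.9 for this MA(1)
print(meat, n * 1.81 + 2 * (n - 1) * 0.9)
```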
So, in conclusion, sometimes it seems like it should matter that we're using $\hat\theta$ instead of $\theta_0$ in the sandwich estimator but it really doesn't, and sometimes it doesn't seem like it should matter but it really does.
[^1]: meat for people who take their metaphors too seriously, cheese for people who take their metaphors too seriously and are vegetarian

[^2]: yes, under some assumptions. I'm happy to assume as many finite moments as I need

[^3]: which I called centering bias in my PhD thesis

[^4]: well, "centring bias", because Americans
[^5]: you might ask: what if we used $\theta_0$ in the estimator instead of $\hat\theta$? It still doesn't work: $\sum_{i,j} U_i(\theta_0)U_j(\theta_0)^T$ is unbiased for the variance but it's not consistent
[^6]: except perhaps to ChatGPT

[^7]: I called it truncation bias in my thesis, which Americans don't spell weirdly

[^8]: If you want to make assumptions on the outcome variable that imply correlation thresholds on the estimating functions, you need to look up mixing coefficients