
Estimator vs estimate

In sandwich variance estimators, the middle of the sandwich[^1] is the variance of the estimating function. If we have independent observations and estimate $\theta$ by solving $\sum_i U_i(\theta)=0$, we want $\operatorname{var}\left[\sum_i U_i(\theta)\right]$, which we estimate by $\sum_i U_i(\hat\theta)U_i(\hat\theta)^T$. There are two issues here you might miss.
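
For orientation, here is one standard way to write the whole sandwich, with the middle as above (the names $A_n$ and $B_n$ are just my shorthand for this post, not notation from anywhere in particular):

$$
A_n = -\sum_i \left.\frac{\partial U_i(\theta)}{\partial\theta^T}\right|_{\theta=\hat\theta},
\qquad
B_n = \sum_i U_i(\hat\theta)\,U_i(\hat\theta)^T,
\qquad
\widehat{\operatorname{var}}[\hat\theta] \approx A_n^{-1} B_n A_n^{-T}.
$$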

First, some people (younger me, for example) worry that $\sum_i U_i(\hat\theta)$ is always zero, by construction, so that surely $\operatorname{var}\left[\sum_i U_i(\hat\theta)\right]$ will also be zero. This is a nice simple mistake of confusing the random variable $\hat\theta$ with its value in a particular data set. In our data, we might have $\hat\theta=42.69$, so that $\sum_i U_i(42.69)=0$. In other data sets sampled from the same distribution, $\hat\theta$ will have some other value and $\sum_i U_i(42.69)$ should be small, but it won't be zero. When we write $\operatorname{var}\left[\sum_i U_i(\hat\theta)\right]$ the notation $\hat\theta$ doesn't mean the estimator as a random variable, it means the value that the estimator took in this sample. We want $\operatorname{var}_{\theta=42.69}\left[\sum_i U_i(42.69)\right]$.
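
A tiny simulation of exactly this point, using the simplest possible estimating function $U_i(\theta)=x_i-\theta$ (so $\hat\theta$ is the sample mean); this is just my own illustration, not code from any package:

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimating function for the mean: U_i(theta) = x_i - theta,
# so solving sum_i U_i(theta) = 0 gives theta-hat = the sample mean.
x = rng.normal(loc=42.0, scale=3.0, size=200)
theta_hat = x.mean()                  # the estimate in *this* sample

print(np.sum(x - theta_hat))          # exactly 0 (up to rounding), by construction

# In fresh samples from the same distribution, the sum evaluated at the fixed
# *number* theta_hat (our 42.69, so to speak) is small but not zero.
for _ in range(3):
    x_new = rng.normal(loc=42.0, scale=3.0, size=200)
    print(np.sum(x_new - theta_hat))  # nonzero, with variance about n * var(X)
```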

In other data sets sampled from the same distribution we won't have $\hat\theta=42.69$, so $\operatorname{var}\left[\sum_i U_i(42.69)\right]$ is not zero, and it is a reasonable estimator of $\operatorname{var}\left[\sum_i U_i(\theta_0)\right]$ if 42.69 is close to $\theta_0$, which we know happens with high probability.

And, in addition (assuming independence), $\operatorname{var}_{\theta=42.69}\left[\sum_i U_i(42.69)\right]$ is well estimated by $\sum_i U_i(42.69)U_i(42.69)^T$, which we can evaluate without assuming a particular parametric model to compute the variances. We can prove this without a great deal of subtlety: just use Chebyshev's inequality on the iid sum.
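
As a sanity check of that claim, again with the toy mean example (the Monte Carlo "truth" below is my own illustration, under an assumed normal model only so that we know the right answer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta0, sigma = 200, 42.0, 3.0

# Monte Carlo "truth": the variance of sum_i U_i(theta0) over repeated samples.
sums_at_theta0 = [np.sum(rng.normal(theta0, sigma, n) - theta0) for _ in range(20000)]
print(np.var(sums_at_theta0))         # about n * sigma^2 = 1800

# Plug-in estimate from a single sample, using theta-hat instead of theta0,
# and not using the normal model anywhere.
x = rng.normal(theta0, sigma, n)
theta_hat = x.mean()
print(np.sum((x - theta_hat) ** 2))   # typically close to 1800
```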

Things get more interesting when we have dependence. We need to include some crossproduct terms $U_iU_j^T$ in the variance calculation, representing non-zero covariances of $U_i$ and $U_j$. It's still true[^2] that $E\left[\sum_{i,j} U_i(\theta_0)U_j(\theta_0)^T\right]=\operatorname{var}\left[\sum_i U_i(\theta_0)\right]$, but it's no longer true that

$$\sum_{i,j} U_i(\hat\theta)\,U_j(\hat\theta)^T \approx \operatorname{var}\left[\sum_i U_i(\theta_0)\right].$$

The left-hand side of this is identically zero, just like it wasn't for the iid case. The left-hand side evaluates to $\left(\sum_i U_i(\hat\theta)\right)\left(\sum_i U_i(\hat\theta)\right)^T=0\,0^T$. That's actually a bit surprising. The bias[^3] [^4] from using $\hat\theta$ instead of $\theta_0$ has grown from asymptotically negligible to the whole estimator.[^5]
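
Continuing the same toy example, just to watch the collapse happen numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(42.0, 3.0, 200)
U = x - x.mean()                      # U_i(theta-hat), which sum to zero

# The all-pairs plug-in sum_{i,j} U_i U_j is just (sum_i U_i)^2, i.e. zero.
print(sum(U[i] * U[j] for i in range(len(U)) for j in range(len(U))))  # ~ 0
print(np.sum(U) ** 2)                                                  # ~ 0
```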

The difference between $\theta_0$ and $\hat\theta$ is $O_p(n^{-1/2})$, so the difference between $U_i(\theta_0)^2$ and $U_i(\hat\theta)^2$ is $O_p(n^{-1})$, giving a bias of size $O_p(1)$ in a variance matrix that's of order $n$. So in the iid case the bias is asymptotically negligible. In the completely correlated case we have $n^2$ copies of the bias, so it's only $O_p(n)$. That doesn't prove there's a problem, since $O_p$ terms only give upper bounds, but we already knew there was a problem and it makes sense.

This counting argument suggests that we need $o(n^2)$ terms in the sum to get a sandwich estimator that works. Or, given that we'd like a little margin to get a rate of convergence, maybe $O(n^{2-\delta})$ terms. That's potentially a problem – $n\times n$ is not $O(n^{2-\delta})$[^6] – but in some settings we know that some of the $\operatorname{cov}[U_i,U_j]$ terms are exactly zero and we can leave them out.

In longitudinal data with $m$ observations on each of $M$ units, we have $n=mM$ data points, but all pairs $(i,j)$ of observations on two distinct units will have zero covariance. We can replace $U_i(\hat\theta)U_j(\hat\theta)^T$ for those pairs by just 0. The number of pairs we have left is $Mm^2$, so if $M\to\infty$ we have $Mm^2/n^2\to 0$, and if $M\to\infty$ and $m$ is bounded by some fixed power of $M$ we have $Mm^2=O(n^{2-\delta})$. In crossed clustering, with $m$ and $M$ distinct groups of the two types, again we need $m,M\to\infty$ and $m$ bounded above and below by some power of $M$. In both these cases the proofs are again just brute-force counting plus applications of Chebyshev's inequality and a suitable Taylor series expansion.
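
Here's a minimal sketch of that replacement in the scalar case (my own toy example; with vector-valued $U_i$ the squares become outer products): keep only the within-unit pairs, which amounts to summing the squared within-unit totals.

```python
import numpy as np

def clustered_middle(U, cluster):
    """Sum over units of (sum of U_i within the unit)^2: pairs (i, j) from
    two distinct units are dropped, i.e. replaced by zero."""
    U = np.asarray(U, dtype=float)
    total = 0.0
    for c in np.unique(cluster):
        s = U[cluster == c].sum()
        total += s * s
    return total

# Toy example: M = 50 units, m = 4 observations each, with within-unit
# correlation induced by a shared unit effect.
rng = np.random.default_rng(4)
M, m = 50, 4
cluster = np.repeat(np.arange(M), m)
unit_effect = rng.normal(0, 1, M)
x = 42.0 + unit_effect[cluster] + rng.normal(0, 1, M * m)
U = x - x.mean()                      # U_i(theta-hat) for the mean

print(clustered_middle(U, cluster))   # nonzero, unlike the all-pairs version
print(np.sum(U) ** 2)                 # the all-pairs version: ~ 0
```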

For time series and spatial data it's a bit more tricky, since we don't ever have exact independence of $U_i$ and $U_j$. Here we need to drop the $(i,j)$ terms where $\operatorname{cov}[U_i,U_j]$ is small enough, and we need to choose the threshold for "small enough" to get stricter with increasing $n$, at a rate that lets us use Chebyshev's inequality on the non-zero terms we keep but still bound the bias from dropping terms.[^7] [^8]
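
For a stationary time series, one concrete shape this can take is a lag cutoff: keep pairs at most $L$ apart and drop the rest, with $L$ growing slowly with $n$. This sketch (my own, with a hard cutoff rather than the smooth Bartlett weights you'd see in a Newey-West estimator) shows the trade-off on a toy AR(1) example:

```python
import numpy as np

def truncated_middle(U, L):
    """Estimate var[sum_i U_i] keeping only pairs within lag L of each other:
    sum_i U_i^2 + 2 * sum_{k=1..L} sum_i U_i U_{i+k}. Pairs further apart are
    assumed to have negligible covariance and are dropped."""
    U = np.asarray(U, dtype=float)
    total = np.sum(U * U)
    for k in range(1, L + 1):
        total += 2.0 * np.sum(U[:-k] * U[k:])
    return total

# Toy AR(1) series, estimating the mean; correlations decay but never hit zero.
rng = np.random.default_rng(5)
n, rho = 2000, 0.6
e = rng.normal(0, 1, n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):
    x[t] = rho * x[t - 1] + e[t]
U = x - x.mean()

print(truncated_middle(U, L=0))   # drops all the dependence: too small
print(truncated_middle(U, L=20))  # keeps the nearby pairs: much closer to the truth
# Keeping every pair (L = n - 1) would give (sum U)^2 = 0 again.
```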

So, in conclusion: sometimes it seems like it should matter that we're using $\hat\theta$ instead of $\theta_0$ in the sandwich estimator but it really doesn't, and sometimes it doesn't seem like it should matter but it really does.


[^1]: meat for people who take their metaphors too seriously, cheese for people who take their metaphors too seriously and are vegetarian

[^2]: yes, under some assumptions. I'm happy to assume as many finite moments as I need

[^3]: which I called centering bias in my PhD thesis

[^4]: well, "centring bias", because Americans

[^5]: you might ask: what if we used $\theta_0$ in the estimator instead of $\hat\theta$? It still doesn't work: $\left(\sum_i U_i(\theta_0)\right)^2$ is unbiased for the variance but it's not consistent

[^6]: except perhaps to ChatGPT

[^7]: I called it truncation bias in my thesis, which Americans don't spell weirdly

[^8]: If you want to make assumptions on the outcome variable that imply correlation thresholds on the estimating functions, you need to look up mixing coefficients