
Estimator vs estimate

In sandwich variance estimators, the middle of the sandwich[^1] is the variance of the estimating function. If we have independent observations and estimate $\theta$ by solving $\sum_i U_i(\theta)=0$, we want $\operatorname{var}\left[\sum_i U_i(\theta)\right]$, which we estimate by $\sum_i U_i(\hat\theta)U_i(\hat\theta)^T$. There are two issues here you might miss.
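
For orientation, here is one standard way to write the whole sandwich, with the middle as above (the names $A_n$ and $B_n$ are just my shorthand for this post, not notation from anywhere in particular):

$$
A_n = -\sum_i \left.\frac{\partial U_i(\theta)}{\partial\theta^T}\right|_{\theta=\hat\theta},
\qquad
B_n = \sum_i U_i(\hat\theta)\,U_i(\hat\theta)^T,
\qquad
\widehat{\operatorname{var}}[\hat\theta] \approx A_n^{-1} B_n A_n^{-T}.
$$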

First, some people (younger me, for example) worry that $\sum_i U_i(\hat\theta)$ is always zero, by construction, so that surely $\operatorname{var}\left[\sum_i U_i(\hat\theta)\right]$ will also be zero. This is a nice simple mistake of confusing the random variable $\hat\theta$ with its value in a particular data set. In our data, we might have $\hat\theta=42.69$, so that $\sum_i U_i(42.69)=0$. In other data sets sampled from the same distribution, $\hat\theta$ will have some other value and $\sum_i U_i(42.69)$ should be small, but it won't be zero. When we write $\operatorname{var}\left[\sum_i U_i(\hat\theta)\right]$ the notation $\hat\theta$ doesn't mean the estimator as a random variable, it means the value that the estimator took in this sample. We want $\operatorname{var}_{\theta=42.69}\left[\sum_i U_i(42.69)\right]$.
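
A tiny simulation of exactly this point, using the simplest possible estimating function $U_i(\theta)=x_i-\theta$ (so $\hat\theta$ is the sample mean); this is just my own illustration, not code from any package:

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimating function for the mean: U_i(theta) = x_i - theta,
# so solving sum_i U_i(theta) = 0 gives theta-hat = the sample mean.
x = rng.normal(loc=42.0, scale=3.0, size=200)
theta_hat = x.mean()                  # the estimate in *this* sample

print(np.sum(x - theta_hat))          # exactly 0 (up to rounding), by construction

# In fresh samples from the same distribution, the sum evaluated at the fixed
# *number* theta_hat (our 42.69, so to speak) is small but not zero.
for _ in range(3):
    x_new = rng.normal(loc=42.0, scale=3.0, size=200)
    print(np.sum(x_new - theta_hat))  # nonzero, with variance about n * var(X)
```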

In other data sets sampled from the same distribution we won't have $\hat\theta=42.69$, so $\operatorname{var}\left[\sum_i U_i(42.69)\right]$ is not zero, and it is a reasonable estimator of $\operatorname{var}\left[\sum_i U_i(\theta_0)\right]$ if 42.69 is close to $\theta_0$, which we know happens with high probability.

And, in addition (assuming independence), $\operatorname{var}_{\theta=42.69}\left[\sum_i U_i(42.69)\right]$ is well estimated by $\sum_i U_i(42.69)U_i(42.69)^T$, which we can evaluate without assuming a particular parametric model to compute the variances. We can prove this without a great deal of subtlety: just use Chebyshev's inequality on the iid sum.
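
As a sanity check of that claim, again with the toy mean example (the Monte Carlo "truth" below is my own illustration, under an assumed normal model only so that we know the right answer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, theta0, sigma = 200, 42.0, 3.0

# Monte Carlo "truth": the variance of sum_i U_i(theta0) over repeated samples.
sums_at_theta0 = [np.sum(rng.normal(theta0, sigma, n) - theta0) for _ in range(20000)]
print(np.var(sums_at_theta0))         # about n * sigma^2 = 1800

# Plug-in estimate from a single sample, using theta-hat instead of theta0,
# and not using the normal model anywhere.
x = rng.normal(theta0, sigma, n)
theta_hat = x.mean()
print(np.sum((x - theta_hat) ** 2))   # typically close to 1800
```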

Things get more interesting when we have dependence. We need to include some crossproduct terms $U_iU_j^T$ in the variance calculation, representing non-zero covariances of $U_i$ and $U_j$. It's still true[^2] that $E\left[\sum_{i,j} U_i(\theta_0)U_j(\theta_0)^T\right]=\operatorname{var}\left[\sum_i U_i(\theta_0)\right]$, but it's no longer true that

$$\sum_{i,j} U_i(\hat\theta)\,U_j(\hat\theta)^T \approx \operatorname{var}\left[\sum_i U_i(\theta_0)\right].$$

The left-hand side of this is identically zero, just like it wasn't for the iid case. The left-hand side evaluates to $\left(\sum_i U_i(\hat\theta)\right)\left(\sum_i U_i(\hat\theta)\right)^T=0\,0^T$. That's actually a bit surprising. The bias[^3] [^4] from using $\hat\theta$ instead of $\theta_0$ has grown from asymptotically negligible to the whole estimator.[^5]
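
Continuing the same toy example, just to watch the collapse happen numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(42.0, 3.0, 200)
U = x - x.mean()                      # U_i(theta-hat), which sum to zero

# The all-pairs plug-in sum_{i,j} U_i U_j is just (sum_i U_i)^2, i.e. zero.
print(sum(U[i] * U[j] for i in range(len(U)) for j in range(len(U))))  # ~ 0
print(np.sum(U) ** 2)                                                  # ~ 0
```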

The difference between $\theta_0$ and $\hat\theta$ is $O_p(n^{-1/2})$, so the difference between $U_i(\theta_0)^2$ and $U_i(\hat\theta)^2$ is $O_p(n^{-1})$, giving a bias of size $O_p(1)$ in a variance matrix that's of order $n$. So in the iid case the bias is asymptotically negligible. In the completely correlated case we have $n^2$ copies of the bias, so it's only $O_p(n)$. That doesn't prove there's a problem, since $O_p$ terms only give upper bounds, but we already knew there was a problem and it makes sense.

This counting argument suggests that we need $o(n^2)$ terms in the sum to get a sandwich estimator that works. Or, given that we'd like a little margin to get a rate of convergence, maybe $O(n^{2-\delta})$ terms. That's potentially a problem – $n\times n$ is not $O(n^{2-\delta})$[^6] – but in some settings we know that some of the $\operatorname{cov}[U_i,U_j]$ terms are exactly zero and we can leave them out.

In longitudinal data with $m$ observations on each of $M$ units, we have $n=mM$ data points, but all pairs $(i,j)$ of observations on two distinct units will have zero covariance. We can replace $U_i(\hat\theta)U_j(\hat\theta)^T$ for those pairs by just 0. The number of pairs we have left is $Mm^2$, so if $M\to\infty$ we have $Mm^2/n^2\to 0$, and if $M\to\infty$ and $m$ is bounded by some fixed power of $M$ we have $Mm^2=O(n^{2-\delta})$. In crossed clustering, with $m$ and $M$ distinct groups of the two types, again we need $m,M\to\infty$ and $m$ bounded above and below by some power of $M$. In both these cases the proofs are again just brute-force counting plus applications of Chebyshev's inequality and a suitable Taylor series expansion.
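
Here's a minimal sketch of that replacement in the scalar case (my own toy example; with vector-valued $U_i$ the squares become outer products): keep only the within-unit pairs, which amounts to summing the squared within-unit totals.

```python
import numpy as np

def clustered_middle(U, cluster):
    """Sum over units of (sum of U_i within the unit)^2: pairs (i, j) from
    two distinct units are dropped, i.e. replaced by zero."""
    U = np.asarray(U, dtype=float)
    total = 0.0
    for c in np.unique(cluster):
        s = U[cluster == c].sum()
        total += s * s
    return total

# Toy example: M = 50 units, m = 4 observations each, with within-unit
# correlation induced by a shared unit effect.
rng = np.random.default_rng(4)
M, m = 50, 4
cluster = np.repeat(np.arange(M), m)
unit_effect = rng.normal(0, 1, M)
x = 42.0 + unit_effect[cluster] + rng.normal(0, 1, M * m)
U = x - x.mean()                      # U_i(theta-hat) for the mean

print(clustered_middle(U, cluster))   # nonzero, unlike the all-pairs version
print(np.sum(U) ** 2)                 # the all-pairs version: ~ 0
```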

For time series and spatial data it's a bit more tricky, since we don't ever have exact independence of $U_i$ and $U_j$. Here we need to drop the $(i,j)$ terms where $\operatorname{cov}[U_i,U_j]$ is small enough, and we need to choose the threshold for "small enough" to get stricter with increasing $n$, at a rate that lets us use Chebyshev's inequality on the non-zero terms we keep but still bound the bias from dropping terms.[^7] [^8]
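
For a stationary time series, one concrete shape this can take is a lag cutoff: keep pairs at most $L$ apart and drop the rest, with $L$ growing slowly with $n$. This sketch (my own, with a hard cutoff rather than the smooth Bartlett weights you'd see in a Newey-West estimator) shows the trade-off on a toy AR(1) example:

```python
import numpy as np

def truncated_middle(U, L):
    """Estimate var[sum_i U_i] keeping only pairs within lag L of each other:
    sum_i U_i^2 + 2 * sum_{k=1..L} sum_i U_i U_{i+k}. Pairs further apart are
    assumed to have negligible covariance and are dropped."""
    U = np.asarray(U, dtype=float)
    total = np.sum(U * U)
    for k in range(1, L + 1):
        total += 2.0 * np.sum(U[:-k] * U[k:])
    return total

# Toy AR(1) series, estimating the mean; correlations decay but never hit zero.
rng = np.random.default_rng(5)
n, rho = 2000, 0.6
e = rng.normal(0, 1, n)
x = np.empty(n)
x[0] = e[0]
for t in range(1, n):
    x[t] = rho * x[t - 1] + e[t]
U = x - x.mean()

print(truncated_middle(U, L=0))   # drops all the dependence: too small
print(truncated_middle(U, L=20))  # keeps the nearby pairs: much closer to the truth
# Keeping every pair (L = n - 1) would give (sum U)^2 = 0 again.
```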

So, in conclusion: sometimes it seems like it should matter that we're using $\hat\theta$ instead of $\theta_0$ in the sandwich estimator but it really doesn't, and sometimes it doesn't seem like it should matter but it really does.


[^1]: meat for people who take their metaphors too seriously, cheese for people who take their metaphors too seriously and are vegetarian

[^2]: yes, under some assumptions. I'm happy to assume as many finite moments as I need

[^3]: which I called centering bias in my PhD thesis

[^4]: well, "centring bias", because Americans

[^5]: you might ask: what if we used $\theta_0$ in the estimator instead of $\hat\theta$? It still doesn't work: $\left(\sum_i U_i(\theta_0)\right)^2$ is unbiased for the variance but it's not consistent

[^6]: except perhaps to ChatGPT

[^7]: I called it truncation bias in my thesis, which Americans don't spell weirdly

[^8]: If you want to make assumptions on the outcome variable that imply correlation thresholds on the estimating functions, you need to look up mixing coefficients