This post exists partly because I think the result is interesting and partly to see if anyone can point me to an original reference.
Suppose we get $\hat\beta$ by solving $\sum_{i=1}^n U_i(\beta;\hat\eta)=0$, where $\eta$ is a nuisance parameter and $\hat\eta$ is an estimate of it that we plug into the equation. Assume that for any fixed $\eta$, $E\left[U_i(\beta_0;\eta)\right]=0$ at the true parameter value $\beta_0$, and that $n^{-1}\sum_{i=1}^n U_i(\beta;\eta)$ converges pointwise (and in mean, assuming finite moments) to its expected value. Also assume enough other regularity that, for each fixed $\eta$, this leads to the usual sandwich-variance asymptotic Normality
$$\sqrt{n}\left(\hat\beta(\eta)-\beta_0\right)\stackrel{d}{\to}N\left(0,\;A(\eta)^{-1}B(\eta)A(\eta)^{-T}\right),$$
where $\hat\beta(\eta)$ solves the estimating equation with $\eta$ held fixed, $A(\eta)=E\left[-\partial U_i(\beta_0;\eta)/\partial\beta\right]$, and $B(\eta)=\operatorname{var}\left[U_i(\beta_0;\eta)\right]$.
Examples include GEE with $\eta$ as the working correlation parameters, raking with $\eta$ as the imputation model and calibration parameters, and stabilised weights with $\eta$ as the stabilising model parameters.
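None of those examples fits in a couple of lines, so here is a minimal sketch of the setup in Python, using an invented heteroscedastic-mean problem rather than any of the examples above (the data-generating model, the weight function $w(Z;\eta)=1/(1+\eta Z^2)$, and the nuisance estimator are all made up for illustration). The point is only that the estimating function $U_i(\beta;\eta)=w(Z_i;\eta)(Y_i-\beta)$ has mean zero at $\beta_0$ for every fixed $\eta$, and $\hat\eta$ is just something we estimate and plug in.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, beta0=2.0):
    # Toy data (invented for illustration): E[Y|Z] = beta0 for every Z,
    # but var(Y|Z) = 1 + Z^2, so weighting by a variance model helps efficiency.
    z = rng.uniform(0, 2, n)
    y = beta0 + np.sqrt(1 + z**2) * rng.standard_normal(n)
    return y, z

def U(beta, eta, y, z):
    # Estimating function U_i(beta; eta) = w(Z_i; eta) (Y_i - beta),
    # with working-variance weights w = 1 / (1 + eta * Z^2).
    # E[U_i(beta0; eta)] = 0 for *any* fixed eta >= 0: the key assumption.
    w = 1.0 / (1.0 + eta * z**2)
    return w * (y - beta)

def beta_hat(eta, y, z):
    # Solving sum_i U_i(beta; eta) = 0 has a closed form here: a weighted mean.
    w = 1.0 / (1.0 + eta * z**2)
    return np.sum(w * y) / np.sum(w)

def eta_hat(y, z):
    # Nuisance estimator: slope from regressing squared residuals on Z^2
    # (a crude working variance model; all that matters is that it converges).
    r2 = (y - y.mean())**2
    x = z**2
    return max(0.0, np.cov(r2, x)[0, 1] / np.var(x))

y, z = simulate(10_000)
eta = eta_hat(y, z)
print("eta-hat:", round(eta, 3))
print("beta-hat with plug-in eta-hat:", round(beta_hat(eta, y, z), 4))
print("beta-hat with eta fixed at 1 :", round(beta_hat(1.0, y, z), 4))
```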
Now, suppose we have an estimator $\hat\eta$ whose limit in probability exists; we'll call it $\eta^*$. With enough regularity to differentiate under the expectation,
$$E\left[\frac{\partial}{\partial\eta}U_i(\beta_0;\eta)\right]=\frac{\partial}{\partial\eta}E\left[U_i(\beta_0;\eta)\right]=0.$$
As the derivative has zero mean, the law of large numbers says
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\eta}U_i(\beta_0;\eta^*)\stackrel{p}{\to}0$$
and the central limit theorem says
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\eta}U_i(\beta_0;\eta^*)=O_p\!\left(n^{-1/2}\right).$$
On the other hand, the derivative with respect to $\beta$ does not have mean zero, so $\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\beta}U_i(\beta_0;\eta^*)$ is $O_p(1)$. In a parametric model it would be (minus) the average per-observation observed Fisher information.
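Continuing the invented example from above (again only a sketch under those made-up choices of $U$, $w$, and the data-generating model), you can see the two rates numerically: the averaged $\eta$-derivative of $U_i$, which has mean zero at $(\beta_0,\eta^*)$, shrinks like $n^{-1/2}$, while the averaged $\beta$-derivative settles down at a non-zero constant.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, eta_star = 2.0, 1.0  # true beta and the (assumed) probability limit of eta-hat

for n in [1_000, 10_000, 100_000, 1_000_000]:
    z = rng.uniform(0, 2, n)
    y = beta0 + np.sqrt(1 + z**2) * rng.standard_normal(n)
    w = 1.0 / (1.0 + eta_star * z**2)
    dU_deta = -(z**2) * w**2 * (y - beta0)   # mean zero, so its average is O_p(n^{-1/2})
    dU_dbeta = -w                            # mean E[-w] != 0, so its average is O_p(1)
    print(n,
          " avg d/deta:", f"{dU_deta.mean():+.5f}",
          " sqrt(n) * avg d/deta:", f"{np.sqrt(n) * dU_deta.mean():+.3f}",
          " avg d/dbeta:", f"{dU_dbeta.mean():+.4f}")
```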
A Taylor series expansion about $(\beta_0,\eta^*)$ gives
$$
\begin{aligned}
0=\frac{1}{n}\sum_{i=1}^n U_i(\hat\beta;\hat\eta)
&=\frac{1}{n}\sum_{i=1}^n U_i(\beta_0;\eta^*)
+\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\eta}\right](\hat\eta-\eta^*)
+\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\beta}\right](\hat\beta-\beta_0)\\
&\qquad+O_p\!\left(\|\hat\eta-\eta^*\|^2\right)
+O_p\!\left(\|\hat\beta-\beta_0\|\,\|\hat\eta-\eta^*\|+\|\hat\beta-\beta_0\|^2\right),
\end{aligned}
$$
with the derivatives evaluated at $(\beta_0,\eta^*)$. If $\|\hat\eta-\eta^*\|=o_p(n^{-1/4})$ (and $\|\hat\beta-\beta_0\|$ is also $o_p(n^{-1/4})$, which follows from consistency under these conditions) then the second, fourth, and fifth terms are $o_p(n^{-1/2})$, so
$$\frac{1}{n}\sum_{i=1}^n U_i(\beta_0;\eta^*)+\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\beta}\right](\hat\beta-\beta_0)=o_p\!\left(n^{-1/2}\right).$$
Under the standard smoothness/moment assumptions we can rearrange to
$$\sqrt{n}\left(\hat\beta-\beta_0\right)=-\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\beta}\right]^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n U_i(\beta_0;\eta^*)+o_p(1),$$
so the distribution of $\sqrt{n}(\hat\beta-\beta_0)$ depends on $\hat\eta$ only through $\eta^*$. ◼️
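And a quick Monte Carlo check of the conclusion, still with the invented example rather than any of the real ones: plugging in $\hat\eta$ and plugging in its limit $\eta^*$ (which is $1$ for this data-generating model) give essentially the same sampling distribution for $\sqrt{n}(\hat\beta-\beta_0)$, and the two estimators are nearly perfectly correlated across replications.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, n, reps = 2.0, 2_000, 2_000

def one_rep():
    z = rng.uniform(0, 2, n)
    y = beta0 + np.sqrt(1 + z**2) * rng.standard_normal(n)
    # Nuisance estimate: slope of squared residuals on Z^2 (converges to 1 here).
    r2 = (y - y.mean())**2
    eta = max(0.0, np.cov(r2, z**2)[0, 1] / np.var(z**2))
    def bhat(e):
        w = 1.0 / (1.0 + e * z**2)
        return np.sum(w * y) / np.sum(w)
    return bhat(eta), bhat(1.0)   # plug-in eta-hat vs. eta fixed at its limit eta* = 1

est = np.array([one_rep() for _ in range(reps)])
scaled = np.sqrt(n) * (est - beta0)
print("sd of sqrt(n)(beta-hat - beta0), eta-hat plugged in:", scaled[:, 0].std().round(3))
print("sd of sqrt(n)(beta-hat - beta0), eta fixed at eta* :", scaled[:, 1].std().round(3))
print("correlation between the two estimators:", np.corrcoef(est.T)[0, 1].round(4))
```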
For most purposes the fourth-root condition doesn't really matter: if you have a fixed finite-dimensional parameter that you can estimate at all, you can probably estimate it at root-$n$ rate, and if your parameters are infinite-dimensional or growing in size with $n$ you need to worry about more than just powers of $n$ in remainders. However, if you needed root-$n$ convergence of $\hat\eta$ you'd worry that low efficiency would be a problem in sub-asymptotic settings, which is less of a worry if you know fourth-root consistency is enough.
I worked this argument out for the GEE case, back when I was a PhD student, but I certainly wasn’t the first person to do so. I have been told that the first person to come up with the fourth-root part of it was Whitney Newey, which would make sense, but I don’t have a reference. If you know that reference or any early (mid 90s or earlier) reference, I’d like to hear about it.
The 1986 Biometrika GEE paper (Liang & Zeger) has the essential idea that the asymptotic distribution of $\hat\beta$ does not depend on how the working-correlation parameters are estimated, but it assumes $\sqrt{n}$-consistency for $\hat\alpha$. Also, some people at the time (and since) have been confused by its using ‘consistency’ both for the assumption that $\hat\beta$ converges to its true value and the assumption that $\hat\alpha$ converges to something.