This post exists partly because I think the result is interesting and partly to see if anyone can point me to an original reference.
Suppose we get $\hat\beta$ by solving $\sum_{i=1}^n U_i(\beta;\hat\eta)=0$, where $\eta$ is a nuisance parameter and $\hat\eta$ is an estimate of it that we plug into the equation. Assume that for any fixed $\eta$, $E\left[U_i(\beta_0;\eta)\right]=0$ at the true parameter value $\beta_0$, and that $n^{-1}\sum_{i=1}^n U_i(\beta;\eta)$ converges pointwise (and in mean, assuming finite moments) to its expected value. Also assume enough other regularity that, for each fixed $\eta$, this leads to the usual sandwich-variance asymptotic Normality
$$\sqrt{n}\left(\hat\beta(\eta)-\beta_0\right)\stackrel{d}{\to}N\left(0,\;A(\eta)^{-1}B(\eta)A(\eta)^{-T}\right),$$
where $\hat\beta(\eta)$ solves the estimating equation with $\eta$ held fixed, $A(\eta)=E\left[-\partial U_i(\beta_0;\eta)/\partial\beta\right]$, and $B(\eta)=\operatorname{var}\left[U_i(\beta_0;\eta)\right]$.
Examples include GEE with $\eta$ as the working correlation parameters, raking with $\eta$ as the imputation model and calibration parameters, and stabilised weights with $\eta$ as the stabilising model parameters.
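None of those examples fits in a couple of lines, so here is a minimal sketch of the setup in Python, using an invented heteroscedastic-mean problem rather than any of the examples above (the data-generating model, the weight function $w(Z;\eta)=1/(1+\eta Z^2)$, and the nuisance estimator are all made up for illustration). The point is only that the estimating function $U_i(\beta;\eta)=w(Z_i;\eta)(Y_i-\beta)$ has mean zero at $\beta_0$ for every fixed $\eta$, and $\hat\eta$ is just something we estimate and plug in.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, beta0=2.0):
    # Toy data (invented for illustration): E[Y|Z] = beta0 for every Z,
    # but var(Y|Z) = 1 + Z^2, so weighting by a variance model helps efficiency.
    z = rng.uniform(0, 2, n)
    y = beta0 + np.sqrt(1 + z**2) * rng.standard_normal(n)
    return y, z

def U(beta, eta, y, z):
    # Estimating function U_i(beta; eta) = w(Z_i; eta) (Y_i - beta),
    # with working-variance weights w = 1 / (1 + eta * Z^2).
    # E[U_i(beta0; eta)] = 0 for *any* fixed eta >= 0: the key assumption.
    w = 1.0 / (1.0 + eta * z**2)
    return w * (y - beta)

def beta_hat(eta, y, z):
    # Solving sum_i U_i(beta; eta) = 0 has a closed form here: a weighted mean.
    w = 1.0 / (1.0 + eta * z**2)
    return np.sum(w * y) / np.sum(w)

def eta_hat(y, z):
    # Nuisance estimator: slope from regressing squared residuals on Z^2
    # (a crude working variance model; all that matters is that it converges).
    r2 = (y - y.mean())**2
    x = z**2
    return max(0.0, np.cov(r2, x)[0, 1] / np.var(x))

y, z = simulate(10_000)
eta = eta_hat(y, z)
print("eta-hat:", round(eta, 3))
print("beta-hat with plug-in eta-hat:", round(beta_hat(eta, y, z), 4))
print("beta-hat with eta fixed at 1 :", round(beta_hat(1.0, y, z), 4))
```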
Now, suppose we have an estimator $\hat\eta$ whose limit in probability exists; we'll call it $\eta^*$. With enough regularity to differentiate under the expectation,
$$E\left[\frac{\partial}{\partial\eta}U_i(\beta_0;\eta)\right]=\frac{\partial}{\partial\eta}E\left[U_i(\beta_0;\eta)\right]=0.$$
As the derivative has zero mean, the law of large numbers says
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\eta}U_i(\beta_0;\eta^*)\stackrel{p}{\to}0$$
and the central limit theorem says
$$\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\eta}U_i(\beta_0;\eta^*)=O_p\!\left(n^{-1/2}\right).$$
On the other hand, the derivative with respect to $\beta$ does not have mean zero, so $\frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\beta}U_i(\beta_0;\eta^*)$ is $O_p(1)$. In a parametric model it would be (minus) the average per-observation observed Fisher information.
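Continuing the invented example from above (again only a sketch under those made-up choices of $U$, $w$, and the data-generating model), you can see the two rates numerically: the averaged $\eta$-derivative of $U_i$, which has mean zero at $(\beta_0,\eta^*)$, shrinks like $n^{-1/2}$, while the averaged $\beta$-derivative settles down at a non-zero constant.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, eta_star = 2.0, 1.0  # true beta and the (assumed) probability limit of eta-hat

for n in [1_000, 10_000, 100_000, 1_000_000]:
    z = rng.uniform(0, 2, n)
    y = beta0 + np.sqrt(1 + z**2) * rng.standard_normal(n)
    w = 1.0 / (1.0 + eta_star * z**2)
    dU_deta = -(z**2) * w**2 * (y - beta0)   # mean zero, so its average is O_p(n^{-1/2})
    dU_dbeta = -w                            # mean E[-w] != 0, so its average is O_p(1)
    print(n,
          " avg d/deta:", f"{dU_deta.mean():+.5f}",
          " sqrt(n) * avg d/deta:", f"{np.sqrt(n) * dU_deta.mean():+.3f}",
          " avg d/dbeta:", f"{dU_dbeta.mean():+.4f}")
```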
A Taylor series expansion about $(\beta_0,\eta^*)$ gives
$$
\begin{aligned}
0=\frac{1}{n}\sum_{i=1}^n U_i(\hat\beta;\hat\eta)
&=\frac{1}{n}\sum_{i=1}^n U_i(\beta_0;\eta^*)
+\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\eta}\right](\hat\eta-\eta^*)
+\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\beta}\right](\hat\beta-\beta_0)\\
&\qquad+O_p\!\left(\|\hat\eta-\eta^*\|^2\right)
+O_p\!\left(\|\hat\beta-\beta_0\|\,\|\hat\eta-\eta^*\|+\|\hat\beta-\beta_0\|^2\right),
\end{aligned}
$$
with the derivatives evaluated at $(\beta_0,\eta^*)$. If $\|\hat\eta-\eta^*\|=o_p(n^{-1/4})$ (and $\|\hat\beta-\beta_0\|$ is also $o_p(n^{-1/4})$, which follows from consistency under these conditions) then the second, fourth, and fifth terms are $o_p(n^{-1/2})$, so
$$\frac{1}{n}\sum_{i=1}^n U_i(\beta_0;\eta^*)+\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\beta}\right](\hat\beta-\beta_0)=o_p\!\left(n^{-1/2}\right).$$
Under the standard smoothness/moment assumptions we can rearrange to
$$\sqrt{n}\left(\hat\beta-\beta_0\right)=-\left[\frac{1}{n}\sum_{i=1}^n\frac{\partial U_i}{\partial\beta}\right]^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n U_i(\beta_0;\eta^*)+o_p(1),$$
so the distribution of $\sqrt{n}(\hat\beta-\beta_0)$ depends on $\hat\eta$ only through $\eta^*$. ◼️
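And a quick Monte Carlo check of the conclusion, still with the invented example rather than any of the real ones: plugging in $\hat\eta$ and plugging in its limit $\eta^*$ (which is $1$ for this data-generating model) give essentially the same sampling distribution for $\sqrt{n}(\hat\beta-\beta_0)$, and the two estimators are nearly perfectly correlated across replications.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, n, reps = 2.0, 2_000, 2_000

def one_rep():
    z = rng.uniform(0, 2, n)
    y = beta0 + np.sqrt(1 + z**2) * rng.standard_normal(n)
    # Nuisance estimate: slope of squared residuals on Z^2 (converges to 1 here).
    r2 = (y - y.mean())**2
    eta = max(0.0, np.cov(r2, z**2)[0, 1] / np.var(z**2))
    def bhat(e):
        w = 1.0 / (1.0 + e * z**2)
        return np.sum(w * y) / np.sum(w)
    return bhat(eta), bhat(1.0)   # plug-in eta-hat vs. eta fixed at its limit eta* = 1

est = np.array([one_rep() for _ in range(reps)])
scaled = np.sqrt(n) * (est - beta0)
print("sd of sqrt(n)(beta-hat - beta0), eta-hat plugged in:", scaled[:, 0].std().round(3))
print("sd of sqrt(n)(beta-hat - beta0), eta fixed at eta* :", scaled[:, 1].std().round(3))
print("correlation between the two estimators:", np.corrcoef(est.T)[0, 1].round(4))
```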
For most purposes the fourth-root condition doesn't really matter: if you have a fixed finite-dimensional parameter that you can estimate at all, you can probably estimate it at root-$n$ rate, and if your parameters are infinite-dimensional or growing in size with $n$ you need to worry about more than just powers of $n$ in remainders. However, if you needed root-$n$ convergence of $\hat\eta$ you'd worry that low efficiency would be a problem in sub-asymptotic settings, which is less of a worry if you know fourth-root consistency is enough.
I worked this argument out for the GEE case, back when I was a PhD student, but I certainly wasn’t the first person to do so. I have been told that the first person to come up with the fourth-root part of it was Whitney Newey, which would make sense, but I don’t have a reference. If you know that reference or any early (mid 90s or earlier) reference, I’d like to hear about it.
The 1986 Biometrika GEE paper (Liang & Zeger) has the essential idea that the asymptotic distribution of $\hat\beta$ does not depend on how the working-correlation parameters are estimated, but it assumes $\sqrt{n}$-consistency for $\hat\alpha$. Also, some people at the time (and since) have been confused by its using ‘consistency’ both for the assumption that $\hat\beta$ converges to its true value and the assumption that $\hat\alpha$ converges to something.