If you have independent $X_1,\dots,X_n$ from an $N(\theta,1)$ distribution you don’t have to think too hard to work out that $\bar X_n$, the sample mean, is the right estimator of $\theta$ (unless you have quite detailed prior knowledge). As people who have taken an advanced course in mathematical statistics will know, there is a famous estimator that appears to do better.
Hodges’ estimator is given by $\hat\theta_n=\bar X_n$ if $|\bar X_n|\geq n^{-1/4}$, and $\hat\theta_n=0$ if $|\bar X_n|<n^{-1/4}$. If $\theta\neq 0$, $\hat\theta_n=\bar X_n$ for all large enough $n$ (almost surely), so $\sqrt{n}(\hat\theta_n-\theta)\rightsquigarrow N(0,1)$ just as for $\bar X_n$. On the other hand, if $\theta=0$, $\hat\theta_n=0$ for all large enough $n$, so $\hat\theta_n$ is asymptotically better than $\bar X_n$ for $\theta=0$ and asymptotically as good for any other value of $\theta$. Of course there’s something wrong with it: it sucks for $\theta$ near zero. Here’s its mean squared error:
[Figure: mean squared error of Hodges’ estimator.]
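If you want to see this for yourself, here’s a minimal simulation sketch in Python/NumPy (my code and function names, assuming the $n^{-1/4}$ threshold above); it Monte-Carlos $n\times$MSE for Hodges’ estimator and for $\bar X_n$ over a grid of $\theta$.

```python
# A minimal simulation sketch (mine, not from the post): Monte Carlo n*MSE of
# Hodges' estimator and of the sample mean, for X_i ~ N(theta, 1).
import numpy as np

rng = np.random.default_rng(0)

def hodges(xbar, n):
    """Hodges' estimator: shrink the sample mean to 0 when |xbar| < n^(-1/4)."""
    return np.where(np.abs(xbar) < n ** -0.25, 0.0, xbar)

def n_mse_curve(n, thetas, nsim=20_000):
    """Return (n*MSE of Hodges, n*MSE of the mean) at each theta, by simulation."""
    rows = []
    for theta in thetas:
        xbar = rng.normal(theta, 1 / np.sqrt(n), size=nsim)  # xbar ~ N(theta, 1/n)
        rows.append((n * np.mean((hodges(xbar, n) - theta) ** 2),
                     n * np.mean((xbar - theta) ** 2)))
    return np.array(rows)

# Hodges wins at theta = 0 and is far worse for theta on the n^(-1/4) scale.
print(n_mse_curve(1000, np.linspace(-1.5, 1.5, 7)))
```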
Even Wikipedia knows this much. What I recently got around to doing was extending this to an estimator that’s asymptotically superior to $\bar X_n$ on a dense set. This isn’t new – Le Cam did it in his PhD thesis. It may even be the same as Le Cam’s construction (which isn’t online, as far as I can tell). [Actually, Le Cam’s construction is a draft exercise in a draft chapter for David Pollard’s long-awaited ‘Asymptopia’. And it is basically the same as mine, so it’s quite likely that as a Pollard fan I got at least the idea from there.]
First, instead of just setting the estimate to zero when it’s close enough to zero, we can set it to the nearest integer when it’s close enough to an integer. Writing $[x]$ for the nearest integer to $x$, define $\hat\theta_n=[\bar X_n]$ if $|\bar X_n-[\bar X_n]|<n^{-1/4}$, with $\hat\theta_n=\bar X_n$ otherwise.
If $n$ is large enough, we can shrink to multiples of 1/2. For example, using the same threshold for closeness, if $n^{-1/4}\leq 1/4$ (that is, $n\geq 256$) there is at most one multiple of 1/2 within $n^{-1/4}$ of $\bar X_n$. If $n\geq 4096$ there is at most one multiple of 1/4 within that range.
More generally, write $[x]_J$ for the nearest multiple of $2^{-J}$ to $x$, and define $\hat\theta_{n,J}=[\bar X_n]_J$ if $|\bar X_n-[\bar X_n]_J|<n^{-1/4}$ and $\hat\theta_{n,J}=\bar X_n$ otherwise. This is well-defined if $2n^{-1/4}<2^{-J}$. For any fixed $J$, $\hat\theta_{n,J}$ satisfies $\sqrt{n}(\hat\theta_{n,J}-\theta)\to 0$ if $\theta$ is a multiple of $2^{-J}$ and $\sqrt{n}(\hat\theta_{n,J}-\theta)\rightsquigarrow N(0,1)$ otherwise.
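Here’s a sketch of this dyadic-rounding estimator in code (again my names; the fallback outside the well-defined regime is my choice).

```python
# Sketch of the dyadic-rounding estimator described above: round xbar to the
# nearest multiple of 2^(-J) when it is within n^(-1/4) of one.
import numpy as np

def theta_hat(xbar, n, J):
    step = 2.0 ** -J
    if 2 * n ** -0.25 >= step:
        # more than one multiple of 2^(-J) could be within n^(-1/4): not
        # well-defined, so just return the sample mean
        return xbar
    nearest = np.round(xbar / step) * step  # nearest multiple of 2^(-J)
    return nearest if abs(xbar - nearest) < n ** -0.25 else xbar

# J = 0 is the round-to-nearest-integer version; any fixed J is superefficient
# exactly at the multiples of 2^(-J) and asymptotically N(0,1) elsewhere.
```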
The obvious thing to do now is to let $J$ increase slowly with $n$. This doesn’t work. Consider a value for $\theta$ whose binary expansion has infinitely many 1s, but with increasingly many zeroes between them. Whatever your rule for $J(n)$, there will be values of this type that are close enough to multiples of $2^{-J(n)}$ to get pulled to the wrong value infinitely often as $n$ increases. $\hat\theta_{n,J(n)}$ will be asymptotically superior to $\bar X_n$ on a dense set, but it will be asymptotically inferior on another dense set, violating the rules of the game.
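To see how such a value can be cooked up against a particular rule (this is my sketch of the argument, with my own notation): take $\theta=\sum_{k\ge 1}2^{-m_k}$ and write $T_k=\sum_{i\le k}2^{-m_i}$ for its $k$-th truncation, choosing the positions $m_1<m_2<\cdots$ recursively. Well-definedness already forces $J(n)<\tfrac14\log_2 n$, so at $n_k=2^{3m_{k+1}}$ we automatically have $J(n_k)<m_{k+1}$, and by taking $m_{k+1}$ large enough we can also ensure $J(n_k)\ge m_k$. Then $T_k$ is the nearest multiple of $2^{-J(n_k)}$ to $\theta$, at a distance $\theta-T_k\in[2^{-m_{k+1}},2^{-m_{k+1}+1})$ that sits between $n_k^{-1/2}$ and $n_k^{-1/4}$. So with probability close to one, $\bar X_{n_k}$ is within $n_k^{-1/4}$ of $T_k$, the estimate gets pulled to $T_k$, and $\sqrt{n_k}\,|\hat\theta_{n_k,J(n_k)}-\theta|\approx\sqrt{n_k}(\theta-T_k)\ge 2^{m_{k+1}/2}\to\infty$ along the sequence $n_k$.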
What we can do is pick $J$ at random. The efficiency gain isn’t 100% as it was for fixed $J$, but it’s still there.
Let $J_n$ be a random variable with probability mass function $p_j=\Pr[J_n=j]$ on $j=0,1,2,\dots$, drawn afresh for each $n$ and independently of the $X$s. The distribution of $\hat\theta_{n,J_n}$ conditional on $J_n=j$ is the distribution of $\hat\theta_{n,j}$. If $p_j>0$ for all $j$, the probability of seeing $J_n=j$ infinitely often is 1, so we can look at the limiting distribution of $\sqrt{n}(\hat\theta_{n,J_n}-\theta)$ along subsequences with $J_n=j$. This limiting distribution is a point mass at zero if $2^j\theta$ is an integer, and $N(0,1)$ otherwise. So, $\sqrt{n}(\hat\theta_{n,J_n}-\theta)\rightsquigarrow \pi(\theta)\,\delta_0+(1-\pi(\theta))\,N(0,1)$, where $\pi(\theta)=\sum_{j:\,2^j\theta\in\mathbb{Z}}p_j$.
For a dense set of real numbers $\theta$, and in particular for all numbers representable in binary floating point, $\hat\theta_{n,J_n}$ has greater asymptotic efficiency than the efficient estimator $\bar X_n$. The disadvantage of this randomised construction is that working out the finite-sample MSE is just horrible to think about.
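Here’s a sketch of the randomised version in code. The pmf $p_j=2^{-(j+1)}$, the function names, and the example value of $\theta$ are all my choices for illustration, and `theta_hat` is repeated from the previous sketch so this block runs on its own.

```python
# Sketch of the randomised construction: draw a fresh J for each sample,
# independently of the data, and apply the dyadic-rounding estimator.
import numpy as np

rng = np.random.default_rng(1)

def theta_hat(xbar, n, J):
    """Round xbar to the nearest multiple of 2^(-J) if within n^(-1/4) of one."""
    step = 2.0 ** -J
    if 2 * n ** -0.25 >= step:
        return xbar
    nearest = np.round(xbar / step) * step
    return nearest if abs(xbar - nearest) < n ** -0.25 else xbar

def randomised_estimator(x):
    J = rng.geometric(0.5) - 1            # P(J = j) = 2^(-(j+1)), j = 0, 1, 2, ...
    return theta_hat(x.mean(), len(x), J)

def pi_theta(theta, max_j=60):
    """Limiting point mass at zero: sum of p_j over j with 2^j * theta an integer."""
    return sum(2.0 ** -(j + 1) for j in range(max_j + 1)
               if (2 ** j * theta) == int(2 ** j * theta))

print(pi_theta(0.375))   # 3/8 is a multiple of 2^(-3), so pi = 2^(-3) = 0.125
print(randomised_estimator(rng.normal(0.375, 1, size=10_000)))
```

For $\theta=3/8$, for instance, $2^j\theta$ is an integer for all $j\ge 3$, so $\pi(\theta)=\sum_{j\ge 3}2^{-(j+1)}=1/8$ and the asymptotic variance drops from 1 to 7/8.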
The other interesting thing to think about is why the ‘overflow’ heuristic doesn’t work. Why doesn’t superefficiency for all fixed $J$ translate into superefficiency for sufficiently-slowly increasing $J(n)$? As a heuristic, this sort of thing has been around since the early days of analysis, but it’s more than that: the field of non-standard analysis is basically about making it rigorous.
My guess is that for infinite $n$ the distribution of $\sqrt{n}(\hat\theta_{n,J(n)}-\theta)$ is close to the superefficient limit on the dense set only for ‘large enough’ infinite $n$, and close to $N(0,1)$ off the dense set only for ‘small enough’ infinite $n$. The failure of the heuristic is similar to the failure in Cauchy’s invalid proof that a convergent sequence of continuous functions has a continuous limit, the proof into which later analysis retconned the concepts of ‘uniform convergence’ and ‘equicontinuity’.