I’ve written a bunch of times about nearly-true models. The idea is that you have some regression model for \(Y|X\) you’re trying to fit with data from a two-phase sample with known sampling probabilities \(\pi_i\) for individual \(i\). You know \(Y\) and some auxiliary variables \(A\) for everyone, but you know \(X\) only for the subsample. If you had complete data, you’d fit a particular parametric model for \(Y|X\), with parameters \(\theta\) you’re interested in and nuisance parameters \(\eta\); call it \(P_{\theta,\eta}\).
You can assume
- the sampling model: just that the sampling probabilities are known
- the sampling+outcome model: that, in addition, \(Y|X\) truly follows \(P_{\theta,\eta}\)
Under only the sampling model, the best estimator of \(\theta\) is the optimal AIPW estimator \(\hat\theta_w\): it weights the observations where we know \(X\) by inverse probabilities based on \(\pi_i\), adjusted using \(A\). Under the sampling+outcome model you can do better, and we’ll write \(\hat\theta_e\) for the semiparametric-efficient estimator.
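As a concrete (if oversimplified) picture, here’s a small Python sketch of the setup: it simulates a two-phase sample with known \(\pi_i\) and fits the outcome model by plain inverse-probability weighting. All the names and data-generating choices are made up, and the plain weighted fit is only standing in for the optimal AIPW estimator \(\hat\theta_w\), which would additionally adjust the weights using \(A\).

```python
# Minimal simulation sketch (made-up names and distributions): a two-phase
# sample with known phase-two probabilities pi_i, and a plain inverse-probability
# weighted fit of the outcome model standing in for the optimal AIPW estimator.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

X = rng.normal(size=n)                    # expensive covariate, measured only at phase two
A = X + rng.normal(size=n)                # cheap auxiliary variable, known for everyone
Y = 1.0 + 2.0 * X + rng.normal(size=n)    # outcome model Y|X with theta = (1, 2)

# Phase two: known sampling probabilities depending on what we see at phase one (Y, A).
pi = np.clip(0.1 + 0.3 * (np.abs(A) > 1) + 0.2 * (Y > np.median(Y)), 0.05, 1.0)
insample = rng.uniform(size=n) < pi       # phase-two indicator

# Design-weighted least squares on the subsample, weights 1/pi_i.
Xs = np.column_stack([np.ones(insample.sum()), X[insample]])
ws = 1.0 / pi[insample]
theta_w = np.linalg.solve(Xs.T @ (ws[:, None] * Xs), Xs.T @ (ws * Y[insample]))

# What you'd get with complete data, for comparison.
Xfull = np.column_stack([np.ones(n), X])
theta_full = np.linalg.lstsq(Xfull, Y, rcond=None)[0]
print(theta_w, theta_full)
```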
I’m interested in how the estimators compare when the outcome model is nearly true. That is, the data actually come from a model \(Q\) which is close to \(P_{\theta,\eta}\) for some \((\theta,\eta)\), close enough that you wouldn’t be able to tell the difference given the amount of data you have. As \(n\) increases, you can tell the difference better, so \(Q\) needs to move closer: we have a sequence \(Q_n\) contiguous to the ‘nearly true’ \(P_n=P_{\theta_0,\eta_0}\).
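For concreteness, one standard way to build such a sequence (nothing specific to the two-phase setting) is to perturb \(P=P_{\theta_0,\eta_0}\) along a bounded, mean-zero direction \(g\):
\[dQ_n = \left(1+\frac{g}{\sqrt{n}}\right)dP,\qquad E_P[g]=0,\quad g \text{ bounded},\]
and Le Cam’s first lemma then gives contiguity of the product measures \(Q_n^{\,n}\) with respect to \(P^{\,n}\).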
I’m defining the ‘true’ value of \(\theta\) as the value you would estimate with complete data, where the two estimators agree. What I’ve shown in the past is that (for large enough \(n\)) you can always find \(Q_n\) where the outcome model can’t be reliably rejected but where \(\hat\theta_e\) has higher mean squared error than \(\hat\theta_w\).
The standard theoretical result in this direction is the local asymptotic minimax theorem. Here’s the version from van der Vaart & Wellner’s book:
3.11.5 Theorem (Minimax theorem). Let the sequence of experiments \((X_n,{\cal A}_n, P_{n,h} :h\in H)\) be asymptotically normal and the sequence of parameters \(\kappa_n(h)\) be regular. Suppose a tight, Borel measurable Gaussian element \(G\), as in the statement of the convolution theorem, exists. Then for every asymptotically \(B'\)-measurable estimator sequence \(T_n\) and \(\tau(B')\)-subconvex function \(\ell\), \[\sup_{I\subset H} \liminf_{n\to\infty}\sup_{h\in I}E^*_{h}\,\ell\bigl(r_n(T_n-\kappa_n(h))\bigr)\geq E[\ell(G)]\] Here the first supremum is taken over all finite subsets \(I\) of \(H\).
That might need a little translation. \(H\) is the model space, which is a (possibly infinite-dimensional) vector space. \(P_{n,h}\) is a way to define a distribution near some distribution \(P\); think of it as differing from \(P\) by \(h/\sqrt{n}\). In our setting \(r_n=\sqrt{n}\); it says how fast everything needs to scale to be just interestingly different. The parameters \(\kappa_n(h)\) are the parameters you’re interested in: in our case, \(\kappa_n(h)\) is the ‘true’ value of \(\theta\) for the distribution \(P_{n,h}\). We can tiptoe past the measurability assumption, because we’re not Jon Wellner, and the \(\tau(B')\)-subconvex function \(\ell\) in our case is just the Euclidean squared error – the point of having more generality in the theorem is to show that the result isn’t something weird about squared-error loss. \(E_h\) is the expectation under \(P_{n,h}\), and \(E\) on the right-hand side is the expectation over the distribution of \(G\). Finally, \(G\) is the limiting Normal distribution of the efficient estimator in whatever model we’re working in.
If we were working in a parametric model you could take \(I\) to be a ball of some radius \(\delta\), and the result would then say that for any \(\delta_n\to\infty\) there’s some sequence of \(P_{n,h}\) at distances \(\delta_n/\sqrt{n}\) from \(P\) where the mean squared error of \(T_n\) is asymptotically no better than it is for the efficient estimator. That is, \(T_n\) can’t be better than the efficient estimator uniformly even over very small neighbourhoods of a point. For technical reasons you can’t use balls of small radius as small neighbourhoods in the infinite-dimensional case, but the idea is the same.
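In symbols, with squared-error loss, \(r_n=\sqrt{n}\), and balls \(\|h\|\le\delta\) standing in for the finite sets \(I\), that reading is
\[\lim_{\delta\to\infty}\ \liminf_{n\to\infty}\ \sup_{\|h\|\le\delta}\ n\,E_h\!\left[\|T_n-\kappa_n(h)\|^2\right]\;\ge\;E\!\left[\|G\|^2\right].\]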
We’d normally use the local asymptotic minimax theorem with the sampling+outcome model being \(H\), in which case it would show us that no estimator \(T_n\) could beat \(\hat\theta_e\). Instead, we’re going to use it with the sampling model being \(H\) (or some well-behaved large submodel that I won’t try to specify here). The efficient estimator is now \(\hat\theta_w\), \(\hat\theta_e\) plays the role of the alternative estimator \(T_n\), and we’re working at a point \(P\) where the sampling+outcome model is true.
The theorem now talks about nearby distributions \(P_{n,h}\) where the sampling model is true but the outcome model isn’t. There are sequences of \(P_{n,h}\) converging to \(P\) where \(\hat\theta_e\) (ie, \(T_n\)) doesn’t beat the weighted estimator \(\hat\theta_w\). The efficiency advantage of \(\hat\theta_e\) doesn’t generalise even a very little way away from where the outcome model is true.
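If I spell the conclusion out for squared-error loss, writing \(G_w\) for the Gaussian limit of \(\sqrt{n}(\hat\theta_w-\theta_0)\) under \(P\), it says
\[\sup_{I\subset H}\ \liminf_{n\to\infty}\ \sup_{h\in I}\ n\,E_h\!\left[\|\hat\theta_e-\kappa_n(h)\|^2\right]\;\ge\;E\!\left[\|G_w\|^2\right],\]
so the worst-case local (scaled) mean squared error of \(\hat\theta_e\) is at least the asymptotic mean squared error of \(\hat\theta_w\).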
That’s less persuasive (I think) than my construction. First, it doesn’t show \(\hat\theta_w\) is better, just that it’s no worse. Second, the distance between the true and the nearly-true model is \(\delta_n/\sqrt{n}\) for \(\delta_n\) potentially diverging to infinity. In my construction, we reach equal mean squared error at an explicit, finite \(\delta\), and \(\hat\theta_e\) keeps getting worse for larger \(\delta\).
The reason I can do better is regularity. The full power of the local asymptotic minimax theorem is needed for estimators whose behaviour is non-smooth as a function of \(h\): these can be silly counterexamples like Hodges’s superefficient estimator, or useful ideas like the lasso, or something in between, like a regression estimator that adjusts only for statistically significant confounders.
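For a picture of what non-regular behaviour looks like, here’s a quick simulation sketch (my own toy numbers) of Hodges’s estimator, which thresholds the sample mean of Normal data at \(n^{-1/4}\): at \(\theta=0\) it beats the sample mean handily, but at \(\theta\) of order \(1/\sqrt{n}\) its scaled mean squared error is much worse.

```python
# Hodges's superefficient estimator: threshold the sample mean at n^(-1/4).
# Quick sketch comparing scaled MSE (n * mean squared error) with the plain mean
# at theta = 0 and at theta of order 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 200000

def scaled_mse(theta):
    xbar = theta + rng.normal(size=reps) / np.sqrt(n)     # sample means of N(theta, 1) data
    hodges = np.where(np.abs(xbar) >= n ** -0.25, xbar, 0.0)
    return n * np.mean((xbar - theta) ** 2), n * np.mean((hodges - theta) ** 2)

for theta in [0.0, 2 / np.sqrt(n), 5 / np.sqrt(n)]:
    print(theta, scaled_mse(theta))   # (plain mean, Hodges)
```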
A compromise estimator based on testing goodness of fit of the outcome model could mitigate the breakdown of \(\hat\theta_e\). It still couldn’t do uniformly better than \(\hat\theta_w\) when the model was only nearly true – the local asymptotic minimax theorem guarantees that. It’s conceivable that it could do well enough to be preferable.
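In case it helps to see the shape of such a compromise, here’s a sketch of the decision rule only: the two estimators and a chi-squared goodness-of-fit statistic for the outcome model are taken as given, and all the names and numbers are hypothetical.

```python
# Sketch of a pretest/compromise rule: use the efficient estimator when a
# goodness-of-fit test of the outcome model doesn't reject, otherwise fall back
# to the weighted estimator. All inputs are assumed to be computed elsewhere.
from scipy.stats import chi2

def compromise(theta_e, theta_w, gof_stat, gof_df, alpha=0.05):
    reject = gof_stat > chi2.ppf(1 - alpha, df=gof_df)
    return theta_w if reject else theta_e

# Hypothetical numbers, just to show the call.
print(compromise(theta_e=2.03, theta_w=1.95, gof_stat=3.1, gof_df=2))
```

The interesting part, obviously, is choosing the test and the cutoff, which the sketch doesn’t attempt.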