Local asymptotic minimax, and nearly-true models

I’ve written a bunch of times about nearly-true models. The idea is that you have some regression model for \(Y|X\) you’re trying to fit with data from a two-phase sample with known sampling probabilities \(\pi_i\) for individual \(i\). You know \(Y\) and some auxililary variables \(A\) for everyone, but you know \(X\) only for the subsample. If you had complete data, you’d fit a particular parametric model for \(Y|X\), with parameters \(\theta\) you’re interested in and nuisance parameters \(\eta\), call it \(P_{\theta,\eta}\).

You can assume

the sampling model: just that the sampling probabilities are known
the sampling+outcome model: that, in addition, \(Y|X\) truly follows \(P_{\theta,\eta}\)

Under only the sampling model, the best estimator of \(\theta\) is the optimal AIPW estimator \(\hat\theta_w\) : it weights the observations where we know \(X\) by probabilities based on \(\pi_i\) but adjusted using \(A\). Under the sampling+outcome models you can do better, and we’ll write \(\hat\theta_e\) for the semiparametric-efficient estimator.

I’m interested in how the estimators compare when the outcome model is nearly true. That is, the data actually come from a model \(Q\) which is close to \(P_{\theta,\eta}\) for some \((\theta,\eta)\), close enough that you wouldn’t be able to tell the difference given the amount of data you have. As \(n\) increases, you can tell the difference better, so \(Q\) needs to move closer: we have a sequence \(Q_n\) contiguous to the ‘nearly true’ \(P_n=P_{\theta_0,\eta_0}\).

I’m defining the ‘true’ value of \(\theta\) as the value you would estimate with complete data, where the two estimators agree. What I’ve shown in the past is that (for large enough \(n\)) you can always find \(Q_n\) where the outcome model can’t be reliably rejected but where \(\hat\theta_w\) has higher mean squared error than \(\hat\theta_e\)

The standard theoretical result in this direction is the local asymptotic minimax theorem. Here’s the version from van der Vaart & Wellner’s book

3.11.5 Theorem (Minimax theorem). Let the sequence of experiments \((X_n,{\cal A}_n, P_{n,h} :h\in H)\) be asymptotically normal and the sequence of parameters \(\kappa_n(h)\) be regular. Suppose a tight, Borel measurable Gaussian element \(G\), as in the statement of the convolution theorem, exists. Then for every asymptotically \(B'\)-measurable estimator sequence \(T_n\) and \(\tau(B')\)-subconvex function \(\ell\), \[\sup_{I\subset H} \liminf_{n\to\infty}\sup_{h\in I}E_{h\star}[r_n(T_n-\kappa_n(h))]\geq E[\ell(G)]\] Here the first supremum is taken over all finite subsets \(I\) of \(H\).

That might need a little translation. \(H\) is the model space, which is a (possibly infinite-dimensional) vector space. \(P_{n,h}\) is a way to define a distribution near some distribution \(P\); think of it as being different by \(h/\sqrt{n}\). In our setting \(r_n=\sqrt{n}\); it says how fast everything needs to scale to be just interesting different. The parameters \(\kappa_n(h)\) are the parameters you’re interested in: in our case, \(\kappa_n(h)\) is the ‘true’ value of \(\theta\) for the distribution \(P_{n,h}\). We can tiptoe past the measurability assumption, because we’re not Jon Wellner, and \(\tau(B')\)-subconvex function in our case is just the Euclidean squared error – the point of having more generality in the theorem is to show that the result isn’t something weird about squared-error loss. \(E_h\) is the expectation under \(P_{n,h}\) and \(E\) is the expectation under the limiting \(P\). Finally, \(G\) is the limiting Normal distribution of the efficient estimator in whatever model we’re working in.

If we were working in a parametric model you could take \(I\) to be a ball of some radius \(\delta\), and the result would then say that for any \(\delta_n\to\infty\) there’s some sequence of \(P_{n,h}\) at distances \(\delta_n/\sqrt{n}\) from \(P\) where the mean squared error of \(T_n\) is asymptotically no better than it is for the efficient estimator. That is, \(T_n\) can’t be better than the efficient estimator uniformly even over very small neighbourhoods of a point. For technical reasons you can’t use balls of small radius as small neighbourhoods in the infinite-dimensional case, but the idea is the same.

We’d normally use the local asymptotic minimax theorem with the sampling+outcome model being \(H\), in which case it would show us that no estimator \(T_n\) could beat \(\hat\theta_e\). Instead, we’re going to use it with the sampling model being \(H\) (or some well-behaved large submodel that I won’t try to specify here). The efficient estimator is \(\hat\theta_w\), and \(\hat\theta_e\) is our alternative estimator \(T_n\), and we’re working at a point \(P\) where the sampling+outcome model is true.

The theorem now talks about nearby distributions \(P_{n,h}\) where the sampling model is true but the outcome model isn’t. There are sequences of \(P_{n,h}\) converging to \(P\) where \(\hat\theta_e\) (ie, \(T_n\)) doesn’t beat the weighted estimator \(\hat\theta_w\). The efficiency advantage of \(\hat\theta_e\) doesn’t generalise even a very little way away from where the outcome model is true.

That’s less persuasive (I think) than my construction. First, it doesn’t show \(\hat\theta_w\) is better, just that it’s no worse. Second, the distance between the true and nearly-true model is \(\delta_n/\sqrt{n}\) for \(\delta_n\) potentially diverging to infinity. In my construction, we reach equal mean squared error at an explicit, finite \(\delta\), and it keeps getting worse for larger \(\delta\).

The reason I can do better is because of regularity. The full power of the local asymptotic minimax theorem is needed for estimators with unsmooth behaviour as a function of \(h\): these can be silly counterexamples like Hodges’s superefficient estimator or useful ideas like the lasso, or something in between, like a regression estimator that adjusts only for statistically significant confounders.

A compromise estimator based on testing goodness of fit of the outcome model could mitigate the breakdown of \(\hat\theta_e\). It still couldn’t do uniformly better than \(\hat\theta_w\) when the model was only nearly true – the local asymptotic minimax theorem guarantees that. It’s concievable that it could do well enough to be preferable.