I’ve written a bunch of times about nearly-true models. The idea is that you have some regression model for an outcome $Y$ that you’re trying to fit with data from a two-phase sample with known sampling probabilities $\pi_i$ for individual $i$. You know $Y$ and some auxiliary variables $A$ for everyone, but you know the covariates $X$ only for the subsample. If you had complete data, you’d fit a particular parametric model for $Y$ given $X$, with parameters you’re interested in, $\beta$, and nuisance parameters $\eta$; call it $P_{\beta,\eta}$.
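To fix notation for this post (the symbols here may not match the earlier posts): write $R_i$ for the indicator that person $i$ is selected into the phase-two subsample, so the observed data for person $i$ are
$$\bigl(Y_i,\;A_i,\;\pi_i,\;R_i,\;R_iX_i\bigr),\qquad \Pr(R_i=1\mid Y_i,A_i,X_i)=\pi_i,$$
and the complete-data model $P_{\beta,\eta}$ describes $Y$ given $X$.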
You can assume
- the sampling model: just that the sampling probabilities are known
- the sampling+outcome model: that, in addition, $Y$ given $X$ truly follows $P_{\beta,\eta}$
Under only the sampling model, the best estimator of $\beta$ is the optimal AIPW estimator $\tilde\beta$: it weights the observations where we know $X$ by probabilities based on $\pi_i$, but adjusted using the phase-one information in $Y$ and $A$. Under the sampling+outcome model you can do better, and we’ll write $\hat\beta$ for the semiparametric-efficient estimator.
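Concretely, one standard way to write an estimating equation of this AIPW form (a sketch in the notation above, not necessarily the exact version from the earlier posts) is
$$\sum_{i=1}^{n}\left[\frac{R_i}{\pi_i}\,U_i(\beta)\;-\;\frac{R_i-\pi_i}{\pi_i}\,\widehat{E}\bigl\{U_i(\beta)\mid Y_i,A_i\bigr\}\right]=0,$$
where $U_i(\beta)$ is the complete-data estimating function and $\widehat{E}\{U_i(\beta)\mid Y_i,A_i\}$ is a prediction of it from the phase-one data. The first term is the plain weighted estimator; the second term is the augmentation, and the ‘optimal’ AIPW estimator chooses the augmentation to minimise the asymptotic variance.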
I’m interested in how the estimators compare when the outcome model is nearly true. That is, the data actually come from a model $P_n$ which is close to $P_{\beta,\eta}$ for some $(\beta,\eta)$, close enough that you wouldn’t be able to tell the difference given the amount of data you have. As $n$ increases, you can tell the difference better, so $P_n$ needs to move closer: we have a sequence of ‘nearly true’ models $P_n$ contiguous to $P_{\beta,\eta}$.
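A concrete way to build such a sequence (one example, not the only construction) is along a one-dimensional path through $P_{\beta,\eta}$:
$$dP_n=\Bigl(1+\tfrac{h}{\sqrt{n}}\,g\Bigr)\,dP_{\beta,\eta},\qquad \int g\,dP_{\beta,\eta}=0,\quad g\ \text{bounded},$$
where the direction $g$ takes you outside the outcome model. For fixed $h$ the joint distributions of $n$ observations under $P_n$ and under $P_{\beta,\eta}$ are mutually contiguous, so no fixed-level goodness-of-fit test of the outcome model can have power converging to 1 against $P_n$.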
I’m defining the ‘true’ value of $\beta$ as the value you would estimate with complete data, where the two estimators agree. What I’ve shown in the past is that (for large enough $n$) you can always find $P_n$ where the outcome model can’t be reliably rejected but where $\hat\beta$ has higher mean squared error than $\tilde\beta$.
The standard theoretical result in this direction is the local asymptotic minimax theorem. Here’s the version from van der Vaart & Wellner’s book
3.11.5 Theorem (Minimax theorem). Let the sequence of experiments $(P_{n,h}\colon h\in H)$ be asymptotically normal and the sequence of parameters $\kappa_n(h)$ be regular. Suppose a tight, Borel measurable Gaussian element $G$, as in the statement of the convolution theorem, exists. Then for every asymptotically measurable estimator sequence $T_n$ and subconvex function $\ell$,
$$\sup_{I}\ \liminf_{n\to\infty}\ \sup_{h\in I}\ E_{n,h}\,\ell\bigl(r_n(T_n-\kappa_n(h))\bigr)\ \ge\ E\,\ell(G).$$
Here the first supremum is taken over all finite subsets $I$ of $H$.
That might need a little translation. $H$ is the model space, which is a (possibly infinite-dimensional) vector space. $P_{n,h}$ is a way to define a distribution near some distribution $P$; think of it as being different from $P$ by $h/r_n$. In our setting $r_n=\sqrt{n}$; it says how fast everything needs to scale to be just interestingly different. The parameters $\kappa_n(h)$ are the parameters you’re interested in: in our case, $\kappa_n(h)$ is the ‘true’ value of $\beta$ for the distribution $P_{n,h}$. We can tiptoe past the measurability assumption, because we’re not Jon Wellner, and the subconvex function $\ell$ in our case is just the Euclidean squared error – the point of having more generality in the theorem is to show that the result isn’t something weird about squared-error loss. $E_{n,h}$ is the expectation under $P_{n,h}$ and $E$ is the expectation under the limiting Gaussian. Finally, $G$ is the limiting Normal distribution of the efficient estimator in whatever model we’re working in.
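With $r_n=\sqrt{n}$ and $\ell$ the squared Euclidean norm, the display specialises to
$$\sup_{I}\ \liminf_{n\to\infty}\ \sup_{h\in I}\ n\,E_{n,h}\bigl\|T_n-\kappa_n(h)\bigr\|^2\ \ge\ E\|G\|^2:$$
the worst-case scaled mean squared error of $T_n$ over shrinking neighbourhoods of $P$ can’t be smaller than the mean squared error of the limiting Gaussian $G$.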
If we were working in a parametric model you could take $I$ to be a ball of some radius $\delta$, and the result would then say that for any estimator sequence $T_n$ there’s some sequence of distributions at distances of order $\delta/\sqrt{n}$ from $P$ where the mean squared error of $T_n$ is asymptotically no better than it is for the efficient estimator. That is, $T_n$ can’t be better than the efficient estimator uniformly even over very small neighbourhoods of a point. For technical reasons you can’t use balls of small radius as small neighbourhoods in the infinite-dimensional case, but the idea is the same.
We’d normally use the local asymptotic minimax theorem with the sampling+outcome model as $H$, in which case it would show us that no estimator could beat $\hat\beta$. Instead, we’re going to use it with the sampling model as $H$ (or some well-behaved large submodel that I won’t try to specify here). The efficient estimator is now $\tilde\beta$, our alternative estimator $\hat\beta$ plays the role of $T_n$, and we’re working at a point $P$ where the sampling+outcome model is true.
The theorem now talks about nearby distributions where the sampling model is true but the outcome model isn’t. There are sequences of $P_{n,h}$ converging to $P$ where $T_n$ (ie, $\hat\beta$) doesn’t beat the weighted estimator $\tilde\beta$. The efficiency advantage of $\hat\beta$ doesn’t generalise even a very little way away from where the outcome model is true.
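Plugging in the pieces: $T_n=\hat\beta$, and $G$ is the limiting Normal distribution of $\sqrt{n}(\tilde\beta-\beta)$, so the bound reads
$$\sup_{I}\ \liminf_{n\to\infty}\ \sup_{h\in I}\ n\,E_{n,h}\bigl\|\hat\beta-\kappa_n(h)\bigr\|^2\ \ge\ E\|G\|^2,$$
and the right-hand side is the asymptotic mean squared error of $\tilde\beta$. Over shrinking neighbourhoods where only the sampling model is guaranteed, $\hat\beta$ can’t beat $\tilde\beta$.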
That’s less persuasive (I think) than my construction. First, it doesn’t show $\tilde\beta$ is better, just that it’s no worse. Second, the distance between the true and nearly-true model is $h/\sqrt{n}$ for $h$ potentially diverging to infinity. In my construction, we reach equal mean squared error at an explicit, finite $h$, and it keeps getting worse for larger $h$.
The reason I can do better is because of regularity. The full power of the local asymptotic minimax theorem is needed for estimators with unsmooth behaviour as a function of the underlying distribution $P$: these can be silly counterexamples like Hodges’s superefficient estimator, or useful ideas like the lasso, or something in between, like a regression estimator that adjusts only for statistically significant confounders.
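For concreteness, Hodges’s example in its simplest form: with $\bar X_n$ the mean of $n$ observations from $N(\theta,1)$, take
$$\hat\theta_n=\bar X_n\,\mathbf{1}\bigl\{|\bar X_n|>n^{-1/4}\bigr\}.$$
At $\theta=0$ it has asymptotic variance zero, but at $\theta=h/\sqrt{n}$ the scaled risk $n\,E(\hat\theta_n-\theta)^2$ converges to $h^2$, which is unbounded in $h$ and already worse than the constant risk of $\bar X_n$ once $|h|>1$. The local asymptotic minimax theorem is exactly what says this sort of trick can’t pay off uniformly over neighbourhoods.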
A compromise estimator based on testing goodness of fit of the outcome model could mitigate the breakdown of $\hat\beta$. It still couldn’t do uniformly better than $\tilde\beta$ when the model was only nearly true – the local asymptotic minimax theorem guarantees that. It’s conceivable that it could do well enough to be preferable.
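One simple version of such a compromise (a sketch, not necessarily the best choice) is a pretest estimator:
$$\hat\beta_{\text{pre}}=\begin{cases}\hat\beta & \text{if a goodness-of-fit test of the outcome model doesn’t reject,}\\ \tilde\beta & \text{if it does.}\end{cases}$$
Pretest estimators are themselves non-regular, so the minimax bound applies to them too; the open question is whether the tradeoff they make is good enough in practice.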