Suppose you have $N$ people with some variables measured, and you choose a subset of $n$ of them to measure additional variables. I’m going to assume the probability $\pi_i$ that you measure the additional variables on person $i$ is known, so it has to be a setting where non-response isn’t an issue – eg, choosing which frozen blood samples to analyse, or which free-text questionnaire responses to code, or which medical records to pull for abstraction. As an example, if you have a binary outcome $Y$ you might take a case–control sample and measure $X$ on everyone with $Y=1$ and the same number of people with $Y=0$.
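As a concrete sketch of the sampling step (the function and the toy data are my own illustration, not from the text): take every case and a simple random sample of the same number of controls, so everyone’s inclusion probability $\pi_i$ is known by design.

```python
import random

def case_control_sample(ys, seed=0):
    """Sample all cases (y == 1) and an equal number of controls (y == 0).

    Returns the sampled indices and the known inclusion probability pi_i
    for every person: 1.0 for cases, n_cases / n_controls for controls.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(ys) if y == 1]
    controls = [i for i, y in enumerate(ys) if y == 0]
    sampled = cases + rng.sample(controls, len(cases))
    pi = [1.0 if y == 1 else len(cases) / len(controls) for y in ys]
    return sampled, pi

ys = [1, 0, 0, 1, 0, 0, 0, 0]          # 2 cases, 6 controls
sampled, pi = case_control_sample(ys)  # 2 cases + 2 of 6 controls; control pi = 1/3
```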
Suppose in addition that you want to fit a particular parametric or semiparametric model $P_{\theta,\eta}$ to the data, where $\theta$ are parameters of interest and $\eta$ are nuisance parameters. For example, you might want to fit a logistic regression model for $Y$ given $X$ where the coefficients are $\theta$ and the density of $X$ is $\eta$.
There are now two possible semiparametric models for the observed data. Let $R_i$ be the indicator that person $i$ is sampled. We could have
- Model D: the observations are iid from a completely unspecified distribution $P$, and $R_i$ is sampled with known probability $\pi_i$
- Model M: the submodel of D that satisfies $P=P_{\theta,\eta}$ for some $(\theta,\eta)$
Typically, estimation under model M will be more efficient. For example, in the case–control setting with a logistic regression model for $Y$ given $X$, we know that the efficient estimator under model M is unweighted logistic regression (per Prentice & Pyke 1979), and that the efficient estimator under model D is weighted logistic regression with weights $1/\pi_i$.
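A small simulation sketch of this comparison (the Newton–Raphson fitter and all parameter values are my own illustration, not from the references): when the model is correctly specified, both fits recover the slope, and the unweighted fit shifts only the intercept, which is the Prentice–Pyke result.

```python
import numpy as np

def logistic_fit(X, y, w=None, iters=25):
    """Weighted logistic regression by Newton-Raphson; X includes an intercept column."""
    n, p = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(p)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))
        grad = X.T @ (w * (y - mu))                      # weighted score
        H = (X * (w * mu * (1 - mu))[:, None]).T @ X     # weighted information
        beta = beta + np.linalg.solve(H, grad)
    return beta

rng = np.random.default_rng(1)
N = 200_000
x = rng.normal(size=N)
p = 1 / (1 + np.exp(-(-3 + 0.7 * x)))   # true intercept -3, true slope 0.7
y = rng.binomial(1, p)

# phase two: all cases, an equal number of controls, known pi_i
cases = np.flatnonzero(y == 1)
controls = rng.choice(np.flatnonzero(y == 0), size=len(cases), replace=False)
idx = np.concatenate([cases, controls])
pi = np.where(y == 1, 1.0, len(cases) / (N - len(cases)))

Xs = np.column_stack([np.ones(len(idx)), x[idx]])
ys_ = y[idx].astype(float)

beta_unw = logistic_fit(Xs, ys_)                 # efficient under model M
beta_w = logistic_fit(Xs, ys_, w=1 / pi[idx])    # efficient under model D
# both slopes land near 0.7; only the unweighted intercept is shifted
```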
I want to consider slight misspecifications, where model M is ‘nearly true’. Gross misspecifications aren’t interesting: if the data don’t look anything like a sample from $P_{\theta,\eta}$, a careful data analyst will notice and pick a different model. However, the difference between the efficient estimators under M and under D is $O_p(n^{-1/2})$, so a bias of the same order is enough to outweigh the precision gain. It’s not obvious that we should expect to detect a misspecification of this size, so more precise investigation is needed.
The efficient estimator under D is an Augmented Inverse Probability Weighted (AIPW) estimator (if you’re a biostatistician) or a calibration estimator (if you’re a survey statistician), and we can get reasonably close to it (Breslow et al, 2009). Write $\hat\theta_D$ for this estimator, and $\hat\theta_M$ for the efficient estimator under M.
Models M and D agree when there is complete data, so I will define the true value $\theta_0$ of $\theta$ as the common limit of $\hat\theta_D$ and $\hat\theta_M$ with complete data. Survey statisticians call this the ‘census estimator.’ Biostatisticians call it ‘our next grant proposal’.
We now need a mathematical characterisation of ‘nearly true’. I will use contiguity. A sequence of distributions $Q_n$ is contiguous to a sequence $P_n$ if for every sequence of events $A_n$, $P_n(A_n)\to 0$ implies $Q_n(A_n)\to 0$. They are mutually contiguous if the implication goes both ways. Let $A_n$ be the event that a model diagnostic accepts model M, and let $P_n$ be a sequence of distributions in model M. If this is a useful diagnostic, $P_n(A_n)\to 1$, so for a mutually contiguous sequence $Q_n$ of distributions in model D but not in model M, $Q_n(A_n)\to 1$.
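To make contiguity concrete, here is a toy simulation (my own illustration, not from the text): $P_n$ is $n$ draws from $N(0,1)$ and $Q_n$ is $n$ draws from $N(\tau/\sqrt{n},1)$, a standard mutually contiguous pair. A diagnostic that accepts the null model with probability 0.95 under $P_n$ still accepts it most of the time under $Q_n$ — no diagnostic can separate a contiguous alternative with certainty.

```python
import numpy as np

rng = np.random.default_rng(0)
tau, n, reps = 1.0, 400, 4000

def accept(sample):
    """Diagnostic: accept 'mean zero' when the z-statistic is within +/-1.96."""
    return abs(np.sqrt(len(sample)) * sample.mean()) < 1.96

# acceptance probability under the null sequence P_n (approaches 0.95)
acc_P = np.mean([accept(rng.normal(0, 1, n)) for _ in range(reps)])
# acceptance under the contiguous alternative Q_n: shift tau / sqrt(n)
acc_Q = np.mean([accept(rng.normal(tau / np.sqrt(n), 1, n)) for _ in range(reps)])
# acc_Q stays well away from zero, even though Q_n is not in the null model
```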
Now, under M, $\sqrt{n}(\hat\theta_M-\theta_0)\stackrel{d}{\rightarrow}N(0,\sigma^2_M)$ and $\sqrt{n}(\hat\theta_D-\theta_0)\stackrel{d}{\rightarrow}N(0,\sigma^2_D)$.
By the Convolution Theorem, the extra variance in $\hat\theta_D$ under model M is pure noise, independent of $\hat\theta_M$, so $\sqrt{n}(\hat\theta_D-\hat\theta_M)\stackrel{d}{\rightarrow}N(0,\sigma^2_D-\sigma^2_M)$.
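The convolution-theorem structure is easy to see in a toy example (mine, not from the text): take the full-sample mean as the efficient estimator and the half-sample mean as an inefficient one. The difference between them is uncorrelated with the efficient estimator, so the extra variance of the inefficient estimator is exactly the variance of that independent noise term.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 20_000
x = rng.normal(0, 1, size=(reps, n))

eff = x.mean(axis=1)                  # efficient estimator: full-sample mean
ineff = x[:, : n // 2].mean(axis=1)   # inefficient: mean of half the sample
diff = ineff - eff                    # the 'pure noise' component

# cov(eff, diff) ~ 0, so var(ineff) ~ var(eff) + var(diff):
# the inefficient estimator is the efficient one plus independent noise
```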
Now, by Le Cam’s Third Lemma, if we switch from $P_n$ to $Q_n$ as the data distribution there is no change in variance, but there is bias $b=\rho\tau\sqrt{\sigma^2_D-\sigma^2_M}$, where $\tau^2$ is the variance of the limit of the log likelihood ratio $\log dQ_n/dP_n$, which governs the power of the Neyman–Pearson test of $P_n$ vs $Q_n$, and $\rho$ measures whether the misspecification is in a direction that matters for $\theta$ or not.
Substituting back, under the contiguous misspecified model sequence $Q_n$, $\sqrt{n}(\hat\theta_D-\theta_0)\stackrel{d}{\rightarrow}N(0,\sigma^2_D)$ and $\sqrt{n}(\hat\theta_M-\theta_0)\stackrel{d}{\rightarrow}N(b,\sigma^2_M)$. So, the mean squared error of $\hat\theta_M$ is lower if $b^2<\sigma^2_D-\sigma^2_M$, ie, if $\rho^2\tau^2<1$. If $\rho=1$, this happens when $\tau<1$, at which point the most powerful test for $P_n$ vs $Q_n$ has power about 24%.
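The mean-squared-error comparison is just arithmetic; the sketch below uses illustrative variances of my own choosing, and the crossover at $\rho\tau=1$ does not depend on the particular values of $\sigma^2_M$ and $\sigma^2_D$.

```python
from math import sqrt

def mse_ratio(rho, tau, var_M=1.0, var_D=2.0):
    """MSE of the model-based estimator over MSE of the design-based one,
    under the contiguous misspecification: bias b = rho * tau * sqrt(var_D - var_M).
    """
    b = rho * tau * sqrt(var_D - var_M)
    return (var_M + b ** 2) / var_D

# the model-based estimator wins exactly when |rho * tau| < 1:
# mse_ratio(1.0, 0.5) < 1   mild misspecification, M still better
# mse_ratio(1.0, 1.0) == 1  crossover at rho * tau = 1
# mse_ratio(1.0, 1.5) > 1   M now has worse MSE despite its lower variance
```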
That is, the least-favourable misspecification of model M leads to worse mean squared error for $\hat\theta_M$ than for $\hat\theta_D$ before the most powerful test of misspecification is even moderately reliable, even if we (unrealistically) knew exactly the form of the misspecification.
Since the sense in which $\hat\theta_M$ is optimal is precisely this local asymptotic minimax sense within model M, it seems reasonable to use the same description of optimality outside the model. Under this description of optimality, the ‘efficient’ estimator’s optimality is not robust to undetectable model misspecification.