
Semiparametric efficiency and nearly-true models

Suppose you have $N$ people with some variables measured, and you choose a subset of $n$ to measure additional variables. I'm going to assume the probability $\pi_i$ that you measure the additional variables on person $i$ is known, so it has to be a setting where non-response isn't an issue – e.g., choosing which frozen blood samples to analyse, or which free-text questionnaire responses to code, or which medical records to pull for abstraction. As an example, if you have a binary outcome $Y$ you might take a case–control sample and measure $X$ on everyone with $Y=1$ and on the same number of people with $Y=0$.
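To fix ideas, here is a minimal NumPy sketch (hypothetical simulated data) of the case–control design, where $\pi_i$ is known by construction: every case is sampled, and controls are sampled without replacement so each control has inclusion probability $n_{\text{cases}}/n_{\text{controls}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Phase one: outcome Y observed on all N people.
N = 10_000
y = rng.binomial(1, 0.05, size=N)     # rare binary outcome
n_cases = int(y.sum())
controls = np.flatnonzero(y == 0)

# Case-control design: take every case and an equal number of controls.
sampled = np.zeros(N, dtype=bool)
sampled[y == 1] = True
sampled[rng.choice(controls, size=n_cases, replace=False)] = True

# Known sampling probabilities pi_i, a function of phase-one data only.
pi = np.where(y == 1, 1.0, n_cases / controls.size)
```

The design is informative: $\pi_i$ depends on $Y_i$, which is exactly why weighted and unweighted analyses can differ.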

Suppose in addition that you want to fit a particular parametric or semiparametric model $P_{\theta,\eta}$ to the data, where $\theta$ are parameters of interest and $\eta$ are nuisance parameters. For example, you might want to fit a logistic regression model where the coefficients are $\theta$ and the density of $X$ is $\eta$.

There are now two possible semiparametric models for the observed data. Let $R_i$ be the indicator that person $i$ is sampled. We could have

  • Model D: $\pi_i=E[R_i\mid\text{variables available on everyone}]$
  • Model M: the submodel of D that also satisfies $P_{\theta,\eta}$

Typically, estimation under model M will be more efficient. For example, in the case–control setting with a logistic regression model for $Y\mid X$ we know that the efficient estimator under model M is unweighted logistic regression (per Prentice & Pyke, 1979), and that the efficient estimator under model D is weighted logistic regression with weights $w_i=1/\pi_i$.
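As a concrete illustration, here is a sketch on hypothetical simulated data, with a hand-rolled Newton–Raphson fitter standing in for any logistic regression routine: the unweighted fit on the case–control sample recovers the true slope but not the intercept (per Prentice & Pyke), while the $1/\pi_i$-weighted fit recovers both.

```python
import numpy as np

def logistic_fit(X, y, w=None, iters=40):
    """Maximum (weighted) likelihood logistic regression by Newton-Raphson."""
    w = np.ones(len(y)) if w is None else w
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))                  # fitted probabilities
        grad = X.T @ (w * (y - mu))                       # weighted score
        hess = (X * (w * mu * (1 - mu))[:, None]).T @ X   # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(size=N)
beta_true = np.array([-4.0, 1.0])                 # true intercept and slope
y = rng.binomial(1, 1 / (1 + np.exp(-(beta_true[0] + beta_true[1] * x))))

# Case-control sample: all cases, an equal number of controls.
cases = np.flatnonzero(y == 1)
idx = np.concatenate([cases,
                      rng.choice(np.flatnonzero(y == 0), size=cases.size,
                                 replace=False)])
pi = np.where(y == 1, 1.0, cases.size / (y == 0).sum())

Xs = np.column_stack([np.ones(idx.size), x[idx]])
b_unwtd = logistic_fit(Xs, y[idx])                  # efficient under model M
b_wtd = logistic_fit(Xs, y[idx], w=1 / pi[idx])     # design-weighted, model D
```

Both slopes estimate the true $\beta_1=1$; the unweighted intercept is shifted by the log of the sampling odds.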

I want to consider slight misspecifications, where model M is ‘nearly true’. Gross misspecifications aren’t interesting: if the data don’t look anything like a sample from $P_{\theta,\eta}$, a careful data analyst will notice and pick a different model. However, the difference between the efficient estimators under M and under D is $O_p(n^{-1/2})$, so a bias of the same order is enough to outweigh the precision gain. It’s not obvious that we should expect to detect a misspecification of this size, so more precise investigation is needed.

The efficient estimator under D is an Augmented Inverse Probability Weighted (AIPW) estimator (if you’re a biostatistician) or a calibration estimator (if you’re a survey statistician), and we can get reasonably close to it (Breslow et al., 2009). Write $\hat\theta_{\text{wtd}}$ for this estimator, and $\hat\theta_{\text{eff}}$ for the efficient estimator under M.

Models M and D agree when there is complete data, so I will define the true value $\theta_0$ of $\theta$ as the common limit of $\hat\theta_{\text{eff}}$ and $\hat\theta_{\text{wtd}}$ with complete data. Survey statisticians call this the ‘census estimator.’ Biostatisticians call it ‘our next grant proposal’.

We now need a mathematical characterisation of ‘nearly true’. I will use contiguity. A sequence of distributions $Q_n$ is contiguous to a sequence $P_n$ if for every event $A$, $P_nA\to0$ implies $Q_nA\to0$. They are mutually contiguous if the implication goes both ways. Let $A$ be the event that a model diagnostic accepts model M, and let $P_n$ be a sequence of distributions in model M. If this is a useful diagnostic, $P_nA\to1$. Applying contiguity to the complement of $A$: for a mutually contiguous sequence of distributions $Q_n$ in model D but not in model M, $Q_nA\to1$ as well, so the diagnostic asymptotically accepts the misspecified model.

Now, under M, $\sqrt{n}(\hat\theta_{\text{eff}}-\theta_0)\stackrel{d}{\to}N(0,\sigma^2)$ and $\sqrt{n}(\hat\theta_{\text{wtd}}-\theta_0)\stackrel{d}{\to}N(0,\sigma^2+\omega^2)$.

By the Convolution Theorem, the extra variance for $\hat\theta_{\text{wtd}}$ under model M is pure noise, so $\sqrt{n}(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}})\stackrel{d}{\to}N(0,\omega^2)$.

Now, by Le Cam’s Third Lemma, if we switch from $P_n$ to $Q_n$ as the data distribution there is no change in variance, but there is a bias: $\sqrt{n}(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}})\stackrel{d}{\to}N(\kappa\rho\omega,\omega^2)$, where $\kappa^2$ is the limiting variance of the log likelihood ratio $\log dQ_n/dP_n$, which governs the power of the most powerful (Neyman–Pearson) test, and $\rho$ is the asymptotic correlation between $\sqrt{n}(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}})$ and the log likelihood ratio, measuring whether the misspecification is in a direction that matters for $\theta$ or not.
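For reference, the step behind the $\kappa\rho\omega$ mean shift is the standard third-lemma computation, written here under the joint-normality assumption of the LAN setting:

```latex
% Joint limit under P_n:
\begin{pmatrix}
  \sqrt{n}\,(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}}) \\
  \log dQ_n/dP_n
\end{pmatrix}
\stackrel{d}{\to}
N\!\left(
  \begin{pmatrix} 0 \\ -\kappa^2/2 \end{pmatrix},
  \begin{pmatrix} \omega^2 & \rho\kappa\omega \\ \rho\kappa\omega & \kappa^2 \end{pmatrix}
\right)
% Le Cam's third lemma: under Q_n the covariance reappears as a mean shift,
% \sqrt{n}\,(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}}) \stackrel{d}{\to} N(\rho\kappa\omega,\ \omega^2).
```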

Substituting back: under the contiguous misspecified model sequence $Q_n$, $\sqrt{n}(\hat\theta_{\text{eff}}-\theta_0)\stackrel{d}{\to}N(\kappa\rho\omega,\sigma^2)$ and $\sqrt{n}(\hat\theta_{\text{wtd}}-\theta_0)\stackrel{d}{\to}N(0,\sigma^2+\omega^2)$. So, the mean squared error of $\hat\theta_{\text{wtd}}$ is lower if $\kappa^2\rho^2>1$. If $\rho\approx1$, this happens when $\kappa\approx1$, at which point the most powerful test for $Q_n$ vs $P_n$ has power of only about 26% at the 5% level.
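A quick numeric check of the crossover (a sketch assuming a one-sided test at the 5% level; $\sigma^2$ and $\omega^2$ are set to 1 purely for illustration):

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Asymptotic MSEs of the two estimators under Q_n (bias^2 + variance),
# with illustrative values sigma^2 = omega^2 = 1:
kappa, rho = 1.0, 1.0
sigma2, omega2 = 1.0, 1.0
mse_eff = sigma2 + (kappa * rho) ** 2 * omega2   # biased, smaller variance
mse_wtd = sigma2 + omega2                        # unbiased, larger variance
# The two MSEs cross exactly at kappa * rho = 1.

# Power at level 0.05 of the one-sided most powerful test of P_n vs Q_n,
# which sees a N(kappa, 1) shift in the standardised log likelihood ratio:
z95 = 1.6449                  # 95th percentile of N(0, 1)
power = Phi(kappa - z95)      # about 0.26
```

At $\kappa\rho=1$ the two mean squared errors are exactly equal, while the best possible test detects the misspecification barely a quarter of the time.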

That is, the least-favourable misspecification of model M leads to worse mean squared error for $\hat\theta_{\text{eff}}$ than for $\hat\theta_{\text{wtd}}$ before the most powerful test of misspecification is even moderately reliable, even if we (unrealistically) knew exactly the form of the misspecification.

Since the sense in which $\hat\theta_{\text{eff}}$ is optimal is precisely this local asymptotic minimax sense within $P_{\theta,\eta}$, it seems reasonable to use the same description of optimality outside the model. Under this description of optimality, the ‘efficient’ estimator’s optimality is not robust to undetectable model misspecification.