# Semiparametric efficiency and nearly-true models

Suppose you have $$N$$ people with some variables measured, and you choose a subset of $$n$$ to measure additional variables. I’m going to assume the probability $$\pi_i$$ that you measure the additional variables on person $$i$$ is known, so it has to be a setting where non-response isn’t an issue – e.g., choosing which frozen blood samples to analyse, or which free-text questionnaire responses to code, or which medical records to pull for abstraction. As an example, if you have a binary outcome $$Y$$, you might take a case–control sample and measure $$X$$ on everyone with $$Y=1$$ and on the same number of people with $$Y=0$$.
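
As a concrete sketch of this kind of design (a minimal simulation in Python/numpy; the numbers and setup are illustrative, not from the text):

```python
# Sketch of a two-phase case-control design with known sampling
# probabilities: take every case and, in expectation, an equal number
# of controls.
import numpy as np

rng = np.random.default_rng(0)
N = 20000
y = rng.binomial(1, 0.05, size=N)  # phase-1 outcome, observed on everyone

n_cases = y.sum()
# Known-by-design P(sampled | Y): 1 for cases, n_cases/#controls for controls
pi = np.where(y == 1, 1.0, n_cases / (N - n_cases))
R = rng.random(N) < pi             # phase-2 sampling indicator
# The expensive variable X would now be measured only where R is True;
# each sampled person carries the known design weight 1/pi.
```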

Suppose in addition that you want to fit a particular parametric or semiparametric model $${\cal P}_{\theta,\eta}$$ to the data, where $$\theta$$ are parameters of interest and $$\eta$$ are nuisance parameters. For example, you might want to fit a logistic regression model where the coefficients are $$\theta$$ and the density of $$X$$ is $$\eta$$.

There are now two possible semiparametric models for the observed data. Let $$R_i$$ be the indicator that person $$i$$ is sampled. We could have

• Model D: assume only that the sampling probabilities are known, $$\pi_i=E[R_i|\textrm{variables available on everyone}]$$
• Model M: the submodel of D in which the complete data also satisfy $${\cal P}_{\theta,\eta}$$

Typically, estimation under model M will be more efficient. For example, in the case–control setting with a logistic regression model for $$Y|X$$, the efficient estimator under model M is unweighted logistic regression (Prentice & Pyke, 1979), and the efficient estimator under model D is weighted logistic regression with weights $$w_i=1/\pi_i$$.
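
A minimal simulation sketch of this comparison (assuming numpy and statsmodels; the setup is illustrative, and the standard errors these fits report are not design-correct – sandwich or replicate-weight variances would be needed in practice):

```python
# Unweighted vs 1/pi-weighted logistic regression on a case-control
# subsample from a cohort where the logistic model for Y|X holds.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N = 50000
x = rng.normal(size=N)
y = rng.binomial(1, 1 / (1 + np.exp(-(-3 + 0.5 * x))))  # true slope 0.5

n_cases = y.sum()
pi = np.where(y == 1, 1.0, n_cases / (N - n_cases))  # known sampling probs
R = rng.random(N) < pi

X = sm.add_constant(x[R])
unwtd = sm.GLM(y[R], X, family=sm.families.Binomial()).fit()
wtd = sm.GLM(y[R], X, family=sm.families.Binomial(),
             freq_weights=1 / pi[R]).fit()
print("unweighted:", unwtd.params)  # slope near 0.5; intercept shifted
                                    # by log(pi_case / pi_control)
print("weighted:  ", wtd.params)    # both coefficients near the truth
```

Repeating this over many simulated cohorts shows the weighted slope estimate is typically more variable than the unweighted one – the efficiency gap $$\omega^2$$ that appears below.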

I want to consider slight misspecifications, where model M is ‘nearly true’. Gross misspecifications aren’t interesting: if the data don’t look anything like a sample from $${\cal P}_{\theta,\eta}$$, a careful data analyst will notice and pick a different model. However, the difference between the efficient estimators under M and under D is $$O_p(n^{-1/2})$$, so a bias of the same order is enough to outweigh the precision gain. It’s not obvious that we should expect to detect a misspecification of this size, so more precise investigation is needed.

The efficient estimator under model D is an Augmented Inverse Probability Weighted (AIPW) estimator (if you’re a biostatistician) or a calibration estimator (if you’re a survey statistician), and we can get reasonably close to it in practice (Breslow et al., 2009). Write $$\hat\theta_{\textrm{wtd}}$$ for this weighted estimator, and $$\hat\theta_{\textrm{eff}}$$ for the efficient estimator under model M.
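
For concreteness, one standard form of the AIPW estimating equation (generic notation, not specifically Breslow et al’s): with $$U_i(\theta)$$ the complete-data estimating function for person $$i$$, solve

$$\sum_{i=1}^{N}\left[\frac{R_i}{\pi_i}\,U_i(\theta)-\left(\frac{R_i}{\pi_i}-1\right)E\left\{U_i(\theta)\mid\textrm{phase-1 variables}\right\}\right]=0.$$

Calibration estimators achieve the same augmentation implicitly, by adjusting the weights $$1/\pi_i$$ so that weighted phase-1 summaries match their known full-cohort values.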

Models M and D agree when there is complete data, so I will define the true value $$\theta_0$$ of $$\theta$$ as the common limit of $$\hat\theta_{\textrm{eff}}$$ and $$\hat\theta_{\textrm{wtd}}$$ with complete data. Survey statisticians call this the ‘census estimator.’ Biostatisticians call it ‘our next grant proposal’.
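
Concretely, with $$U_i$$ the complete-data estimating function from above, the census estimator solves $$\sum_{i=1}^N U_i(\theta)=0$$ on all $$N$$ people, and $$\theta_0$$ is its large-$$N$$ limit – a quantity that is well defined whether or not model M holds.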

We now need a mathematical characterisation of ‘nearly true’. I will use contiguity. A sequence of distributions $$Q_n$$ is contiguous to a sequence $$P_n$$ if for every sequence of events $$A_n$$, $$P_nA_n\to0$$ implies $$Q_nA_n\to 0$$. The sequences are mutually contiguous if the implication goes both ways. Let $$A_n$$ be the event that a model diagnostic accepts model M with $$n$$ observations, and let $$P_n$$ be a sequence of distributions in model M. If the diagnostic is any use, $$P_nA_n\not\to 0$$, so for a mutually contiguous sequence of distributions $$Q_n$$ in model D but not in model M we also have $$Q_nA_n\not\to 0$$: no diagnostic can reliably reject a contiguous misspecification.
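
The canonical example (standard theory, not specific to this problem) is a $$1/\sqrt{n}$$ local alternative: with $$n$$ iid observations, take $$P_n$$ with $$X_i\sim N(0,1)$$ and $$Q_n$$ with $$X_i\sim N(h/\sqrt{n},1)$$. Then

$$\log\frac{dQ_n}{dP_n}=\frac{h}{\sqrt{n}}\sum_{i=1}^n X_i-\frac{h^2}{2}\stackrel{d}{\to}N\!\left(-\frac{h^2}{2},\,h^2\right)\quad\textrm{under } P_n,$$

and the two sequences are mutually contiguous: no test can separate them with error probabilities tending to zero.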

Now, under M,

$$\sqrt{n}(\hat\theta_{\textrm{eff}}-\theta_0) \stackrel{d}{\to}N(0,\sigma^2)$$

and

$$\sqrt{n}(\hat\theta_{\textrm{wtd}}-\theta_0) \stackrel{d}{\to}N(0,\sigma^2+\omega^2).$$

By the Convolution Theorem, the extra variance in $$\hat\theta_{\textrm{wtd}}$$ under model M is pure noise: asymptotically, $$\hat\theta_{\textrm{wtd}}$$ behaves like $$\hat\theta_{\textrm{eff}}$$ plus an independent zero-mean perturbation, so

$$\sqrt{n}(\hat\theta_{\textrm{eff}}-\hat\theta_{\textrm{wtd}}) \stackrel{d}{\to} N(0,\omega^2).$$

Now, by Le Cam’s Third Lemma, if we switch the data distribution from $$P_n$$ to $$Q_n$$ there is no change in the limiting variance, but a bias appears:

$$\sqrt{n}(\hat\theta_{\textrm{eff}}-\hat\theta_{\textrm{wtd}}) \stackrel{d}{\to} N(-\kappa\rho\omega,\omega^2),$$

where $$\kappa^2$$ is the limiting variance of the log likelihood ratio $$\log dQ_n/dP_n$$ (so $$\kappa$$ governs the power of the most powerful, Neyman–Pearson, test of $$P_n$$ against $$Q_n$$) and $$\rho$$ is the asymptotic correlation between the log likelihood ratio and $$\sqrt{n}(\hat\theta_{\textrm{eff}}-\hat\theta_{\textrm{wtd}})$$, which measures whether the misspecification is in a direction that matters for $$\theta$$ or not.
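
Spelled out, this is the bivariate-normal form of the lemma. Writing $$T_n=\sqrt{n}(\hat\theta_{\textrm{eff}}-\hat\theta_{\textrm{wtd}})$$ and $$\Lambda_n=\log dQ_n/dP_n$$: if, under $$P_n$$,

$$\begin{pmatrix}T_n\\ \Lambda_n\end{pmatrix}\stackrel{d}{\to}N\!\left(\begin{pmatrix}0\\ -\kappa^2/2\end{pmatrix},\begin{pmatrix}\omega^2 & c\\ c & \kappa^2\end{pmatrix}\right),$$

then under $$Q_n$$ we get $$T_n\stackrel{d}{\to}N(c,\omega^2)$$. The asymptotic covariance $$c$$ is what appears as $$-\kappa\rho\omega$$ above.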

Substituting back: under the contiguous misspecified sequence $$Q_n$$,

$$\sqrt{n}(\hat\theta_{\textrm{eff}}-\theta_0) \stackrel{d}{\to}N(-\kappa\rho\omega,\sigma^2)$$

and

$$\sqrt{n}(\hat\theta_{\textrm{wtd}}-\theta_0) \stackrel{d}{\to}N(0,\sigma^2+\omega^2).$$

The asymptotic mean squared errors are $$\sigma^2+\kappa^2\rho^2\omega^2$$ and $$\sigma^2+\omega^2$$ respectively, so $$\hat\theta_{\textrm{wtd}}$$ has the lower mean squared error exactly when $$\kappa^2\rho^2>1$$. If $$\rho\approx 1$$, the crossover happens at $$\kappa\approx 1$$, at which point the most powerful test for $$Q_n$$ vs $$P_n$$ has power about 24%.

That is, the least-favourable misspecification of model M leads to worse mean squared error for $$\hat\theta_{\textrm{eff}}$$ than for $$\hat\theta_{\textrm{wtd}}$$ before the most powerful test of misspecification is even moderately reliable, even if we (unrealistically) knew the exact form of the misspecification in advance.

Since the sense in which $$\hat\theta_{\textrm{eff}}$$ is optimal is precisely this local asymptotic minimax sense within $${\cal P}_{\theta,\eta}$$, it seems reasonable to use the same standard of optimality outside the model. By that standard, the ‘efficient’ estimator’s optimality is not robust to undetectable model misspecification.