
Semiparametric efficiency and nearly-true models

Suppose you have $N$ people with some variables measured, and you choose a subset of $n$ to measure additional variables. I'm going to assume the probability $\pi_i$ that you measure the additional variables on person $i$ is known, so it has to be a setting where non-response isn't an issue – e.g., choosing which frozen blood samples to analyse, or which free-text questionnaire responses to code, or which medical records to pull for abstraction. As an example, if you have a binary outcome $Y$ you might take a case–control sample and measure $X$ on everyone with $Y=1$ and on the same number of people with $Y=0$.
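To fix ideas, here is a minimal NumPy sketch (hypothetical simulated data) of the case–control design, where $\pi_i$ is known by construction: every case is sampled, and controls are sampled without replacement so each control has inclusion probability $n_{\text{cases}}/n_{\text{controls}}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Phase one: outcome Y observed on all N people.
N = 10_000
y = rng.binomial(1, 0.05, size=N)     # rare binary outcome
n_cases = int(y.sum())
controls = np.flatnonzero(y == 0)

# Case-control design: take every case and an equal number of controls.
sampled = np.zeros(N, dtype=bool)
sampled[y == 1] = True
sampled[rng.choice(controls, size=n_cases, replace=False)] = True

# Known sampling probabilities pi_i, a function of phase-one data only.
pi = np.where(y == 1, 1.0, n_cases / controls.size)
```

The design is informative: $\pi_i$ depends on $Y_i$, which is exactly why weighted and unweighted analyses can differ.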

Suppose in addition that you want to fit a particular parametric or semiparametric model $P_{\theta,\eta}$ to the data, where $\theta$ are parameters of interest and $\eta$ are nuisance parameters. For example, you might want to fit a logistic regression model where the coefficients are $\theta$ and the density of $X$ is $\eta$.

There are now two possible semiparametric models for the observed data. Let $R_i$ be the indicator that person $i$ is sampled. We could have

  • Model D: $\pi_i=E[R_i\mid\text{variables available on everyone}]$
  • Model M: the submodel of D that also satisfies $P_{\theta,\eta}$

Typically, estimation under model M will be more efficient. For example, in the case–control setting with a logistic regression model for $Y\mid X$ we know that the efficient estimator under model M is unweighted logistic regression (per Prentice & Pyke, 1979), and that the efficient estimator under model D is weighted logistic regression with weights $w_i=1/\pi_i$.
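As a concrete illustration, here is a sketch on hypothetical simulated data, with a hand-rolled Newton–Raphson fitter standing in for any logistic regression routine: the unweighted fit on the case–control sample recovers the true slope but not the intercept (per Prentice & Pyke), while the $1/\pi_i$-weighted fit recovers both.

```python
import numpy as np

def logistic_fit(X, y, w=None, iters=40):
    """Maximum (weighted) likelihood logistic regression by Newton-Raphson."""
    w = np.ones(len(y)) if w is None else w
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))                  # fitted probabilities
        grad = X.T @ (w * (y - mu))                       # weighted score
        hess = (X * (w * mu * (1 - mu))[:, None]).T @ X   # weighted information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(1)
N = 50_000
x = rng.normal(size=N)
beta_true = np.array([-4.0, 1.0])                 # true intercept and slope
y = rng.binomial(1, 1 / (1 + np.exp(-(beta_true[0] + beta_true[1] * x))))

# Case-control sample: all cases, an equal number of controls.
cases = np.flatnonzero(y == 1)
idx = np.concatenate([cases,
                      rng.choice(np.flatnonzero(y == 0), size=cases.size,
                                 replace=False)])
pi = np.where(y == 1, 1.0, cases.size / (y == 0).sum())

Xs = np.column_stack([np.ones(idx.size), x[idx]])
b_unwtd = logistic_fit(Xs, y[idx])                  # efficient under model M
b_wtd = logistic_fit(Xs, y[idx], w=1 / pi[idx])     # design-weighted, model D
```

Both slopes estimate the true $\beta_1=1$; the unweighted intercept is shifted by the log of the sampling odds.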

I want to consider slight misspecifications, where model M is ‘nearly true’. Gross misspecifications aren’t interesting: if the data don’t look anything like a sample from $P_{\theta,\eta}$, a careful data analyst will notice and pick a different model. However, the difference between the efficient estimators under M and under D is $O_p(n^{-1/2})$, so a bias of the same order is enough to outweigh the precision gain. It’s not obvious that we should expect to detect a misspecification of this size, so more precise investigation is needed.

The efficient estimator under D is an Augmented Inverse Probability Weighted (AIPW) estimator (if you’re a biostatistician) or a calibration estimator (if you’re a survey statistician), and we can get reasonably close to it (Breslow et al., 2009). Write $\hat\theta_{\text{wtd}}$ for this estimator, and $\hat\theta_{\text{eff}}$ for the efficient estimator under M.

Models M and D agree when there is complete data, so I will define the true value $\theta_0$ of $\theta$ as the common limit of $\hat\theta_{\text{eff}}$ and $\hat\theta_{\text{wtd}}$ with complete data. Survey statisticians call this the ‘census estimator.’ Biostatisticians call it ‘our next grant proposal’.

We now need a mathematical characterisation of ‘nearly true’. I will use contiguity. A sequence of distributions $Q_n$ is contiguous to a sequence $P_n$ if for every event $A$, $P_nA\to0$ implies $Q_nA\to0$. They are mutually contiguous if the implication goes both ways. Let $A$ be the event that a model diagnostic accepts model M, and let $P_n$ be a sequence of distributions in model M. If this is a useful diagnostic, $P_nA\to1$. Applying contiguity to the complement of $A$: for a mutually contiguous sequence of distributions $Q_n$ in model D but not in model M, $Q_nA\to1$ as well, so the diagnostic asymptotically accepts the misspecified model.

Now, under M, $\sqrt{n}(\hat\theta_{\text{eff}}-\theta_0)\stackrel{d}{\to}N(0,\sigma^2)$ and $\sqrt{n}(\hat\theta_{\text{wtd}}-\theta_0)\stackrel{d}{\to}N(0,\sigma^2+\omega^2)$.

By the Convolution Theorem, the extra variance for $\hat\theta_{\text{wtd}}$ under model M is pure noise, so $\sqrt{n}(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}})\stackrel{d}{\to}N(0,\omega^2)$.

Now, by Le Cam’s Third Lemma, if we switch from $P_n$ to $Q_n$ as the data distribution there is no change in variance, but there is a bias: $\sqrt{n}(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}})\stackrel{d}{\to}N(\kappa\rho\omega,\omega^2)$, where $\kappa^2$ is the limiting variance of the log likelihood ratio $\log dQ_n/dP_n$, which governs the power of the most powerful (Neyman–Pearson) test, and $\rho$ is the asymptotic correlation between $\sqrt{n}(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}})$ and the log likelihood ratio, measuring whether the misspecification is in a direction that matters for $\theta$ or not.
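For reference, the step behind the $\kappa\rho\omega$ mean shift is the standard third-lemma computation, written here under the joint-normality assumption of the LAN setting:

```latex
% Joint limit under P_n:
\begin{pmatrix}
  \sqrt{n}\,(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}}) \\
  \log dQ_n/dP_n
\end{pmatrix}
\stackrel{d}{\to}
N\!\left(
  \begin{pmatrix} 0 \\ -\kappa^2/2 \end{pmatrix},
  \begin{pmatrix} \omega^2 & \rho\kappa\omega \\ \rho\kappa\omega & \kappa^2 \end{pmatrix}
\right)
% Le Cam's third lemma: under Q_n the covariance reappears as a mean shift,
% \sqrt{n}\,(\hat\theta_{\text{eff}}-\hat\theta_{\text{wtd}}) \stackrel{d}{\to} N(\rho\kappa\omega,\ \omega^2).
```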

Substituting back: under the contiguous misspecified model sequence $Q_n$, $\sqrt{n}(\hat\theta_{\text{eff}}-\theta_0)\stackrel{d}{\to}N(\kappa\rho\omega,\sigma^2)$ and $\sqrt{n}(\hat\theta_{\text{wtd}}-\theta_0)\stackrel{d}{\to}N(0,\sigma^2+\omega^2)$. So, the mean squared error of $\hat\theta_{\text{wtd}}$ is lower if $\kappa^2\rho^2>1$. If $\rho\approx1$, this happens when $\kappa\approx1$, at which point the most powerful test for $Q_n$ vs $P_n$ has power of only about 26% at the 5% level.
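A quick numeric check of the crossover (a sketch assuming a one-sided test at the 5% level; $\sigma^2$ and $\omega^2$ are set to 1 purely for illustration):

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Asymptotic MSEs of the two estimators under Q_n (bias^2 + variance),
# with illustrative values sigma^2 = omega^2 = 1:
kappa, rho = 1.0, 1.0
sigma2, omega2 = 1.0, 1.0
mse_eff = sigma2 + (kappa * rho) ** 2 * omega2   # biased, smaller variance
mse_wtd = sigma2 + omega2                        # unbiased, larger variance
# The two MSEs cross exactly at kappa * rho = 1.

# Power at level 0.05 of the one-sided most powerful test of P_n vs Q_n,
# which sees a N(kappa, 1) shift in the standardised log likelihood ratio:
z95 = 1.6449                  # 95th percentile of N(0, 1)
power = Phi(kappa - z95)      # about 0.26
```

At $\kappa\rho=1$ the two mean squared errors are exactly equal, while the best possible test detects the misspecification barely a quarter of the time.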

That is, the least-favourable misspecification of model M leads to worse mean squared error for $\hat\theta_{\text{eff}}$ than for $\hat\theta_{\text{wtd}}$ before the most powerful test of misspecification is even moderately reliable, even if we (unrealistically) knew exactly the form of the misspecification.

Since the sense in which $\hat\theta_{\text{eff}}$ is optimal is precisely this local asymptotic minimax sense within $P_{\theta,\eta}$, it seems reasonable to use the same description of optimality outside the model. Under this description of optimality, the ‘efficient’ estimator’s optimality is not robust to undetectable model misspecification.