
Two quick survey items

Can we invent the case-control design?

Classical survey analysis is about means and totals, and the way to adapt it to more interesting parameters is to write the parameter as the mean of its influence functions (delta-betas, jackknife values, etc.).

Suppose we knew for everyone in a population (maybe an HMO) whether they had a disease ($Y=1$) or didn’t ($Y=0$), and we wanted to take a sample, measure a variable $X$, and do logistic regression. What sampling probabilities should we use?

The optimal stratified sampling design for estimating a total is ‘Neyman allocation’, where the number of people we sample in each stratum is proportional to the size of the stratum in the population and to the standard deviation of the variable.

In our case the variable is the influence function for a logistic regression coefficient $\beta$, which is proportional to the score function, which is $U_i=X_i(Y_i-p_i)$ where $p_i$ is the fitted probability for person $i$.

Let’s assume we have a rare disease ($E[Y]=p_0$) and modest covariate effects, so $p_i\ll 1$ for all $i$. In the case stratum, $U_i=X_i(1-p_i)$, so $\mathrm{var}[U_i|Y=1]\approx\mathrm{var}[X_i|Y_i=1]\approx\mathrm{var}[X]$, where the last approximate equality is exact if $\beta=0$ or if $X$ is Normal and is pretty good otherwise.

In the control stratum $U_i=-X_ip_i$, so $\mathrm{var}[U_i|Y=0]\approx p_0^2\,\mathrm{var}[X]$. This approximation isn’t as good as the case one, since $p_i$ could vary quite a bit while $1-p_i$ stays roughly constant: typically the control variance will be a bit bigger.
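The two stratum approximations can be checked numerically. The sketch below is only an illustration under assumptions that are not in the text: $X$ is standard Normal, the logistic model is exactly true, and the true probability stands in for the fitted $p_i$. The function name `score_variances` and the intercept and slope values are hypothetical. It integrates over a grid rather than simulating, so the result is deterministic.

```python
import math

def score_variances(beta0, beta1, half_width=8.0, steps=4001):
    """Return (p0, var[U|Y=1], var[U|Y=0]) for U = X (Y - p), X ~ N(0, 1).

    Assumes logit(p) = beta0 + beta1 * x and uses the true p in place of
    the fitted probability. Moments are computed by a Riemann sum on a grid.
    """
    dx = 2 * half_width / (steps - 1)
    case = [0.0, 0.0, 0.0]      # total weight, sum of U, sum of U^2 among cases
    control = [0.0, 0.0, 0.0]   # the same three sums among controls
    for k in range(steps):
        x = -half_width + k * dx
        density = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi) * dx
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        # Cases have Y = 1, so U = x (1 - p); controls have Y = 0, so U = -x p.
        for acc, weight, u in ((case, density * p, x * (1 - p)),
                               (control, density * (1 - p), -x * p)):
            acc[0] += weight
            acc[1] += weight * u
            acc[2] += weight * u * u

    def variance(acc):
        mean = acc[1] / acc[0]
        return acc[2] / acc[0] - mean * mean

    p0 = case[0] / (case[0] + control[0])
    return p0, variance(case), variance(control)

for beta1 in (0.0, 0.15, 0.3):
    p0, v_case, v_control = score_variances(-5.0, beta1)
    # Second column should be near var[X] = 1; third is var[U|Y=0] / p0^2.
    print(beta1, v_case, v_control / p0 ** 2)
```

With `beta1 = 0` the control ratio is 1, as the approximation says; as `beta1` grows the ratio rises above 1, which is the sense in which the control approximation is the weaker of the two.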

Neyman allocation says we need to take the population stratum sizes $N_h$ and the population stratum standard deviations $S_h$ and compute $N_hS_h$ for each stratum $h$. Under our approximations, these come to $Np_0\sqrt{\mathrm{var}[X]}$ for cases and $N(1-p_0)\sqrt{p_0^2\,\mathrm{var}[X]}$ for controls, which are about equal. We should take the same number of cases and controls when covariate effects are small; we should probably take a few more cases when covariate effects are large.
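To make the arithmetic concrete, here is a minimal sketch of the allocation under the approximations above. The population size, prevalence, and standard deviation of $X$ are made-up illustrative numbers, and `neyman_shares` is a hypothetical helper, not a function from any package.

```python
def neyman_shares(sizes, sds):
    """Share of the sample for each stratum, proportional to N_h * S_h."""
    products = [size * sd for size, sd in zip(sizes, sds)]
    total = sum(products)
    return [product / total for product in products]

# Illustrative values only (assumptions, not from the text).
N = 100_000      # population size
p0 = 0.01        # disease prevalence
sd_x = 1.0       # standard deviation of X

sizes = [N * p0, N * (1 - p0)]   # cases, controls
sds = [sd_x, p0 * sd_x]          # approximate SD of U in each stratum
case_share, control_share = neyman_shares(sizes, sds)
print(case_share, control_share)
```

The two products are $N p_0$ and $N(1-p_0)p_0$ times the standard deviation of $X$, so the shares differ from one half only by a term of order $p_0$.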

Note that this is for the design-weighted logistic regression estimator, but it’s pretty insensitive to how efficient this weighted estimator is (which ranges from fully efficient to horribly inefficient depending on $\beta$ and the distribution of $X$).

Variances in two-phase designs

This is an explanation of the internals of twophase2.R in the survey package.

In a two-phase sample you take a sample, then take a sample from it. Two-phase sampling generalises two-stage sampling in that the sampling probabilities for the second phase are allowed to depend on data observed at the first phase.

The sampling probability $\pi^*_i$ for unit $i$ is the product of the probability of sampling unit $i$ at phase one ($\pi_{i,1}$) and the probability of sampling unit $i$ at phase two, conditional on the whole phase-one sample we took ($\pi_{i,2|1}$). This is not the marginal probability of sampling unit $i$, as in the Horvitz-Thompson estimator. The marginal probability $\pi_i$ would be the average of $\pi^*_i$ over all phase-1 samples that include unit $i$, which in your case you have not got. Fortunately, you can use $\pi^*_i$ just like you’d use $\pi_i$. (An interesting question: if you did have $\pi_i$ would it be better or worse to use it instead?)
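A small sketch of how the two-phase probability depends on what was seen at phase one. The design and all counts are hypothetical: a simple random sample at phase one, then all observed cases and an equal number of observed controls at phase two. The name `pi_star` is just a label for the product described above.

```python
def two_phase_probability(pi_phase1, pi_phase2_given_1):
    """Product of the phase-one and conditional phase-two probabilities."""
    return pi_phase1 * pi_phase2_given_1

# Hypothetical design: simple random sample of n1 from N at phase one.
N, n1 = 1000, 100
pi_phase1 = n1 / N

# Counts observed in this particular phase-one sample (made up).
cases_seen, controls_seen = 12, 88

# Phase two: keep every case, and subsample as many controls as cases.
pi_star_case = two_phase_probability(pi_phase1, 1.0)
pi_star_control = two_phase_probability(pi_phase1, cases_seen / controls_seen)
print(pi_star_case, pi_star_control)
```

A different phase-one sample would have given different counts, and so a different conditional probability for controls, which is why the product is not the marginal sampling probability.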

In particular, we can use the same form of variance estimator as for the Horvitz-Thompson estimator, which has the deceptively compact form $\widehat{\mathrm{var}}[\hat T_X]=\sum_{i,j}\check\Delta_{ij}\check X_i\check X_j$. Here, $\Delta_{ij}$ is the covariance of the sampling indicators for units $i$ and $j$, and the hacek/caron accent indicates weighting. That is, $\check X_i=X_i/\pi_i$ and $\check\Delta_{ij}=\Delta_{ij}/\pi_{ij}$, where $\pi_{ij}$ is the pairwise inclusion version of $\pi_i$. Strictly speaking, it’s $\Delta^*_{ij}=\pi^*_{ij}-\pi^*_i\pi^*_j$ but that’s too many stars to bother writing.
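As an illustration of the double sum, here is a minimal sketch for the simplest case: a simple random sample without replacement, with the weighted covariances built directly from the definitions above. The function names and the tiny data set are hypothetical, and this is not how the survey package organises the computation.

```python
def delta_check_srs(n, N):
    """Weighted covariance matrix for a simple random sample of n from N.

    Built from the definition Delta_ij / pi_ij, with
    Delta_ij = pi_ij - pi_i * pi_j and pi_ii = pi_i.
    """
    pi = n / N
    pi_pair = n * (n - 1) / (N * (N - 1))
    off_diagonal = (pi_pair - pi * pi) / pi_pair
    diagonal = (pi - pi * pi) / pi
    return [[diagonal if i == j else off_diagonal for j in range(n)]
            for i in range(n)]

def ht_variance(x, pi, delta_check):
    """Sum over all pairs (i, j) of delta_check[i][j] * (x_i/pi_i) * (x_j/pi_j)."""
    x_check = [value / prob for value, prob in zip(x, pi)]
    size = len(x_check)
    return sum(delta_check[i][j] * x_check[i] * x_check[j]
               for i in range(size) for j in range(size))

# Made-up sample: n = 3 observations from a population of N = 6.
x = [1.0, 2.0, 3.0]
n, N = 3, 6
estimate = ht_variance(x, [n / N] * n, delta_check_srs(n, N))

# Textbook formula for the same design: N^2 (1 - n/N) s^2 / n.
mean_x = sum(x) / n
s2 = sum((value - mean_x) ** 2 for value in x) / (n - 1)
textbook = N ** 2 * (1 - n / N) * s2 / n
print(estimate, textbook)   # both are 6 up to floating-point rounding
```

The agreement with the textbook formula is a useful sanity check that the compact form really is the familiar estimator in disguise.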

The advantage of this form is that $\check\Delta$ composes nicely over stages and phases of sampling. If you have one stage or phase of sampling with $\check\Delta_1$ and another with $\check\Delta_2$, the overall weighted covariance is $\check\Delta=\check\Delta_1+\check\Delta_2-\check\Delta_1\circ\check\Delta_2$. (The $\circ$ means this is the element-wise (Hadamard) product, not the matrix product.)
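The composition rule can be checked on a case where the answer is known: a simple random subsample of a simple random sample is itself a simple random sample from the population. The sketch below uses made-up sizes and hypothetical helper names, restricts every matrix to the units that survive both phases, and compares the composed matrix with the one computed directly.

```python
def srs_delta_check(n, N, m):
    """m-by-m weighted covariance matrix for m units that were all
    selected in a simple random sample of n from N."""
    pi = n / N
    pi_pair = n * (n - 1) / (N * (N - 1))
    off_diagonal = 1 - pi * pi / pi_pair   # (pi_ij - pi_i pi_j) / pi_ij
    diagonal = 1 - pi                      # pi_ii = pi_i
    return [[diagonal if i == j else off_diagonal for j in range(m)]
            for i in range(m)]

def compose(first, second):
    """Element-wise first + second - first * second."""
    return [[a + b - a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]

# Made-up sizes: 10 of 20 at phase one, then 4 of those 10 at phase two.
N, n1, n2 = 20, 10, 4
phase1 = srs_delta_check(n1, N, n2)
phase2 = srs_delta_check(n2, n1, n2)
combined = compose(phase1, phase2)

# Direct computation: a simple random sample of 4 from 20.
direct = srs_delta_check(n2, N, n2)
max_gap = max(abs(c - d)
              for row_c, row_d in zip(combined, direct)
              for c, d in zip(row_c, row_d))
print(max_gap)   # zero up to floating-point rounding
```

The rule works because $1-\check\Delta_{ij}=\pi_i\pi_j/\pi_{ij}$, and each of those probabilities multiplies across phases.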

We still need to know what $\check\Delta_{ij}$ is. If we have simple random sampling (potentially of clusters, within strata) of $n$ units out of $N$, then for $i\neq j$ we get $\check\Delta_{ij}=-(1-n/N)/(n-1)=-(1-\pi_i)/(n-1)$. The last form is especially useful, because when we’re working with a particular stage of sampling we’re going to have the probabilities at that stage and the sample size at that stage conveniently available, but we might not have $N$ and $\pi_{ij}$ conveniently available.
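For completeness, here is a sketch of where the simple-random-sampling value comes from, starting from the definition of the weighted covariance as $\Delta_{ij}/\pi_{ij}$ with $\Delta_{ij}=\pi_{ij}-\pi_i\pi_j$. It assumes sampling without replacement of $n$ out of $N$ units in a single stratum; the diagonal case uses $\pi_{ii}=\pi_i$.

```latex
\begin{aligned}
\pi_i &= \frac{n}{N}, \qquad
\pi_{ij} = \frac{n(n-1)}{N(N-1)} \quad (i \neq j),\\[4pt]
\check\Delta_{ij}
  &= \frac{\pi_{ij}-\pi_i\pi_j}{\pi_{ij}}
   = 1 - \frac{n(N-1)}{N(n-1)}
   = \frac{N(n-1)-n(N-1)}{N(n-1)}
   = -\frac{N-n}{N(n-1)}
   = -\frac{1-n/N}{n-1}
   = -\frac{1-\pi_i}{n-1},\\[4pt]
\check\Delta_{ii}
  &= \frac{\pi_i-\pi_i^2}{\pi_i} = 1-\pi_i .
\end{aligned}
```

The off-diagonal entries are negative because, with a fixed sample size, including one unit makes it slightly less likely that any other given unit is included.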