Can we invent the case-control design?

Classical survey analysis is about means and totals, and the way to adapt it to more interesting parameters is to write the parameter as the mean of its influence functions (delta-betas, jackknife values, etc.).

Suppose we knew, for everyone in a population (maybe an HMO), whether they had a disease ($Y = 1$) or didn’t ($Y = 0$), and we wanted to take a sample, measure a variable $X$, and do logistic regression. What sampling probabilities should we use?

The optimal stratified sampling design for estimating a total is ‘Neyman allocation’, where the number of people we sample in each stratum is proportional to the size of the stratum in the population and to the standard deviation of the variable within that stratum.

In our case the variable is the influence function for a logistic regression coefficient $β$ , which is proportional to the score function, which is $U_{i} = X_{i} (Y_{i} - p_{i})$ where $p_{i}$ is the fitted probability for person $i$ .

Let’s assume we have a rare disease ($E[Y] = p_{0}$) and modest covariate effects, so $p_{i} ≪ 1$ for all $i$. In the case stratum, $U_{i} = X_{i} (1 - p_{i})$, so $\operatorname{var}[U_{i} \mid Y_{i} = 1] \approx \operatorname{var}[X_{i} \mid Y_{i} = 1] \approx \operatorname{var}[X]$, where the last approximate equality is exact if $β = 0$ or if $X$ is Normal and is pretty good otherwise.

In the control stratum, $U_{i} = -X_{i} p_{i}$, so $\operatorname{var}[U_{i} \mid Y_{i} = 0] \approx p_{0}^{2} \operatorname{var}[X].$ This approximation isn’t as good as the case one, since $p_{i}$ could vary quite a bit while $1 - p_{i}$ stays roughly constant: typically the control variance will be a bit bigger.
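The two stratum approximations can be checked with a small simulation. This is only a sketch under assumptions that are not in the text above: $X$ is standard Normal, the intercept and slope (`alpha`, `beta`) are arbitrary illustrative values, and the true $p_{i}$ stands in for the fitted probability.

```python
import math
import random
import statistics


def stratum_score_ratios(alpha=-5.0, beta=0.3, n_pop=200_000, seed=1):
    """Simulate a population and compare score variances with the approximations.

    Returns (case_ratio, control_ratio), where
      case_ratio    = var[U | Y=1] / var[X]
      control_ratio = var[U | Y=0] / (p0^2 var[X]).
    Assumption: the true p_i is used in place of the fitted probability.
    """
    rng = random.Random(seed)
    xs, case_u, control_u = [], [], []
    for _ in range(n_pop):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(alpha + beta * x)))
        y = 1 if rng.random() < p else 0
        u = x * (y - p)  # score contribution U_i = X_i (Y_i - p_i)
        xs.append(x)
        (case_u if y == 1 else control_u).append(u)

    p0 = len(case_u) / n_pop
    var_x = statistics.variance(xs)
    case_ratio = statistics.variance(case_u) / var_x
    control_ratio = statistics.variance(control_u) / (p0 ** 2 * var_x)
    return case_ratio, control_ratio


case_ratio, control_ratio = stratum_score_ratios()
print(f"cases:    var[U|Y=1] / var[X]        = {case_ratio:.2f}")
print(f"controls: var[U|Y=0] / (p0^2 var[X]) = {control_ratio:.2f}")
```

With these settings the case ratio should sit close to 1 and the control ratio somewhat above 1, in line with the remark that the control variance is typically a bit bigger than the approximation.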

Neyman allocation says we need to take the population stratum sizes $N_{h}$ and the population stratum standard deviations $S_{h}$ and compute $N_{h} S_{h}$ for each stratum $h$. Under our approximations, these come to $N p_{0} \sqrt{\operatorname{var}[X]}$ for cases and $N (1 - p_{0}) \sqrt{p_{0}^{2} \operatorname{var}[X]}$ for controls, which are about equal. We should take the same number of cases and controls when covariate effects are small; we should probably take a few more cases when covariate effects are large.
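As a quick arithmetic check, here is a minimal sketch of the allocation under the approximations above. The population size, prevalence, variance of $X$, and total sample size are made-up illustrative numbers, and `neyman_allocation` is a helper written for this example, not a function from any package.

```python
import math


def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Split n_total across strata in proportion to N_h * S_h."""
    products = [n_h * s_h for n_h, s_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n_total * prod / total for prod in products]


# Illustrative (assumed) values: 1% prevalence, var[X] = 1.
N = 100_000
p0 = 0.01
var_x = 1.0

sizes = [N * p0, N * (1 - p0)]                          # cases, controls
sds = [math.sqrt(var_x), math.sqrt(p0 ** 2 * var_x)]    # approximate S_h

n_cases, n_controls = neyman_allocation(sizes, sds, n_total=1000)
print(f"cases: {n_cases:.1f}, controls: {n_controls:.1f}")
```

The two products $N_{h} S_{h}$ differ only by the factor $1 - p_{0}$, so the split is very nearly half and half.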

Note that this is for the design-weighted logistic regression estimator, but it’s pretty insensitive to how efficient this weighted estimator is (which ranges from fully efficient to horribly inefficient depending on $β$ and the distribution of $X$).

Variances in two-phase designs

This is an explanation of the internals of twophase2.R in the survey package.

In a two-phase sample you take a sample, then take a sample from it. Two-phase sampling generalises two-stage sampling in that the sampling probabilities for the second phase are allowed to depend on data observed at the first phase.

The sampling probability $π_{i}^{*}$ for unit $i$ is the probability of sampling unit $i$ at phase one ($π_{i, 1}$) multiplied by the probability of sampling unit $i$ at phase two, conditional on the whole phase-one sample we took ($π_{i, 2 | 1}$). This is not the marginal probability of sampling unit $i$, as in the Horvitz-Thompson estimator. The marginal probability $π_{i}$ would be the average of $π_{i}^{*}$ over all phase-1 samples that include unit $i$, which in your case you have not got. Fortunately, you can use $π_{i}^{*}$ just like you’d use $π_{i}$. (An interesting question: if you did have $π_{i}$, would it be better or worse to use it instead?)

In particular, we can use the same form of variance estimator as for the Horvitz-Thompson estimator, which has the deceptively compact form $\widehat{\operatorname{var}}[\hat{T}_{X}] = \sum_{i, j} \check{Δ}_{ij} \check{X}_{i} \check{X}_{j}.$ Here, $Δ_{ij}$ is the covariance of the sampling indicators for units $i$ and $j$, and the hacek/caron accent indicates weighting. That is, $\check{X}_{i} = X_{i} / π_{i}^{*}$ and $\check{Δ}_{ij} = Δ_{ij} / π_{ij}^{*}$, where $π_{ij}^{*}$ is the pairwise inclusion version of $π_{i}^{*}$. Strictly speaking, it’s $Δ_{ij}^{*} = π_{ij}^{*} - π_{i}^{*} π_{j}^{*}$, but that’s too many stars to bother writing.

The advantage of this form is that $\check{Δ}$ composes nicely over stages and phases of sampling. If you have one stage or phase of sampling with $\check{Δ}_{1}$ and another with $\check{Δ}_{2}$, the overall weighted covariance is $\check{Δ} = \check{Δ}_{1} + \check{Δ}_{2} - \check{Δ}_{1} \cdot \check{Δ}_{2}.$ (The $\cdot$ means the element-wise (Hadamard) product, not the matrix product.)
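One way to see why the composition rule holds: the weighted covariance can be written as one minus $π_{i} π_{j} / π_{ij}$, and when the second-phase probabilities are conditional on the first phase, both the single and the pairwise probabilities multiply, so the quantity $1 - \check{Δ}$ multiplies too. The sketch below checks this for one pair of units. The probabilities are arbitrary illustrative numbers, and the helper names are made up for this example.

```python
import math


def weighted_covariance(pi_i, pi_j, pi_ij):
    """Weighted covariance (pi_ij - pi_i * pi_j) / pi_ij for one pair of units."""
    return (pi_ij - pi_i * pi_j) / pi_ij


def compose(d1, d2):
    """Combine two phases: d1 + d2 - d1 * d2 (element-wise for a single pair)."""
    return d1 + d2 - d1 * d2


# (pi_i, pi_j, pi_ij) at each phase; phase-two values are conditional on phase one.
phase1 = (0.2, 0.5, 0.08)
phase2 = (0.4, 0.3, 0.10)

d1 = weighted_covariance(*phase1)
d2 = weighted_covariance(*phase2)
composed = compose(d1, d2)

# Direct calculation from the product probabilities.
overall = tuple(a * b for a, b in zip(phase1, phase2))
direct = weighted_covariance(*overall)

print(composed, direct)
```

Both routes give the same value, which is what lets the calculation be done one stage or phase at a time.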

We still need to know what $\check{Δ}_{ij}$ is. If we have simple random sampling (potentially of clusters, within strata) of $n$ units out of $N$, then for $i \neq j$, $\check{Δ}_{ij} = -(1 - n / N) / (n - 1) = -(1 - π_{i}) / (n - 1)$, and on the diagonal $\check{Δ}_{ii} = 1 - π_{i}$. The last off-diagonal form is especially useful, because when we’re working with a particular stage of sampling we’re going to have the probabilities at that stage and the sample size at that stage conveniently available, but we might not have $N$ and $π_{ij}$ conveniently available.
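As a sanity check on the simple random sampling case, the sketch below builds the weighted covariance matrix and plugs it into the double-sum variance estimator, then compares the result with the textbook formula $N^{2} (1 - n/N) s^{2} / n$ for the estimated total. It is plain Python written for illustration, not the code in the survey package; the diagonal entries use the standard value $1 - π_{i}$, and the data are simulated.

```python
import math
import random
import statistics


def srs_weighted_covariance(n, N):
    """n x n matrix: 1 - n/N on the diagonal, -(1 - n/N)/(n - 1) elsewhere."""
    diag = 1 - n / N
    off = -(1 - n / N) / (n - 1)
    return [[diag if i == j else off for j in range(n)] for i in range(n)]


def ht_variance(x, pi, delta_check):
    """Double sum over i, j of delta_check[i][j] * (x_i / pi_i) * (x_j / pi_j)."""
    x_check = [xi / p for xi, p in zip(x, pi)]
    n = len(x)
    return sum(
        delta_check[i][j] * x_check[i] * x_check[j]
        for i in range(n)
        for j in range(n)
    )


rng = random.Random(42)
N, n = 500, 20                      # illustrative population and sample sizes
x = [rng.gauss(10, 3) for _ in range(n)]

v_matrix = ht_variance(x, [n / N] * n, srs_weighted_covariance(n, N))
v_textbook = N ** 2 * (1 - n / N) * statistics.variance(x) / n

print(v_matrix, v_textbook)
```

The two numbers agree up to floating-point error, since the double sum collapses algebraically to the usual finite-population-corrected formula.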
