Stage vs phase

When two-phase study designs started being used in epidemiology and biostatistics, there was a period of conflict. Survey statisticians insisted on the term “two-phase”, and biostatisticians (following survey textbooks in some cases) wanted to call these “two-stage” designs. Like the correct pronunciation of ‘Scheveningen’¹, the terminology identified communities.

In a $K$-stage survey design we have sampling units (clusters) at stage 1, smaller ones at stage 2, and so on. You can compute a probability $\pi_i = \pi_{i,1} \times \pi_{i,2|1} \times \cdots \times \pi_{i,K|K-1}$, where $\pi_{i,1}$ is the probability that unit $i$ is sampled at stage 1, $\pi_{i,2|1}$ is the probability that unit $i$ is sampled at stage 2 given that it is sampled at stage 1, and so on. The probabilities are all known constants and $\pi_i$ is the marginal probability that unit $i$ is sampled.
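As a small illustration of the product formula, here is a sketch for a two-stage design with simple random sampling at both stages. The function name, the schools-and-students setting, and the sample sizes are all hypothetical; the only point is that every factor is a constant fixed by the design.

```python
from fractions import Fraction

def two_stage_prob(clusters_sampled, clusters_total, units_sampled, cluster_size):
    """pi_i = pi_{i,1} * pi_{i,2|1} for SRS of clusters, then SRS of units within a cluster."""
    pi_1 = Fraction(clusters_sampled, clusters_total)   # stage 1: cluster is selected
    pi_2_given_1 = Fraction(units_sampled, cluster_size)  # stage 2: unit selected within its cluster
    return pi_1 * pi_2_given_1

# Hypothetical design: 10 of 50 schools, then 5 students from each sampled school.
pi_small = two_stage_prob(10, 50, 5, 20)    # student in a school of 20
pi_large = two_stage_prob(10, 50, 5, 100)   # student in a school of 100
print(pi_small, pi_large)  # prints: 1/20 1/100
```

Nothing here depends on which schools or students actually end up in the sample, which is what makes these marginal probabilities.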

In a $K$-phase survey design we have sampling units (clusters) at phase 1, other units at phase 2, and so on. You can compute a number $\pi^*_i = \pi_{i,1} \times \pi^*_{i,2|1} \times \cdots \times \pi^*_{i,K|K-1}$, where $\pi_{i,1}$ is the probability that unit $i$ is sampled at phase 1, $\pi^*_{i,2|1}$ is the probability that unit $i$ is sampled at phase 2 given the phase-1 data, and so on. The probabilities may depend on the entire data for the previous phases and so are random variables, so $\pi^*_i$ is (in general) not the marginal probability that unit $i$ is sampled.

It’s easy to see that multistage sampling is a special case of multiphase sampling; it’s what you get if you use only the information that unit $i$ was sampled in phase $k-1$ in defining $\pi^*_{i,k|k-1}$. The simplest application of two-phase sampling that isn’t two-stage is when you want to stratify on variables that aren’t available for the whole population. You can measure those variables at phase 1 and then stratify the sampling of phase 2 on them. That’s how two-phase sampling is typically used in health research.
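To make the stratified two-phase case concrete, here is a minimal simulation sketch. It assumes NumPy, and the population, the sample sizes, and the function `two_phase_sample` are all made up for illustration. Phase 1 is a simple random sample; phase 2 takes up to `m` units from each stratum of a binary variable `x` that is only observed at phase 1. Because the stratum sizes in the phase-1 sample change from sample to sample, the starred probability for a fixed unit changes too, and its marginal sampling probability can only be approximated here by averaging over repeated samples.

```python
import numpy as np

def two_phase_sample(x, n1, m, rng):
    """One two-phase sample: SRS of n1 units, then up to m units per stratum of x.

    Returns (R, pi_star) for the whole population; pi_star is NaN for units
    outside phase 1, where it is not defined.
    """
    N = len(x)
    phase1 = rng.choice(N, size=n1, replace=False)   # phase 1: simple random sample
    R = np.zeros(N, dtype=bool)
    pi_star = np.full(N, np.nan)
    for stratum in np.unique(x[phase1]):             # strata are only known after phase 1
        members = phase1[x[phase1] == stratum]
        take = min(m, len(members))
        # pi*_i = pi_{i,1} * pi*_{i,2|1}; the second factor depends on the phase-1 data
        pi_star[members] = (n1 / N) * (take / len(members))
        R[rng.choice(members, size=take, replace=False)] = True
    return R, pi_star

rng = np.random.default_rng(1)
x = (np.arange(1000) < 100).astype(int)   # hypothetical population: 100 units with x = 1
reps = 2000
pi_star_unit0 = []                        # pi* for one fixed unit, whenever it reaches phase 1
times_sampled = np.zeros(len(x))
for _ in range(reps):
    R, pi_star = two_phase_sample(x, n1=200, m=20, rng=rng)
    times_sampled += R
    if not np.isnan(pi_star[0]):
        pi_star_unit0.append(pi_star[0])

print("distinct pi* values seen for unit 0:", len(set(pi_star_unit0)))
print("Monte Carlo estimate of marginal pi for unit 0:", times_sampled[0] / reps)
```

The first printed number being larger than 1 is the whole point: the starred probability is a random variable, while the marginal probability is a single constant that this design gives no simple formula for.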

In some ways the distinction doesn’t matter. Suppose we write $R_i$ for the indicator that unit $i$ is sampled. The key property of multiphase sampling is that $E[R_i/\pi^*_i]=1$, just as $E[R_i/\pi_i]=1$ for multistage sampling. The computational formulas for multiphase sampling are conceptually quite different from those for multistage sampling, but practically very similar: you get them by simply putting *s on all the probabilities.
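For two phases, the expectation property can be sketched with iterated expectations. The notation is introduced just for this sketch: $R_{i,1}$ and $R_{i,2}$ are the phase-1 and phase-2 sampling indicators, so that $R_i = R_{i,1}R_{i,2}$, and $D_1$ is the phase-1 data (including which units are in phase 1). The sketch assumes $\pi_{i,1}>0$ and $\pi^*_{i,2|1} = P(R_{i,2}=1 \mid D_1) > 0$ for every unit in phase 1.

```latex
\begin{align*}
E\!\left[\frac{R_i}{\pi^*_i}\right]
  &= E\!\left[\frac{R_{i,1}}{\pi_{i,1}}\,
       E\!\left[\frac{R_{i,2}}{\pi^*_{i,2|1}} \,\middle|\, D_1\right]\right] \\
  &= E\!\left[\frac{R_{i,1}}{\pi_{i,1}}\cdot
       \frac{\pi^*_{i,2|1}}{\pi^*_{i,2|1}}\right]
   = E\!\left[\frac{R_{i,1}}{\pi_{i,1}}\right]
   = 1 .
\end{align*}
```

The first step works because $R_{i,1}$ and $\pi^*_{i,2|1}$ are both functions of $D_1$, so they can be treated as fixed inside the inner expectation. With more phases the same argument is applied one phase at a time, from the last phase back to the first.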

This does raise one modestly interesting question: if $\pi_i$ and $\pi^*_i$ are different, can we say anything about which one is better? This is a theoretical question: in practice you usually can’t compute $\pi_i$ because it involves averaging over all possible samples at intermediate phases. It’s still an interesting question. You could argue that using $\pi^*$ was better because handwaving about conditioning, or you could argue that using $\pi$ was better because handwaving about random variation. The answer doesn’t seem to be known.


  1. /ˈsxeːvənɪŋə/, not /ˈʃeːvənɪŋən/. Yes, you would have been shot as a spy.