Changing strata mid-stream

The problem

We want to estimate the population (or cohort) total of a variable X (actually, we don’t, we want to fit a regression model, but this part of the maths is the same). We’ve got some variables Z that we know for everyone, and we want to do clever sampling.

Thanks to Neyman, we know that if we divide the population into strata of size Nk using Z, the optimal sample size nk in each stratum is proportional to Nkσk, where σk is the standard deviation of X in the stratum. We are happy!
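As a rough illustration of the allocation rule, here is a minimal Python sketch. The function name `neyman_allocation` and the stratum sizes and standard deviations are made up for the example, and σk is taken to be the stratum standard deviation. It returns real-valued sample sizes and does not round them or cap them at Nk.

```python
def neyman_allocation(N, sigma, n_total):
    """Split n_total across strata in proportion to N_k * sigma_k.

    N and sigma are per-stratum population sizes and standard deviations.
    Returns real-valued sizes: no rounding, and no cap at N_k.
    """
    weights = [N_k * s_k for N_k, s_k in zip(N, sigma)]
    total = sum(weights)
    return [n_total * w / total for w in weights]


# Hypothetical strata: the smallest stratum is the most variable.
N = [1000, 500, 100]
sigma = [1.0, 2.0, 10.0]
alloc = neyman_allocation(N, sigma, n_total=300)
print(alloc)
```

In this made-up example Nkσk is 1000 in all three strata, so each stratum gets 100 of the 300 observations even though the strata differ tenfold in size.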

But we don’t know σk. We are sad.

We could do a little pilot study and then estimate σk, and then do more sampling. We are moderately happy.

It’s inefficient to just throw the pilot data away, so we’d want to treat the pilot as the first wave of sampling. We can then work out how many more people to sample in each stratum (in wave 2) to end up with the estimated optimal nk. McIsaac & Cook did this for some regression models, though they didn’t realise (or, at least, didn’t mention) the connection to Neyman allocation.
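One way the wave-2 step might look in code is sketched below. This is not McIsaac & Cook's procedure. The function `wave2_allocation`, the inputs, and the rule for strata that wave 1 already oversampled (give them nothing and share the remaining budget in proportion to shortfall) are all assumptions made for illustration.

```python
def wave2_allocation(N, n_wave1, sigma_hat, n_total):
    """Extra sample sizes for wave 2, aiming at the estimated Neyman allocation.

    N: stratum population sizes; n_wave1: wave-1 sample sizes;
    sigma_hat: wave-1 estimates of the stratum standard deviations;
    n_total: total sample size over both waves.
    """
    weights = [N_k * s_k for N_k, s_k in zip(N, sigma_hat)]
    targets = [n_total * w / sum(weights) for w in weights]
    # A stratum already at or above its target gets nothing more.
    shortfall = [max(0.0, t - n1) for t, n1 in zip(targets, n_wave1)]
    budget = n_total - sum(n_wave1)
    if budget <= 0 or sum(shortfall) == 0:
        return [0.0] * len(N)
    # Heuristic (an assumption of this sketch): share the remaining budget
    # in proportion to each stratum's shortfall.
    return [s * budget / sum(shortfall) for s in shortfall]


# Hypothetical pilot: 140 units sampled in wave 1, 300 planned in total.
N = [2000, 1000, 500]
n_wave1 = [30, 80, 30]
sigma_hat = [1.0, 1.0, 4.0]
extra = wave2_allocation(N, n_wave1, sigma_hat, n_total=300)
print(extra)
```

Here the estimated optimal sizes are 120, 60 and 120. The second stratum already has 80, so it gets nothing more, and the remaining 160 observations are split evenly between the other two. The final design (110, 80, 110) is close to the estimated optimum but not equal to it.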

But we might find out in wave 1 that we have a bad set of strata (sad!) and want to change the stratification in later waves. Is this allowed?

Maths and notation

After wave m we end up with Km strata, with nkm in the sample and Nkm in the population. The nkm are chosen before the last wave.

Technically, each observation has a sampling probability πi(α̂) that depends on the stratifications and sampling probabilities at each wave. Can we ignore these and just work with the realised sampling fractions πi=nkm(i)/Nkm(i)?

The πi are random, so dividing by them will in general introduce bias. However, the estimator would be unbiased for any fixed set of sampling probabilities, and so the bias will be of second order in the variance of the sampling fractions. As long as the nk are not too small, that aspect of the problem will be ok. We still have to worry about the changing stratifications and sampling probabilities over waves.

It is certainly possible to break this approach. Suppose Z contains two variables, an excellent surrogate Z1 for X and an irrelevant variable Z2, and that everything is binary. Further suppose that in wave 1 we stratify based on Z1 and oversample Z1=1, and in wave 2 we stratify based on Z2 and take a balanced sample. We will end up with equal sampling probabilities for every individual (because balanced), but the final design will have oversampled Z1=1 and thus X=1, and so will have a biased estimate for the mean or total of X. No-one would do this, but it indicates the approach isn’t safe without at least some assumptions.
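The counterexample can be checked with a small simulation. The sketch below is one possible reading of it, with assumptions that are not in the text: a population of 10,000, Z1 = 1 for 20% of units, X exactly equal to Z1, wave-1 samples of 200 per Z1 stratum, and a wave 2 that tops each Z2 stratum up to 300.

```python
import random

random.seed(2024)

# Hypothetical population: Z1 = 1 for 20% of units, Z2 is an unrelated
# 50/50 split, and X = Z1 (a perfect surrogate, to keep the sketch short).
N = 10_000
z1 = [int(i % 5 == 0) for i in range(N)]
z2 = [(i // 5) % 2 for i in range(N)]
x = z1[:]
true_total = sum(x)

# Wave 1: stratify on Z1 and oversample Z1 = 1 (200 of 2000 vs 200 of 8000).
sampled = set()
for value in (0, 1):
    stratum = [i for i in range(N) if z1[i] == value]
    sampled.update(random.sample(stratum, 200))

# Wave 2: stratify on Z2 and top each stratum up to 300 sampled units.
for value in (0, 1):
    stratum = [i for i in range(N) if z2[i] == value]
    already = sum(i in sampled for i in stratum)
    unsampled = [i for i in stratum if i not in sampled]
    sampled.update(random.sample(unsampled, max(0, 300 - already)))

# Naive weights from the final (Z2) strata only: pi_i = n_k / N_k.
N_k = {v: z2.count(v) for v in (0, 1)}
n_k = {v: sum(z2[i] == v for i in sampled) for v in (0, 1)}
estimate = sum(x[i] * N_k[z2[i]] / n_k[z2[i]] for i in sampled)

print(true_total, round(estimate))
```

Every sampled unit ends up with the same weight, so the estimate is just N times the sample mean of X. At least 200 of the 600 sampled units have X=1, against 20% in the population, so the estimated total comes out well above the true 2000.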

In the other direction, we know everything is fine if the strata don’t change but the sampling probabilities change over the waves. Also, if the strata start off coarse and are refined at each stage, we’re ok: two observations will never end up with the same πi if they are sampled with different probabilities.
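The refinement case can be checked the same way. The sketch below uses an invented population (X is Z1 plus Gaussian noise) and invented sample sizes. It stratifies on Z1 in wave 1, refines to the four Z1 × Z2 cells in wave 2, and weights by the realised Nk/nk in the final cells. The average over 200 replicates is compared with the true total.

```python
import random

random.seed(7)

# Hypothetical population: binary Z1 (20% ones), unrelated binary Z2,
# and X equal to Z1 plus Gaussian noise.
N = 10_000
z1 = [int(i % 5 == 0) for i in range(N)]
z2 = [(i // 5) % 2 for i in range(N)]
x = [z1[i] + random.gauss(0, 0.5) for i in range(N)]
true_total = sum(x)

coarse = {a: [i for i in range(N) if z1[i] == a] for a in (0, 1)}
fine = {(a, b): [i for i in range(N) if (z1[i], z2[i]) == (a, b)]
        for a in (0, 1) for b in (0, 1)}


def one_replicate():
    # Wave 1: coarse strata on Z1, 200 units from each.
    sampled = set()
    for units in coarse.values():
        sampled.update(random.sample(units, 200))
    # Wave 2: refine to Z1 x Z2 cells and top each cell up to 150 units.
    estimate = 0.0
    for units in fine.values():
        in_cell = [i for i in units if i in sampled]
        rest = [i for i in units if i not in sampled]
        in_cell += random.sample(rest, max(0, 150 - len(in_cell)))
        # Realised weight N_k / n_k within the final cell.
        estimate += len(units) / len(in_cell) * sum(x[i] for i in in_cell)
    return estimate


estimates = [one_replicate() for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)
print(round(true_total), round(mean_estimate))
```

The two printed numbers should be close. Given how many wave-1 units land in a final cell, those units plus the wave-2 top-up form a simple random sample of that cell, so the realised sampling fractions are the right weights.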

What this suggests is that you can make the strata better at each wave, but not worse.

Suppose that once you know what stratum someone is in at the last wave, it doesn’t matter what stratum they are in at earlier waves:

$$E[X_i \mid i \in \text{stratum } k_3] = E[X_i \mid i \in \text{stratum } k_3,\ i \in \text{stratum } k_2,\ i \in \text{stratum } k_1]$$

We can correct the sampling bias in the LHS of this equation by reweighting using 1/πi, so we can correct the bias in the RHS, which describes the actual sampling bias.

The equation holds trivially if the strata are the same at each wave, and pretty obviously if the strata get refined at each wave. It can also hold if you just get better information at each wave and make better strata. For example, in our real problem, instead of X we are interested in the influence functions for a regression model, and at each wave we get better estimates of these influence functions and use them to stratify better.