3 min read

Probabilities not bounded away from zero

We have a population or cohort of \(N\) people divided into \(H\) sampling strata, with a sample of size \(n_h\) taken from the population \(N_h\) in stratum \(h\). Let \(\pi_{ij}\) be the sampling probability for person \(i\) in stratum \(h\). When we do asymptotics we usually assume \(\pi_{ih}\) are bounded away from zero. That’s not ideal for, say, case-control studies of rare diseases, where we might want asymptotic approximations based on the case incidence being small (ie, converging to zero). 

In the situations where I’m interested in \(\pi_{ih}\) being small, it’s usually small for a whole stratum. Since sampling is independent between strata, there should be a central limit theorem separately for each stratum, and we should be able to add up the limiting Normal approximations for the stratum totals to get a Normal limit for the population total estimate and the population mean estimate. 

To formalise this,  suppose \(n_h\to\infty\) for every stratum (so that asymptotics makes sense), and that \(\pi_{ih}N_h/n_h\) is bounded above and below, so that within each stratum the sampling probability has a finite (relative) range. As a simple example, we might have a case stratum with \(\pi_i\approx 1\) and a control stratum with very small \(\pi_i\)

[Update: As Stas Kolenikov points out, I’m assuming the same strata are small large along the infinite sequence, so I need something like \(n_{h_1}/(n_{h_1}+n_{h_2})\to c_{h_1,h_2}\in [0,1]\) for each pair of strata.  This isn’t a meaningful loss of generality since (a) the infinite sequence is an analytic fiction and we might as well set it up for our maximum convenience; and (b) even without assuming anything, every subsequence will have a subsubsequence along which the condition holds]

By standard results, \(n_h^{1/2}(\bar X_{.h}-\mu_h)\stackrel{d}{\to} N(0,\sigma^2_h)\) for each stratum \(h\) , and by the Skorohod representation theorem we can find an \(H\)-variate normal vector \(\langle Z_h\rangle_{h=1}^H\) with
\[n_h^{1/2}(\bar X_{.h}-\mu_h)\stackrel{p}{\to} Z_h\]
(possibly on a different probability space), to get
\[\bar X_{.h}= \mu_h+ n_h^{-1/2}{Z_h}+o_p(n_h^{-1/2})\]
The \(Z_h\) will be independent, with mean zero; write \(\sigma^2_h\) for the variances. 

[Update: Note that \(\sigma^2_h\) is just \(\mathrm{var}[Z_h]\), nothing more fundamental. Under stratified random sampling, \(\sigma^2_h\) will be \(\mathrm{var}[X]\) in stratum \(h\) multiplied by the ‘finite population correction” \((N_h-n_h)/N_h\), but under other sampling schemes it will be something else]

\[\bar X_{..} = \frac{1}{N}\sum_{h=1}^H N_h\bar X_{.h}\]
\[\begin{align*} \bar X_{..} &=\sum_{h=1}^H \frac{N_h}{N}\mu_h+\frac{N_hn_h^{-1/2}}{N}Z_h+o_p\left(\frac{N_hn_h^{-1/2}}{N} \right)\\ &=\mu+\left(\sum_{h=1}^H\frac{N_hn_h^{-1/2}}{N}Z_h\right)+o_p\left(\sum_{h=1}^H\frac{ N_h}{N\sqrt{n_h}}\right) \end{align*}\]

First, suppose $ N_h/N$ converges to a non-zero constant for each \(h\). Let \(n_*=\min_h n_h\) and define \({\mathcal H}=\{h: \lim n_*/n_h>0\}\)
\[\begin{eqnarray*} \bar X_{..} &= &\mu+\left(\sum_{h=1}^H\frac{N_hn_h^{-1/2}}{N}Z_h\right)+o_p\left(\frac{\max_h N_h}{N\min_h \sqrt{n_h}}\right)\\  &= &\mu+\left(\sum_{h\in{\mathcal H}}\frac{N_hn_*^{-1/2}}{N}Z_h\right)+\sum_{h\not\in{\mathcal H}} o_p(n_*^{-1/2})+o_p\left(\frac{\max_h N_h}{N\sqrt{n_*}}\right)\\ &=& \mu+ n_*^{-1/2}Z+o_p(n_*^{-1/2}) \\ \end{eqnarray*}\]

where \(Z\sim N(0, \sigma^2)\) with
\[\sigma^2=\lim_{n_*\to\infty} \sum_{h\in{\mathcal H}} \frac{N_h^2n_*\sigma^2_h}{N^2n_h}\]

Alternatively, for case–control sampling we may have \(N_h/N\to 0\) in the case stratum, but we would have \(n_h\) all of the same order, and so of the same order as their total, \(n\). The limiting distribution is dominated by the largest strata: define \({\mathcal H}'=\{h: \lim N_h/N>0\}\) (which is non-empty as \(H\) is finite)

\[\begin{eqnarray*} \bar X_{..} &=&\mu+\left(\sum_{h=1}^H\frac{N_hn_h^{-1/2}}{N}Z_h\right)+o_p\left(\sum_{h=1}^H\frac{ N_h}{N\sqrt{n_h}}\right)\\ &=&\mu+\left(\sum_{h\in{\mathcal H}'}\frac{N_hn^{-1/2}}{N}Z_h\right)+\sum_{h\not\in{\mathcal H}'} o_p(n^{-1/2})+o_p\left(n^{-1/2}\right)\\\ &=& \mu+ n^{-1/2}Z+o_p(n^{-1/2})\\ \end{eqnarray*}\]
where \(Z\sim N(0, \sigma^2)\) with
\[\sigma^2=\lim_{n\to\infty} \sum_{h\in{\mathcal H}} \frac{N_h^2n\sigma^2_h}{N^2n_h}\]

Weaker conditions on \(N_h\) and \(n_h\) are clearly possible: it is only necessary to identify which terms dominate the limiting distribution of \(\bar X_{..}\), since the limiting distribution of estimated stratum totals is always independent \(H\)-variate Normal under appropriate scaling.