Probabilities not bounded away from zero

We have a population or cohort of $N$ people divided into $H$ sampling strata, with a sample of size $n_h$ taken from the population $N_h$ in stratum $h$. Let $\pi_{ih}$ be the sampling probability for person $i$ in stratum $h$. When we do asymptotics we usually assume the $\pi_{ih}$ are bounded away from zero. That’s not ideal for, say, case-control studies of rare diseases, where we might want asymptotic approximations based on the case incidence being small (i.e., converging to zero).

In the situations where I’m interested in $\pi_{ih}$ being small, it’s usually small for a whole stratum. Since sampling is independent between strata, there should be a central limit theorem separately for each stratum, and we should be able to add up the limiting Normal approximations for the stratum totals to get a Normal limit for the population total estimate and the population mean estimate.

To formalise this, suppose $n_h\to\infty$ for every stratum (so that asymptotics makes sense), and that $\pi_{ih}N_h/n_h$ is bounded above and below, so that within each stratum the sampling probability has a finite (relative) range. As a simple example, we might have a case stratum with $\pi_i\approx 1$ and a control stratum with very small $\pi_i$.

[Update: As Stas Kolenikov points out, I’m assuming the same strata are small or large along the infinite sequence, so I need something like $n_{h_1}/(n_{h_1}+n_{h_2})\to c_{h_1,h_2}\in [0,1]$ for each pair of strata. This isn’t a meaningful loss of generality since (a) the infinite sequence is an analytic fiction and we might as well set it up for our maximum convenience; and (b) even without assuming anything, every subsequence will have a subsubsequence along which the condition holds.]

By standard results, $n_h^{1/2}(\bar X_{.h}-\mu_h)\stackrel{d}{\to} N(0,\sigma^2_h)$ for each stratum $h$, and by the Skorohod representation theorem we can find an $H$-variate normal vector $\langle Z_h\rangle_{h=1}^H$ with
\[n_h^{1/2}(\bar X_{.h}-\mu_h)\stackrel{p}{\to} Z_h\]
(possibly on a different probability space), to get
\[\bar X_{.h}= \mu_h+ n_h^{-1/2}{Z_h}+o_p(n_h^{-1/2})\]
The $Z_h$ will be independent, with mean zero; write $\sigma^2_h$ for the variances.

[Update: Note that $\sigma^2_h$ is just $\mathrm{var}[Z_h]$, nothing more fundamental. Under stratified random sampling, $\sigma^2_h$ will be $\mathrm{var}[X]$ in stratum $h$ multiplied by the ‘finite population correction’ $(N_h-n_h)/N_h$, but under other sampling schemes it will be something else.]
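To make the stratified-random-sampling case concrete, here is a minimal sketch that computes $\sigma^2_h=\mathrm{var}[X](N_h-n_h)/N_h$ for one small stratum and compares it with $n_h$ times the exact variance of the sample mean, found by enumerating every possible sample. The function names and the toy stratum are my own illustration, and it assumes simple random sampling without replacement within the stratum, with $\mathrm{var}[X]$ taken with divisor $N_h-1$.

```python
from itertools import combinations
from statistics import mean, pvariance, variance


def srs_sigma2(values, n_h):
    """sigma_h^2 = var[X] * (N_h - n_h) / N_h for one stratum (illustrative helper)."""
    N_h = len(values)
    # statistics.variance uses divisor N_h - 1, matching the usual finite-population S^2.
    return variance(values) * (N_h - n_h) / N_h


def exact_scaled_variance(values, n_h):
    """n_h * var(sample mean), by enumerating every equally likely sample of size n_h."""
    sample_means = [mean(s) for s in combinations(values, n_h)]
    return n_h * pvariance(sample_means)


# Hypothetical toy stratum: N_h = 5 people, n_h = 2 sampled.
stratum = [1, 2, 3, 4, 5]
print(srs_sigma2(stratum, 2), exact_scaled_variance(stratum, 2))
```

The two numbers should agree, which is just the statement that the finite population correction gives the exact variance of the stratum mean under this design. Enumeration is only feasible for tiny strata, so this is a check of the formula rather than something to use in practice.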

Now,
\[\bar X_{..} = \frac{1}{N}\sum_{h=1}^H N_h\bar X_{.h}\]
giving
\[\begin{align*} \bar X_{..} &=\sum_{h=1}^H \left[\frac{N_h}{N}\mu_h+\frac{N_hn_h^{-1/2}}{N}Z_h+o_p\left(\frac{N_hn_h^{-1/2}}{N} \right)\right]\\ &=\mu+\left(\sum_{h=1}^H\frac{N_hn_h^{-1/2}}{N}Z_h\right)+o_p\left(\sum_{h=1}^H\frac{ N_h}{N\sqrt{n_h}}\right) \end{align*}\]

First, suppose $N_h/N$ converges to a non-zero constant for each $h$. Let $n_*=\min_h n_h$ and define ${\mathcal H}=\{h: \lim n_*/n_h>0\}$, the set of strata whose sample sizes are of the same order as the smallest one. Then
\[\begin{eqnarray*} \bar X_{..} &= &\mu+\left(\sum_{h=1}^H\frac{N_hn_h^{-1/2}}{N}Z_h\right)+o_p\left(\frac{\max_h N_h}{N\min_h \sqrt{n_h}}\right)\\ &= &\mu+\left(\sum_{h\in{\mathcal H}}\frac{N_hn_h^{-1/2}}{N}Z_h\right)+\sum_{h\not\in{\mathcal H}} o_p(n_*^{-1/2})+o_p\left(\frac{\max_h N_h}{N\sqrt{n_*}}\right)\\ &=& \mu+ n_*^{-1/2}Z+o_p(n_*^{-1/2}) \end{eqnarray*}\]

where $Z\sim N(0, \sigma^2)$ with
\[\sigma^2=\lim_{n_*\to\infty} \sum_{h\in{\mathcal H}} \frac{N_h^2n_*\sigma^2_h}{N^2n_h}\]
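As a rough numerical illustration of why only the strata in ${\mathcal H}$ matter, the sketch below evaluates the per-stratum terms $N_h^2n_*\sigma^2_h/(N^2n_h)$ at finite sample sizes. The function name and the two-stratum design are hypothetical, chosen only to show the orders of magnitude; at any finite $n$ every stratum contributes something, and the set ${\mathcal H}$ is defined only in the limit.

```python
def scaled_variance_terms(N_h, n_h, sigma2_h):
    """Per-stratum terms N_h^2 n_* sigma_h^2 / (N^2 n_h), evaluated at finite sizes."""
    N = sum(N_h)
    n_star = min(n_h)
    return [(Nh / N) ** 2 * (n_star / nh) * s2
            for Nh, nh, s2 in zip(N_h, n_h, sigma2_h)]


# Hypothetical design: equal stratum populations, very unequal sample sizes.
terms = scaled_variance_terms(N_h=[50_000, 50_000], n_h=[100, 10_000], sigma2_h=[1.0, 1.0])
print(terms)
```

With these made-up numbers the stratum with the larger sample contributes a term one hundred times smaller than the other, since its term is scaled by $n_*/n_h=0.01$. Along a sequence where that ratio goes to zero the stratum drops out of the limiting variance altogether.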

Alternatively, for case–control sampling we may have $N_h/N\to 0$ in the case stratum, but we would have $n_h$ all of the same order, and so of the same order as their total, $n$. The limiting distribution is dominated by the largest strata: define ${\mathcal H}'=\{h: \lim N_h/N>0\}$ (which is non-empty as $H$ is finite). Then

\[\begin{eqnarray*} \bar X_{..} &=&\mu+\left(\sum_{h=1}^H\frac{N_hn_h^{-1/2}}{N}Z_h\right)+o_p\left(\sum_{h=1}^H\frac{ N_h}{N\sqrt{n_h}}\right)\\ &=&\mu+\left(\sum_{h\in{\mathcal H}'}\frac{N_hn_h^{-1/2}}{N}Z_h\right)+\sum_{h\not\in{\mathcal H}'} o_p(n^{-1/2})+o_p\left(n^{-1/2}\right)\\ &=& \mu+ n^{-1/2}Z+o_p(n^{-1/2}) \end{eqnarray*}\]
where $Z\sim N(0, \sigma^2)$ with
\[\sigma^2=\lim_{n\to\infty} \sum_{h\in{\mathcal H}'} \frac{N_h^2n\sigma^2_h}{N^2n_h}\]
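A small Monte Carlo check of the case–control regime might look like the following. Everything specific here is an assumption for illustration: the finite population is simulated (cases centred at 5, controls at 0), sampling within each stratum is simple random sampling without replacement, and the population and sample sizes are arbitrary. It compares $n$ times the simulated variance of $\bar X_{..}$ with the sum of the per-stratum terms $N_h^2n\sigma^2_h/(N^2n_h)$.

```python
import random
from statistics import fmean, variance


def simulate_case_control(N_case=200, N_control=20000, n_case=100, n_control=100,
                          reps=2000, seed=1):
    rng = random.Random(seed)
    # Hypothetical finite population: a small case stratum and a large control stratum.
    strata = [[rng.gauss(5, 1) for _ in range(N_case)],
              [rng.gauss(0, 1) for _ in range(N_control)]]
    sizes = [n_case, n_control]
    N = N_case + N_control
    n = n_case + n_control

    # Monte Carlo distribution of the stratified estimate of the population mean.
    estimates = []
    for _ in range(reps):
        est = sum(len(pop) / N * fmean(rng.sample(pop, n_h))
                  for pop, n_h in zip(strata, sizes))
        estimates.append(est)

    # Per-stratum terms N_h^2 n sigma_h^2 / (N^2 n_h), using the
    # stratified-random-sampling value sigma_h^2 = var[X] (N_h - n_h) / N_h.
    terms = [(len(pop) / N) ** 2 * (n / n_h) * variance(pop) * (len(pop) - n_h) / len(pop)
             for pop, n_h in zip(strata, sizes)]
    return n * variance(estimates), terms


simulated, terms = simulate_case_control()
print(simulated, sum(terms), terms)
```

In this setup the case-stratum term is several orders of magnitude smaller than the control-stratum term, because it is weighted by $(N_h/N)^2$, so the sum over ${\mathcal H}'$ alone is already close to the full sum. The simulated value should match the sum up to Monte Carlo error of a few percent.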

Weaker conditions on $N_h$ and $n_h$ are clearly possible: it is only necessary to identify which terms dominate the limiting distribution of $\bar X_{..}$, since the limiting distribution of estimated stratum totals is always independent $H$-variate Normal under appropriate scaling.