Attention conservation notice: an attempt to get a small number of other people, probably not including you, to adopt our notation.
Together with groups at the University of Pennsylvania and Vanderbilt, we have been working on methods for the design and analysis of two-phase samples, samples taken from an existing cohort or database to measure new variables. The problem combines measurement-error, missing-data, and sampling ideas, so questions of notation can get fraught. For example, there are otherwise reasonable people who would like $W$ to be something other than a vector of weights.
Here is an attempt at notation:
- We sample $n$ observations from a cohort of $N$, where the $i$th observation is sampled with known probability $\pi_i$. Often the sampling is independent (or independent except for fixed $n$); if not, we also know the pairwise probability $\pi_{ij}$ that both $i$ and $j$ were sampled.
- The sampling weights are $w_i = 1/\pi_i$, or adjusted versions of this to incorporate cohort-level information.
- We have variables $Z$, $A$, and (typically) $Y$ measured for everyone in the cohort and $X$ measured on the subsample.
- $R_i$ is the indicator that observation $i$ is in the subsample, so $E[R_i] = \pi_i$.
- The outcome model is for $Y \mid X, Z$. It is the model we would fit if we had complete data. Its parameters are $\beta$; its loglikelihood is $\ell(\beta)$; its score function is $U(\beta) = \partial \ell(\beta)/\partial\beta$.
- The imputation model is for $X \mid Y, Z, A$. Its parameters are $\alpha$. It may be used to produce single imputations $\hat X_i$ or multiple imputations $X_i^{(m)}$ or $X_i^{*(m)}$ for $m = 1, \dots, M$.
- The phase-1 model is for $Y \mid \hat X, Z$. It has influence functions $h_i$. Or for multiple imputation it is for $Y \mid X^{(m)}, Z$ and has influence functions $h_i^{(m)}$.
- We use the term raking (or generalised raking) for the adjusted-weight estimators, to avoid confusion with the unrelated ‘regression calibration’ technique in the measurement-error literature. But we still call the equations that constrain the adjusted weights ‘the calibration constraints’ (written out in the sketch just after this list).
- On occasion, we may use $X^*$ and $Y^*$ for elements of $A$ that are versions of $X$ and $Y$ measured with error, because tradition. Obviously we won’t use the stars to indicate multiple imputation when we do.
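
To fix ideas, here is one way the pieces fit together in this notation (a sketch; the specific form is the standard generalised-raking one, not something the list above pins down). Writing $U_i(\beta)$ for observation $i$'s contribution to the score, the design-based estimator solves the weighted score equations

$$ \sum_{i=1}^{N} \frac{R_i}{\pi_i}\, U_i(\beta) = 0, $$

and raking replaces $1/\pi_i$ by adjusted weights $g_i/\pi_i$, with the $g_i$ chosen to be as close to 1 as possible subject to the calibration constraints

$$ \sum_{i=1}^{N} \frac{R_i\, g_i}{\pi_i}\, h_i \;=\; \sum_{i=1}^{N} h_i, $$

where the $h_i$ are the whole-cohort influence functions from the phase-1 model.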
The literature has not really made a consistent choice between $X$ and $Z$, though there is a tendency in measurement-error papers for $X$ to be the true predictor value, which fits our notation. The distinction between $Z$ and $A$ is that $Z$ would be in the outcome model even if you had $X$ for everyone, and $A$ would not. In a classical measurement-error approach, the mismeasured covariate would be uninteresting if you had the true value, so it would be an $A$, not a $Z$.
When $Y$ isn’t measured on everyone (eg, $Y$ is measured with error on the whole cohort and accurately on the subsample), the imputation model doesn’t have $Y$ on the RHS.
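
For anyone who prefers code to notation, here is a minimal simulated example of the setup (Python, purely illustrative and not anything we actually use; the variable names mirror the notation above, and a plain weighted fit stands in for the fancier raking and imputation estimators):

```python
# A small simulation in the notation above (illustrative only).
# Z and an error-prone X* (an element of A) are known for the whole cohort;
# the true X is only used on the phase-2 subsample; Y is known for everyone.
# The fit shown is the plain design-based one, with weights w_i = 1/pi_i.
import numpy as np

rng = np.random.default_rng(2024)
N = 10_000                                   # cohort size

Z = rng.normal(size=N)                       # cheap covariate, in the outcome model
X = 0.5 * Z + rng.normal(size=N)             # expensive covariate, phase 2 only
X_star = X + rng.normal(scale=0.8, size=N)   # mismeasured version, phase 1
Y = 1.0 + 2.0 * X - 1.0 * Z + rng.normal(size=N)   # outcome model: Y | X, Z

# Phase-2 sampling with known probabilities pi_i, oversampling extreme X*
pi = np.where(np.abs(X_star) > 1.5, 0.8, 0.1)
R = rng.random(N) < pi                       # R_i: in-subsample indicator
w = 1.0 / pi                                 # sampling weights w_i

# Design-based fit of the outcome model on the subsample: weighted least
# squares, i.e. solving sum_i R_i U_i(beta) / pi_i = 0 for a linear model
D = np.column_stack([np.ones(N), X, Z])      # design matrix for Y | X, Z
sw = np.sqrt(w[R])
beta_hat, *_ = np.linalg.lstsq(D[R] * sw[:, None], Y[R] * sw, rcond=None)
print("weighted estimate of (intercept, beta_X, beta_Z):", beta_hat)
```

A raking version would then adjust the weights on the sampled rows so that the weighted totals of the phase-1 influence functions match their cohort totals, as in the calibration constraints above.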