We have a paper coming out in *Biometrics* describing a reasonably complicated example of validation sampling from electronic health records. “We” here is a lot of people: Bryan Shepherd, Kyunghee Han, Tong Chen, Aihua Bian, Shannon Pugh, Stephany Duda, Thomas Lumley, William Heerman, Pamela Shaw. Pam and Bryan led the project from the statistical side, William Heerman was the clinical lead.

The project was to see how mothers’ weight gain during pregnancy affected the risk of asthma and childhood obesity in kids. We had electronic health record data on about 10,000 mother-child pairs, and we could afford to validate the data on about 1000 of them. In this case, validating the data means checking with the original electronic records of consultations.

So, the question is which records do you choose for validation, and what analysis do you then use. We did what biostatisticians would call AIPW estimation and survey statisticians would call raking, where we do a weighted analysis of the 1000 validated pairs but with weights that incorporate information from the whole cohort. The design was a bit complicated. If \(N_k\) is the number of pairs in stratum \(k\) in the whole cohort, and \(\sigma^2_k\) is the variance of the influence functions for the parameter of interest in stratum \(k\), the optimal sample size is proportional to \(N_k\sigma_k\)^{1}, and the strata should ideally be chosen so this sample size is equal. Fortunately, we did have a single parameter of primary interest in the model, so we can optimise without going into the whole alphabet of A-optimal, C-optimal, D-optimal, E-optimal, G-optimal, I-optimal and so on

We don’t actually know \(\sigma^2_k\), because it depends on the parameters we’re trying to estimate. Since this is a validation study, we have unvalidated measures of all the variables available, so we can work out what design would be optimal if these unvalidated measures were perfect. We then sample a few hundred records, update our estimates, update our designs, and repeat.

The design is more complicated because we actually have two parameters of primary interest: one for obesity and one for asthma. We used the data sampled for obesity to estimate the optimal design for asthma, and took it as an independent sample (which thus had some overlap with the obesity sample, but that’s a solvable problem). Finally, we did all the analyses, and it looks as though there’s a bit of benefit from the validation and general complicatedness. We expect there will be more benefit in settings where there’s more error in the electronic health records.

All hail Neyman↩︎