Countermatching - Biased and Inefficient

Countermatching is a simple case-control sampling mechanism that makes people uncomfortable when they first encounter it. Get ready.

Suppose you want to study the effect of a relatively rare exposure (a sufficiently high dose of radiation to the heart) on a relatively rare outcome (heart failure in breast cancer survivors). If you just took a random sample of the population there would be very few breast cancer survivors, so you work with a cohort of breast cancer survivors. But even if you took a random sample of them, not that many would have heart failure. We’ve known how to handle this for generations: you take a sample of cases, and then a sample of about as many ‘controls’, people without the disease.

Now the problem will likely be that the cases are older than the controls, since heart failure gets more common as you get older. If we’re not interested in age we can match the cases and controls. For each case, find a breast cancer survivor of the same age but without heart failure as a control, and treat the two as a pair. If one of the pair is exposed and the other is unexposed, the pair gives information about the association between exposure and disease.

Even with matching on age, there’s a problem with the rare exposure. In many, perhaps most pairs, neither the case nor the control will be exposed. These pairs are uninformative about the association. Wouldn’t it be nice if we could ensure that every pair was informative?
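To get a feel for how bad this can be, here is a small back-of-envelope sketch. It assumes the case's and control's exposures are independent given age, and the prevalences are made up for illustration, not taken from the breast cancer study.

```python
def discordant_probability(p_case: float, p_control: float) -> float:
    """Probability that exactly one member of a matched pair is exposed.

    Assumes the two exposures are independent given the matching variable.
    """
    return p_case * (1 - p_control) + (1 - p_case) * p_control


# Hypothetical prevalences: 10% of cases and 5% of controls are exposed.
p_informative = discordant_probability(0.10, 0.05)
print(f"{p_informative:.2f} of pairs are informative")
```

With these invented numbers only about 14% of the pairs carry any information about the association; the rest cost just as much to collect.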

The only way to make every pair informative is to match an unexposed control with every exposed case, and an exposed control with every unexposed case. There are two obvious problems. First, we’d need to know the exposure for everyone in the population to do this, so why would we even be sampling? Second, that’s just weird and will mess up the association between exposure and disease.

To address the first problem, while we can’t do this exactly, we might have a reasonable guess as to exposure for everyone in the population. In the breast cancer example (which I heard in a talk by Bryan Langholz) they knew which side the tumour had been on, and whether the woman had received radiotherapy. There would be no exposure without radiotherapy or for right-sided tumours; there would probably be some exposure for radiotherapy in left-sided tumours. The researchers could choose a definitely-unexposed control for each possibly-exposed case, and a possibly-exposed control for every definitely-unexposed case. Once they had the sample they could look up the exact direction and intensity used for each radiation beam and work out the exact dose to the heart, but it wasn’t feasible to do this for all the women in their study population.
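The sampling step might look something like the following sketch. The data layout (dicts with `id` and `surrogate` keys) and the function name are hypothetical; the only idea taken from the design is that the control comes from the surrogate stratum opposite to the case's, among women at risk at the same age.

```python
import random


def countermatch_control(case, risk_set, rng):
    """Pick one control from the surrogate stratum opposite to the case's.

    case     : dict with keys 'id' and 'surrogate' (True = possibly exposed)
    risk_set : list of such dicts, the non-cases at risk at the case's age
    rng      : a random.Random instance, so the draw is reproducible
    Returns the sampled control and the number of eligible controls.
    """
    eligible = [p for p in risk_set if p["surrogate"] != case["surrogate"]]
    if not eligible:
        raise ValueError("no control in the opposite surrogate stratum")
    return rng.choice(eligible), len(eligible)


# Toy risk set: three definitely-unexposed women, one possibly-exposed.
risk_set = [
    {"id": 1, "surrogate": False},
    {"id": 2, "surrogate": False},
    {"id": 3, "surrogate": False},
    {"id": 4, "surrogate": True},
]
case = {"id": 0, "surrogate": True}
control, n_eligible = countermatch_control(case, risk_set, random.Random(1))
```

Returning the number of eligible controls alongside the control is deliberate: that count is exactly what you need later to undo the biased sampling.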

So, we can countermatch on a surrogate exposure to get a more informative matched case-control sample, but apparently at the cost of completely stuffing up the association we’re trying to measure. That’s why the idea initially makes people uncomfortable, especially epidemiologists, who have been trained in all the terrible things that can happen in a case-control study when the sampling is biased. Fortunately, we know exactly how the sampling is biased, because we did it. We can correct the bias by reweighting the data.

The usual way to reweight the data works with each countermatched pair separately. The weight depends on the ratio of the number of potential controls in the (surrogate) exposed and unexposed groups, and the resulting weighted likelihood is a genuine Cox-style partial likelihood.
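As a rough sketch of what one pair contributes, assume 1:1 countermatching on a binary surrogate, a single exposure value `z` per person, and weights equal to the number of women at risk in each person's surrogate stratum at the case's age. The function and argument names are mine, not from any package.

```python
import math


def countermatched_loglik_term(beta, z_case, z_control,
                               n_case_stratum, n_control_stratum):
    """Log partial-likelihood contribution of one countermatched pair.

    Each person's relative risk exp(beta * z) is multiplied by the size of
    the surrogate stratum she was drawn from (an assumption of this sketch).
    """
    log_case = math.log(n_case_stratum) + beta * z_case
    log_ctrl = math.log(n_control_stratum) + beta * z_control
    # log-sum-exp for the denominator, to avoid overflow
    m = max(log_case, log_ctrl)
    log_den = m + math.log(math.exp(log_case - m) + math.exp(log_ctrl - m))
    return log_case - log_den
```

Only the ratio of the two stratum sizes matters: multiplying both by the same constant leaves the contribution unchanged. With equal stratum sizes the term reduces to the ordinary matched-pair conditional likelihood.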

Another way to reweight the data is to break the matches, so a case is compared to all the sampled people, both cases and controls, at the same age. Sven Ove Samuelsen came up with this idea for matched case-control studies. (I thought of it independently, but more than ten years too late.) Claudia Rivera and I have recently extended it to countermatched designs.

In this analysis the weight for a case is just 1; the weight for a control is the reciprocal of the sampling probability of that control – not just in the matched pair where she was sampled, but cumulatively across all matched pairs. For this to work, you need to be able to work out exposure for each sampled person at a range of ages, not just at a single age. In the breast-cancer example that’s fine.
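A minimal sketch of the cumulative weight, assuming the draws for different cases are independent and that you have already worked out, for each matched set where this woman was eligible, her chance of being picked there. The function name and input format are hypothetical.

```python
import math


def control_weight(per_set_probs):
    """Inverse probability that a control is sampled at least once.

    per_set_probs : her selection probability in each matched set where she
                    was eligible (sampling assumed independent across sets)
    """
    if not per_set_probs:
        raise ValueError("control was never eligible to be sampled")
    p_never = math.prod(1 - p for p in per_set_probs)
    return 1 / (1 - p_never)


# Eligible for three cases, with a 1-in-50, 1-in-40 and 1-in-30 chance each.
w = control_weight([1 / 50, 1 / 40, 1 / 30])
```

A woman who was eligible for many cases has a higher cumulative chance of ending up in the sample, so she gets a smaller weight than one who was eligible only once.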

This weighted likelihood isn’t a partial likelihood; it’s one of the second-rate survey pseudolikelihoods. Typically, the pseudolikelihood analysis is less efficient, despite taking more work and requiring more data, so why would anyone use it? Unlike the partial likelihood, the survey pseudolikelihood allows calibration of weights to bring in information on any relevant variables from all the unsampled people in your study cohort, and that can more than pay for the effort.
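The simplest form of calibration is post-stratification: rescale the weights so that, within each group of a variable known for the whole cohort, the weighted sample count matches the cohort count. The sketch below shows just that special case; the names are hypothetical, and real calibration would usually use several variables at once.

```python
from collections import defaultdict


def poststratify(weights, groups, cohort_counts):
    """Rescale weights so weighted group totals match known cohort counts.

    weights       : current sampling weights, one per sampled person
    groups        : group label for each sampled person
    cohort_counts : dict mapping group label to its count in the full cohort
    """
    totals = defaultdict(float)
    for w, g in zip(weights, groups):
        totals[g] += w
    return [w * cohort_counts[g] / totals[g] for w, g in zip(weights, groups)]


# Toy example: weighted totals are 4 and 5, cohort counts are 10 and 4.
new_w = poststratify([2.0, 2.0, 5.0], ["a", "a", "b"], {"a": 10, "b": 4})
```

After rescaling, the weights in group "a" sum to 10 and the weight in group "b" is 4, so the sample now reproduces the cohort totals exactly for that variable.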