Linkage and multiple imputation - Biased and Inefficient

For a while, I’ve been thinking about multiple imputation and its extension to record linkage. Record linkage¹ is about taking two (or more) data sets and saying that record A in data set 1 is probably the same person as record 23 in data set 2. The probabilistic framework for it dates back to (or is at least routinely attributed to) Fellegi & Sunter (eg). It says that two records will show the same name and year of birth, say, either if the data are correct and they are the same person, or if they are different people whose recorded values just happen to agree, and all the possibilities can be modelled. Well, they can be modelled once you add some identifiability assumptions, anyway².
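To make the idea concrete, here is a minimal sketch of how a Fellegi-Sunter-style model turns field agreements into a match probability. It assumes the fields agree or disagree independently given match status (one common identifiability assumption). The function name `match_probability`, the per-field probabilities `m` and `u`, and the prior are all hypothetical illustration values, not estimates from any real data.

```python
import math

def match_probability(agreements, m, u, prior):
    """Posterior probability that a record pair is the same person.

    agreements: list of bools, one per comparison field
    m[i]: P(field i agrees | same person)
    u[i]: P(field i agrees | different people)
    prior: P(same person) before comparing any fields
    Assumes fields are conditionally independent given match status.
    """
    log_odds = math.log(prior / (1 - prior))
    for agree, m_i, u_i in zip(agreements, m, u):
        if agree:
            # agreement is evidence for a match when m_i > u_i
            log_odds += math.log(m_i / u_i)
        else:
            # disagreement is evidence against a match
            log_odds += math.log((1 - m_i) / (1 - u_i))
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical fields: name, year of birth
m = [0.95, 0.98]
u = [0.01, 0.02]
p_both = match_probability([True, True], m, u, prior=0.001)
p_name_only = match_probability([True, False], m, u, prior=0.001)
```

With these made-up numbers, agreement on both fields moves a very small prior to a fairly high match probability, while agreement on name alone leaves it small.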

The models give you a number for each pair of records saying how likely they are to be the same person. The classical thing to do with these numbers is to pick a threshold and treat any pair above the threshold as the same person and any pair below the threshold as two different people³. You might also use the number as a weight, so that identities you’re less sure about get less weight in the analysis.

If you think of the true linkage structure as a big binary matrix, this approach is basically single imputation of the binary matrix: you don’t know the matrix, but you plug in your best guess. Obviously better than single imputation would be a full Bayesian model. The big binary matrix is a set of unknown values, so it gets a model that goes on top of the linkage-error model, and the actual model you’re trying to fit to the data (eg, relationship between depression and unemployment adjusted for socio-economic status) goes on top of all that. Priors everywhere; MCMC; all the trimmings.
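As a toy illustration of the 'single imputation' view, the sketch below thresholds a small matrix of match probabilities into a 0/1 linkage matrix. The probabilities and the threshold are invented for the example.

```python
import numpy as np

# Hypothetical match probabilities: rows are records in data set 1,
# columns are records in data set 2.
match_prob = np.array([
    [0.97, 0.01, 0.02],
    [0.03, 0.55, 0.40],
    [0.01, 0.05, 0.10],
])

threshold = 0.9  # illustrative cut-off, not a recommendation

# Single imputation: plug in one best-guess binary linkage matrix.
links_single = (match_prob >= threshold).astype(int)
```

Here the second record in data set 1 ends up unlinked, even though the model says it very probably matches one of two candidates. That discarded uncertainty is what a full Bayesian model would keep.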

The first example of the full Bayesian approach I saw was a talk by Rebecca Steorts⁴ in 2014, which was basically this paper. This doesn’t actually incorporate the intended analysis model; it’s just a model for linkage and data error. The big advantage of separating the two is that the person doing the analysis and the person doing the linkage need very different expertise, so it’s convenient if they don’t have to be the same person. The disadvantage is that you introduce some attenuation of the analysis model by not including it in the linkage: it’s a bit like multiple imputation where you don’t use the outcome variable in the imputation model.

What the talk made me think about, though, was multiple imputation of linkage. Build a linkage model, estimate the posterior distributions, and sample some number \(M\) of sets of links from the posteriors. These sets of links, which require sensitive identifiers and MCMC expertise to create, can be distributed to people who don’t need to know anything more than Rubin’s rules for multiple-imputation analysis. The computational cost of using the multiple links isn’t trivial, but it is only linear in \(M\). There will be some attenuation of relationships in analyses because the analysis model wasn’t used for linking, but no more than in the classical approach.
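For the analyst's side of this, a minimal sketch of Rubin's rules for a scalar estimate follows. It assumes the \(M\) linked data sets have already been produced and the analysis model has been fitted to each one. The function name `rubin_pool` and the example numbers are hypothetical.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine M point estimates and their variances with Rubin's rules.

    estimates: one estimate per imputed set of links
    variances: the squared standard error from each fit
    Returns the pooled estimate and its total variance.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()          # average sampling variance
    between = estimates.var(ddof=1)    # variance due to linkage uncertainty
    total = within + (1 + 1 / M) * between
    return pooled, total

# Hypothetical results from fitting the same model to M = 3 sets of links
pooled_estimate, total_variance = rubin_pool(
    estimates=[1.0, 1.2, 0.8],
    variances=[0.04, 0.05, 0.03],
)
pooled_se = total_variance ** 0.5
```

The between-imputation term is where linkage uncertainty enters: if every set of links gave the same estimate, the pooled variance would reduce to the ordinary within-fit variance.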

In 2022, Mengdi Liu did an Honours project with me on this topic. Unfortunately, one of the early findings was this paper where a group of official statisticians⁵ had used a multiple-imputation-linkage approach on a real problem linking two large databases. It was still an interesting Honours project to work through finding and using software for ‘Bayesian entity resolution’ and do some simulations.

As you might expect, doing better than the classical single-imputation Fellegi-Sunter method is only possible if there are quite a lot of links that have high enough probability to be important but low enough to miss the threshold for linkage. It’s also necessary that the people with lower-probability links are different in the analysis model from those with higher-probability links, as otherwise the classical approach basically has data missing at random. But under those two conditions, incorporating linkage uncertainty with multiple imputation can make a noticeable difference.

The condition that hard-to-link people are different is very reasonable in practice: people with multiple versions of their name, or people with uncertain addresses, will tend to be different in other social attributes. It’s less clear (to me) how often the proportion of uncertain links is high enough to matter; the papers on Bayesian Entity Resolution give the impression that it’s important, but they would, wouldn’t they?

Multiple linkage probably won’t revolutionise anything, but it’s reasonably feasible computationally nowadays and I think it would be worth exploring a bit more. It would also be worth me getting ideas actually researched a bit faster.

¹ or ‘entity resolution’
² it’s easier if you ignore problems like the effect of ethnicity and country of birth on name errors
³ yes, there is a potential transitivity problem here
⁴ who gives excellent talks
⁵ sample of official statisticians? frame of official statisticians?