Bayesian Analysis

A Unified Framework for De-Duplication and Population Size Estimation

Andrea Tancredi, Rebecca Steorts, and Brunero Liseo

Advance publication

This article is in its final form and can be cited using the date of online publication and the DOI.

Full-text: Open access

Abstract

Data de-duplication is the process of detecting records in one or more data sets that refer to the same entity. In this paper, we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of N different entities. The main novelty of our approach is to treat the population size N as an unknown model parameter. As a result, a salient feature of the proposed method is that the model can account for de-duplication uncertainty in the population size estimation. As by-products of our approach, we illustrate the relationships between de-duplication problems and capture-recapture models, and we obtain a more adequate prior distribution on the linkage structure. Moreover, we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at the population level. We apply our method to two synthetic data sets comprising German names. In addition, we illustrate a real data application, where we match records from two lists that report information about people killed in the recent Syrian conflict.
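
To make the setting in the abstract concrete, the following is a minimal Python sketch of a latent entity data-generating process and of the classical two-list capture-recapture idea it connects to. It is not the authors' model or algorithm. The function names (make_population, draw_list, lincoln_petersen), the categorical fields, the uniform sampling of entities with replacement, and the single distortion probability are all assumptions made for illustration. The estimator shown is the standard Lincoln-Petersen formula, which requires the true entity labels to be known; the paper instead treats both the linkage and N as unknown.

```python
import random


def make_population(n_entities, n_fields, n_categories, rng):
    """Latent key variables for N entities (hypothetical categorical fields)."""
    return [[rng.randrange(n_categories) for _ in range(n_fields)]
            for _ in range(n_entities)]


def draw_list(population, n_records, n_categories, distortion_prob, rng):
    """Draw one list of records as perturbed copies of latent entities.

    Assumption: each record picks an entity uniformly at random, with
    replacement, so the same entity can appear several times (duplicates).
    """
    records, labels = [], []
    for _ in range(n_records):
        entity = rng.randrange(len(population))
        # With probability distortion_prob a field is redrawn at random;
        # the redrawn value may happen to equal the true one.
        record = [rng.randrange(n_categories)
                  if rng.random() < distortion_prob else value
                  for value in population[entity]]
        records.append(record)
        labels.append(entity)
    return records, labels


def lincoln_petersen(labels_a, labels_b):
    """Classical two-list estimate of N, assuming the true labels are known."""
    seen_a, seen_b = set(labels_a), set(labels_b)
    overlap = len(seen_a & seen_b)
    if overlap == 0:
        raise ValueError("no entity appears in both lists")
    return len(seen_a) * len(seen_b) / overlap
```

In practice the labels returned by draw_list are unobserved, which is why uncertainty about which records are duplicates carries over into uncertainty about N.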

Article information

Source
Bayesian Anal., Advance publication (2018), 26 pages.

Dates
First available in Project Euclid: 7 March 2019

Permanent link to this document
https://projecteuclid.org/euclid.ba/1551949260

Digital Object Identifier
doi:10.1214/19-BA1146

Keywords
cluster analysis; entity resolution; partition models; record linkage

Rights
Creative Commons Attribution 4.0 International License.

Citation

Tancredi, Andrea; Steorts, Rebecca; Liseo, Brunero. A Unified Framework for De-Duplication and Population Size Estimation. Bayesian Anal., advance publication, 7 March 2019. doi:10.1214/19-BA1146. https://projecteuclid.org/euclid.ba/1551949260



