The Annals of Applied Statistics

Detecting duplicates in a homicide registry using a Bayesian partitioning approach

Mauricio Sadinle

Full-text: Open access

Abstract

Finding duplicates in homicide registries is an important step in keeping an accurate account of lethal violence. This task is not trivial when unique identifiers of the individuals are not available, and it is especially challenging when records are subject to errors and missing values. Traditional approaches to duplicate detection output independent decisions on the coreference status of each pair of records, which often leads to nontransitive decisions that have to be reconciled in some ad-hoc fashion. The task of finding duplicate records in a data file can be alternatively posed as partitioning the data file into groups of coreferent records. We present an approach that targets this partition of the file as the parameter of interest, thereby ensuring transitive decisions. Our Bayesian implementation allows us to incorporate prior information on the reliability of the fields in the data file, which is especially useful when no training data are available, and it also provides a proper account of the uncertainty in the duplicate detection decisions. We present a study to detect killings that were reported multiple times to the United Nations Truth Commission for El Salvador.
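
The transitivity issue described above can be illustrated with a minimal sketch. The records, the pairwise decisions, and the function names below are hypothetical and are not taken from the paper. The sketch only shows that independent pairwise decisions may be nontransitive, whereas decisions implied by a partition of the file are transitive by construction.

```python
from itertools import combinations

def is_transitive(n, links):
    """Check whether pairwise 'coreferent' decisions on records 0..n-1 are transitive."""
    linked = {frozenset(p) for p in links}
    for i, j, k in combinations(range(n), 3):
        pairs = [frozenset((i, j)), frozenset((j, k)), frozenset((i, k))]
        # Exactly two linked pairs within a triple is a transitivity violation.
        if sum(p in linked for p in pairs) == 2:
            return False
    return True

def links_from_partition(partition):
    """Return the pairwise decisions implied by a partition (groups of record indices)."""
    return {frozenset(p)
            for group in partition
            for p in combinations(sorted(group), 2)}

# Hypothetical independent pairwise decisions: 0~1 and 1~2, but 0 and 2 not linked.
pairwise = {frozenset((0, 1)), frozenset((1, 2))}

# Hypothetical partition of four records into groups of coreferent records.
partition = [{0, 1, 2}, {3}]

print(is_transitive(3, pairwise))
print(is_transitive(4, links_from_partition(partition)))
```

Treating the partition itself as the unknown, rather than each pair separately, is what rules out the inconsistent case in the first call.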

Article information

Source
Ann. Appl. Stat., Volume 8, Number 4 (2014), 2404-2434.

Dates
First available in Project Euclid: 19 December 2014

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1419001749

Digital Object Identifier
doi:10.1214/14-AOAS779

Mathematical Reviews number (MathSciNet)
MR3292503

Zentralblatt MATH identifier
06408784

Keywords
Deduplication; duplicate detection; distribution on partitions; United Nations Truth Commission for El Salvador; entity resolution; homicide records; Hispanic names; human rights; record linkage; string similarity

Citation

Sadinle, Mauricio. Detecting duplicates in a homicide registry using a Bayesian partitioning approach. Ann. Appl. Stat. 8 (2014), no. 4, 2404–2434. doi:10.1214/14-AOAS779. https://projecteuclid.org/euclid.aoas/1419001749




Supplemental materials

  • Supplementary material: Supplement to “Detecting duplicates in a homicide registry using a Bayesian partitioning approach”. We provide a Gibbs sampler for the model presented in Section 3 and a brief discussion of point estimation of the coreference partition, explain how we standardized and compared Hispanic names, and present details on the implementation of the Gibbs sampler for the application in Section 4.