The Annals of Applied Statistics

Categorical data fusion using auxiliary information

Bailey K. Fosdick, Maria DeYoreo, and Jerome P. Reiter

In data fusion, analysts seek to combine information from two databases comprised of disjoint sets of individuals, in which some variables appear in both databases and other variables appear in only one database. Most data fusion techniques rely on variants of conditional independence assumptions. When inappropriate, these assumptions can result in unreliable inferences. We propose a data fusion technique that allows analysts to easily incorporate auxiliary information on the dependence structure of variables not observed jointly; we refer to this auxiliary information as glue. With this technique, we fuse two marketing surveys from the book publisher HarperCollins using glue from the online, rapid-response polling company CivicScience. The fused data enable estimation of associations between people’s preferences for authors and for learning about new books. The analysis also serves as a case study on the potential for using online surveys to aid data fusion.
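
To make the conditional independence assumption mentioned above concrete, the following minimal Python sketch fuses two margins that share a variable X, assuming A (seen only in the first database) and B (seen only in the second) are independent given X. The probabilities, variable names, and helper functions are hypothetical and chosen only for illustration; this is not the authors' method and it does not use glue.

```python
from itertools import product

# Hypothetical joint distribution p_true[(x, a, b)] over three binary variables.
# X is observed in both databases, A only in the first, B only in the second.
# In practice this joint table is never observed; it is used here only to
# measure how far the fused table is from the truth.
p_true = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.15, (1, 1, 0): 0.15, (1, 1, 1): 0.10,
}


def marginal(p, keep):
    """Sum a joint distribution down to the coordinates listed in `keep`."""
    out = {}
    for cell, prob in p.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out


def fuse_conditional_independence(p_xa, p_xb):
    """Build p(x, a, b) = p(x, a) * p(x, b) / p(x), i.e. A and B independent given X."""
    p_x = marginal(p_xa, keep=(0,))
    return {
        (x, a, b): p_xa[(x, a)] * p_xb[(x, b)] / p_x[(x,)]
        for x, a, b in product((0, 1), repeat=3)
    }


# The two margins an analyst could actually estimate from the separate databases.
p_xa = marginal(p_true, keep=(0, 1))
p_xb = marginal(p_true, keep=(0, 2))

p_fused = fuse_conditional_independence(p_xa, p_xb)

# Total variation distance between the true and the fused joint distributions.
tv = 0.5 * sum(abs(p_true[cell] - p_fused[cell]) for cell in p_true)

for cell in sorted(p_true):
    print(cell, "true:", round(p_true[cell], 3), "fused:", round(p_fused[cell], 3))
print("total variation distance:", round(tv, 3))
```

In this toy setup the fused table reproduces both observed margins, yet it assigns the same probability to every cell, so the association between A and B within each level of X is lost. Auxiliary information on the joint behavior of A and B is the kind of input that could correct this.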

Article information

Ann. Appl. Stat., Volume 10, Number 4 (2016), 1907-1929.

Received: June 2015
Revised: December 2015
First available in Project Euclid: 5 January 2017

Keywords: imputation, integration, latent class, matching


Fosdick, Bailey K.; DeYoreo, Maria; Reiter, Jerome P. Categorical data fusion using auxiliary information. Ann. Appl. Stat. 10 (2016), no. 4, 1907–1929. doi:10.1214/16-AOAS925.

Supplemental materials

  • Model checking and MCMC diagnostics. Goodness-of-fit checks of the model for the HarperCollins and CivicScience data, and MCMC convergence diagnostics.