The Annals of Applied Statistics

Clustering South African households based on their asset status using latent variable models

Damien McParland, Isobel Claire Gormley, Tyler H. McCormick, Samuel J. Clark, Chodziwadziwa Whiteson Kabudula, and Mark A. Collinson



The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status.

A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. In the case of modeling binary or ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure—this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
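The generative structure described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `simulate_households`, the parameter values, the fixed cutpoints, the unit-variance Gaussian noise, and the convention that a nominal item with K levels uses K-1 latent dimensions (level 1 when all are negative, otherwise the level of the largest) are all illustrative choices that the abstract does not specify.

```python
import numpy as np

def simulate_households(n, pi, mu, Lam, cutpoints, nominal_dims, rng):
    """Simulate mixed binary/ordinal/nominal responses from a mixture of
    latent variable models (illustrative assumptions only).

    pi           : (G,) mixing proportions
    mu           : (G, D) cluster-specific intercepts, one per latent dimension
    Lam          : (G, D, q) cluster-specific loadings on a q-dim latent trait
    cutpoints    : list of sorted 1-D arrays, one per binary/ordinal item
    nominal_dims : list of ints, latent dimensions (levels - 1) per nominal item
    """
    G, D, q = Lam.shape
    assert D == len(cutpoints) + sum(nominal_dims)

    labels = rng.choice(G, size=n, p=pi)        # cluster membership
    theta = rng.standard_normal((n, q))         # latent trait per household
    eps = rng.standard_normal((n, D))           # unit-variance noise (assumed)

    # Latent continuous data: z_i = mu_g + Lam_g theta_i + eps_i
    z = mu[labels] + np.einsum("ndq,nq->nd", Lam[labels], theta) + eps

    cols, d = [], 0
    for cuts in cutpoints:
        # Binary/ordinal item: level = 1 + number of cutpoints below z
        cols.append(np.searchsorted(cuts, z[:, d]) + 1)
        d += 1
    for k in nominal_dims:
        block = z[:, d:d + k]
        # Nominal item: level 1 if every latent dimension is negative,
        # otherwise the level associated with the largest dimension
        level = np.where(block.max(axis=1) < 0, 1, block.argmax(axis=1) + 2)
        cols.append(level)
        d += k
    return labels, np.column_stack(cols)

# Hypothetical setup: 2 clusters, 1 binary item, 1 three-level ordinal item,
# 1 three-level nominal item (so D = 1 + 1 + 2 = 4), and q = 1 latent trait.
rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
mu = np.array([[-1.0, -1.0, -1.0, -1.0],
               [ 1.0,  1.0,  1.0,  0.5]])
Lam = np.full((2, 4, 1), 0.5)
cutpoints = [np.array([0.0]), np.array([-0.5, 0.5])]
labels, Y = simulate_households(200, pi, mu, Lam, cutpoints, [2], rng)
```

Each row of `Y` is one simulated household and each column one survey item. Clustering under such a model amounts to recovering `labels` from `Y` alone.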

The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight into the different socio-economic strata within the Agincourt region.

Article information

Ann. Appl. Stat., Volume 8, Number 2 (2014), 747-776.

First available in Project Euclid: 1 July 2014


Keywords: clustering; mixed data; item response theory; Metropolis-within-Gibbs


McParland, Damien; Gormley, Isobel Claire; McCormick, Tyler H.; Clark, Samuel J.; Kabudula, Chodziwadziwa Whiteson; Collinson, Mark A. Clustering South African households based on their asset status using latent variable models. Ann. Appl. Stat. 8 (2014), no. 2, 747–776. doi:10.1214/14-AOAS726.



