The Annals of Applied Statistics

Latent demographic profile estimation in hard-to-reach groups

Tyler H. McCormick and Tian Zheng

Full-text: Open access


The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (e.g., the homeless), camouflaged by stigma (e.g., individuals with HIV/AIDS), or both (e.g., commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models that leverage social network structure to estimate demographic characteristics of these subpopulations using aggregated relational data (ARD), that is, responses to questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership. We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD. We develop two estimation techniques. First, we present a Markov chain Monte Carlo algorithm for existing data or for cases where the full posterior distribution is of interest. Second, for cases where new data can be collected, we offer guidelines and, based on these guidelines, a simple estimate motivated by a missing data approach. Using data from McCarty et al. [Human Organization 60 (2001) 28–39], we estimate the age and gender profiles of six hard-to-reach groups, including individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.
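
To make the idea of recovering a demographic profile from ARD concrete, the following is a minimal toy sketch and not the model or estimator of the paper. It assumes a simplified mean structure in which a respondent's expected answer equals their network size times a mixing-weighted sum of group membership rates. The function names, the two-category setup, and every number are hypothetical, and the example uses noise-free expected counts rather than survey data.

```python
# Toy illustration only: all values are made up, and the mean structure
# below is an assumed simplification, not the paper's fitted model.

def expected_ard(degree, mixing_row, membership_rate):
    """Expected answer to "How many X's do you know?" for one respondent.

    degree          -- respondent's personal network size (hypothetical)
    mixing_row      -- share of the respondent's contacts in each category
    membership_rate -- share of each category that belongs to group X
    """
    return degree * sum(m * h for m, h in zip(mixing_row, membership_rate))


def recover_rates_2x2(mixing, scaled_means):
    """Solve mixing * h = scaled_means for h by Cramer's rule (2 categories)."""
    (a, b), (c, d) = mixing
    y0, y1 = scaled_means
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; rates not identifiable")
    return ((y0 * d - b * y1) / det, (a * y1 - c * y0) / det)


# Two demographic categories, e.g. women and men (hypothetical values).
mixing = ((0.7, 0.3),   # female respondents: 70% of contacts are women
          (0.4, 0.6))   # male respondents: 40% of contacts are women
true_rates = (0.002, 0.006)  # share of each category in the hidden group
degree = 500                 # same network size for everyone, for simplicity

# Expected ARD count for each respondent type, then divide out the degree.
means = [expected_ard(degree, row, true_rates) for row in mixing]
scaled = [m / degree for m in means]
rates = recover_rates_2x2(mixing, scaled)

# Turn membership rates into a profile (share of the group in each category),
# assuming the two categories are equally large in the population.
profile = tuple(r / sum(rates) for r in rates)
print(rates, profile)
```

In this noise-free setting the membership rates are recovered up to floating-point error. With real survey responses the counts are noisy and overdispersed, which is one reason a hierarchical model is attractive.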

Article information

Ann. Appl. Stat., Volume 6, Number 4 (2012), 1795-1813.

First available in Project Euclid: 27 December 2012

Keywords: Aggregated relational data; hard-to-reach populations; hierarchical model; social network; survey design


McCormick, Tyler H.; Zheng, Tian. Latent demographic profile estimation in hard-to-reach groups. Ann. Appl. Stat. 6 (2012), no. 4, 1795–1813. doi:10.1214/12-AOAS569.


  • Bernard, H. R., Johnsen, E. C., Killworth, P. D. and Robinson, S. (1991). Estimating the size of an average personal network and of an event subpopulation: Some empirical results. Social Science Research 20 109–121.
  • Centers for Disease Control (2011). WISQARS leading causes of death reports.
  • Centers for Disease Control and Prevention (2011). Centers for Disease Control and Prevention, National Center for Injury Prevention and Control. Web-based Injury Statistics Query and Reporting System (WISQARS).
  • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 1–38.
  • DiPrete, T. A., Gelman, A., McCormick, T. H., Teitler, J. and Zheng, T. (2011). Segregation in social networks based on acquaintanceship and trust. American Journal of Sociology 116 1234–1283.
  • Goel, S. and Salganik, M. J. (2009). Respondent-driven sampling as Markov chain Monte Carlo. Stat. Med. 28 2202–2229.
  • Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems 44 174–199.
  • Heckathorn, D. D. (2002). Respondent-driven sampling II: Deriving valid population estimates from chain-referral samples of hidden populations. Social Problems 49 11–34.
  • Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. J. Amer. Statist. Assoc. 100 286–295.
  • Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 97 1090–1098.
  • Killworth, P. D., Johnsen, E. C., Bernard, H. R., Shelley, G. A. and McCarty, C. (1990). Estimating the size of personal networks. Social Networks 12 289–312.
  • Killworth, P. D., McCarty, C., Bernard, H. R., Shelley, G. A. and Johnsen, E. C. (1998a). Estimation of seroprevalence, rape, and homelessness in the U.S. using a social network approach. Evaluation Review 22 289–308.
  • Killworth, P. D., Johnsen, E. C., McCarty, C., Shelley, G. A. and Bernard, H. R. (1998b). A social network approach to estimating seroprevalence in the United States. Social Networks 20 23–50.
  • Killworth, P. D., McCarty, C., Bernard, H. R., Johnsen, E. C., Domini, J. and Shelley, G. A. (2003). Two interpretations of reports of knowledge of subpopulation sizes. Social Networks 25 141–160.
  • Lavallée, P. (2007). Indirect Sampling. Springer, New York.
  • Lawson, C. L. and Hanson, R. J. (1974). Solving Least Squares Problems. Prentice Hall International, Englewood Cliffs, NJ.
  • Lohr, S. L. (1999). Sampling: Design and Analysis. Duxbury Press, Belmont, CA.
  • McCarty, C., Killworth, P. D., Bernard, H. R., Johnsen, E. and Shelley, G. A. (2001). Comparing two methods for estimating network size. Human Organization 60 28–39.
  • McCormick, T. H., Salganik, M. J. and Zheng, T. (2010). How many people do you know? Efficiently estimating personal network size. J. Amer. Statist. Assoc. 105 59–70.
  • McCormick, T. H. and Zheng, T. (2007). Adjusting for recall bias in “How Many X’s Do You Know?” surveys. In Proceedings of the Joint Statistical Meetings. Salt Lake City, UT.
  • McCormick, T. H., Moussa, A., Ruf, J., DiPrete, T. A., Gelman, A., Teitler, J. and Zheng, T. (2009). Comparing two methods for predicting opinions using social structure. In Proceedings of the Joint Statistical Meetings. Washington, DC.
  • Federal Bureau of Investigation (1999). Crime in the United States.
  • Office of Advocacy, U.S. Small Business Administration (1997). Characteristics of small business employees and owners.
  • Salganik, M. J., Mello, M. B., Abdo, A. H., Bertoni, N., Fazito, D. and Bastos, F. I. (2011). The game of contacts: Estimating the social visibility of groups. Social Networks 33 70–78.
  • Shelley, G., Bernard, H., Killworth, P., Johnsen, E. and McCarty, C. (1995). Who knows your HIV status? What HIV+ patients and their network members know about each other. Social Networks 17 189–217.
  • UNAIDS (2003). Estimating the size of populations at risk for HIV 03.36E. UNAIDS, Geneva.
  • Zheng, T., Salganik, M. J. and Gelman, A. (2006). How many people do you know in prison?: Using overdispersion in count data to estimate social structure in networks. J. Amer. Statist. Assoc. 101 409–423.