The Annals of Applied Statistics

Clustering the prevalence of pediatric chronic conditions in the United States using distributed computing

Yuchen Zheng and Nicoleta Serban



Abstract

This paper presents an approach to clustering the prevalence of chronic conditions among children with public insurance in the United States. The data consist of community-level prevalence estimates for 25 pediatric chronic conditions. We employ a spatial clustering algorithm to identify clusters of communities with similar chronic condition prevalences. The primary challenge is the computational effort needed to estimate the spatial clustering for all communities in the U.S. To address this challenge, we develop a distributed computing approach to spatial clustering. Overall, we find that the burden of chronic conditions tends to be similar across rural communities but differs widely across urban communities. This finding suggests that similar interventions for managing chronic conditions may suit rural communities, whereas urban areas call for targeted interventions.

Article information

Ann. Appl. Stat., Volume 12, Number 2 (2018), 915-939.

Received: November 2017
Revised: April 2018
First available in Project Euclid: 28 July 2018


Keywords: Distributed computing; Medicaid; pediatric chronic conditions; spatial clustering


Zheng, Yuchen; Serban, Nicoleta. Clustering the prevalence of pediatric chronic conditions in the United States using distributed computing. Ann. Appl. Stat. 12 (2018), no. 2, 915-939. doi:10.1214/18-AOAS1173.




Supplemental materials

  • Supplement to “Clustering the prevalence of pediatric chronic conditions in the United States using distributed computing” (DOI:10.1214/18-AOAS1173SUPP). The Supplementary Materials contain four sections. Supplementary Material A describes the approach for estimating census tract prevalence of chronic conditions from the Medicaid Analytic eXtract (MAX) claims data. Supplementary Material B provides further details on the selection of the number of clusters. Supplementary Material C presents additional mosaic maps showing the composition of each cluster by state and urbanicity for all the states in the analysis. Supplementary Material D shares the implementation of the distributed computing approach for spatial clustering, along with a README file explaining how to use the software.