The Annals of Applied Statistics

Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA)

Hai Liu, Shuangge Ma, Richard Kronmal, and Kung-Sik Chan



We analyze the Agatston score of coronary artery calcium (CAC) from the Multi-Ethnic Study of Atherosclerosis (MESA) using a semiparametric zero-inflated modeling approach. The observed CAC scores in this cohort consist of a high frequency of zeroes together with continuously distributed positive values. Both partially constrained and unconstrained models are considered in order to investigate the biological processes underlying CAC development from zero to positive and from small amounts to large amounts. Unlike existing studies, we adopt a model selection procedure based on likelihood cross-validation to identify the optimal model, a choice justified by comparative Monte Carlo studies. A shrinkage version of the cubic regression spline is used to perform model estimation and variable selection simultaneously. Applying the proposed methods to the MESA data, we show that the two biological mechanisms, one influencing the initiation of CAC and the other the magnitude of CAC when it is positive, are better characterized by an unconstrained zero-inflated normal model. Our results differ significantly from those in published studies and may provide further insight into the biological mechanisms underlying CAC development in humans. This highly flexible statistical framework can be applied to zero-inflated data analyses in other areas.
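To make the idea of likelihood cross-validation for a zero-inflated normal model concrete, here is a minimal sketch, not the authors' implementation. It rests on several assumptions that do not come from the abstract: the data are simulated; the penalized cubic regression splines are replaced by plain linear predictors; the positive values are modeled as normal on the log scale; and only the unconstrained two-part model is fitted, with two candidate covariate sets compared. All function names (`fit_logistic`, `fit_zin`, `zin_loglik`, `cv_loglik`) are hypothetical.

```python
import numpy as np

def fit_logistic(X, z, n_iter=50):
    """Newton-Raphson fit of P(z = 1 | X) = 1 / (1 + exp(-X @ beta))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        # A tiny ridge term keeps the Hessian invertible.
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (z - p))
    return beta

def fit_zin(X, y):
    """Unconstrained two-part fit (assumed form, linear predictors only).

    Part 1: logistic regression for P(y > 0).
    Part 2: normal linear regression for log(y) given y > 0.
    """
    pos = y > 0
    beta = fit_logistic(X, pos.astype(float))
    log_y = np.log(y[pos])
    gamma, *_ = np.linalg.lstsq(X[pos], log_y, rcond=None)
    sigma = np.sqrt(np.mean((log_y - X[pos] @ gamma) ** 2))
    return beta, gamma, sigma

def zin_loglik(params, X, y):
    """Log-likelihood of held-out data, with positives on the log scale."""
    beta, gamma, sigma = params
    eta = X @ beta
    pos = y > 0
    # log P(y > 0) = -log(1 + exp(-eta)); log P(y = 0) = -log(1 + exp(eta))
    ll = -np.logaddexp(0.0, -eta[pos]).sum() - np.logaddexp(0.0, eta[~pos]).sum()
    r = (np.log(y[pos]) - X[pos] @ gamma) / sigma
    ll += np.sum(-0.5 * r ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))
    return ll

def cv_loglik(X, y, k=5, seed=0):
    """K-fold cross-validated log-likelihood: fit on k-1 folds, score the rest."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    total = 0.0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        total += zin_loglik(fit_zin(X[train], y[train]), X[test], y[test])
    return total

# Simulated zero-inflated data (illustrative only, not MESA data).
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
p_pos = 1.0 / (1.0 + np.exp(-(0.3 + 1.0 * x)))
is_pos = rng.random(n) < p_pos
y = np.where(is_pos, np.exp(1.0 + 0.8 * x + rng.normal(scale=0.7, size=n)), 0.0)

X_full = np.column_stack([np.ones(n), x])  # intercept + covariate
X_null = np.ones((n, 1))                   # intercept only

cv_full = cv_loglik(X_full, y)
cv_null = cv_loglik(X_null, y)
print("CV log-likelihood, covariate model:     ", cv_full)
print("CV log-likelihood, intercept-only model:", cv_null)
```

The candidate with the larger cross-validated log-likelihood is preferred. Both candidates are scored on the same folds because `cv_loglik` uses the same seed, and the Jacobian term of the log transform is omitted because it is identical across candidates and does not affect the comparison.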

Article information

Ann. Appl. Stat., Volume 6, Number 3 (2012), 1236-1255.

First available in Project Euclid: 31 August 2012

Keywords: cardiovascular disease, coronary artery calcium, likelihood cross-validation, model selection, penalized spline, proportional constraint, shrinkage


Liu, Hai; Ma, Shuangge; Kronmal, Richard; Chan, Kung-Sik. Semiparametric zero-inflated modeling in multi-ethnic study of atherosclerosis (MESA). Ann. Appl. Stat. 6 (2012), no. 3, 1236–1255. doi:10.1214/11-AOAS534.


