Institute of Mathematical Statistics Collections

An ensemble approach to improved prediction from multitype data

Jennifer Clarke and David Seo


Abstract

We have developed a strategy for the analysis of newly available binary data to improve outcome predictions based on existing data (binary or non-binary). Our strategy involves two modeling approaches for the newly available data, one combining binary covariate selection via LASSO with logistic regression and one based on logic trees. The results of these models are then compared to the results of a model based on existing data with the objective of combining model results to achieve the most accurate predictions. The combination of model predictions is aided by the use of support vector machines to identify subspaces of the covariate space in which specific models lead to successful predictions. We demonstrate our approach in the analysis of single nucleotide polymorphism (SNP) data and traditional clinical risk factors for the prediction of coronary heart disease.
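The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. A small linear SVM, trained here by stochastic subgradient descent on the hinge loss, stands in for the support vector machines used in the chapter, and the two base models are hand-written stand-ins for the fitted models. The toy data, the function names, and the choice to train the gate only on cases where exactly one model is correct are all assumptions made for illustration.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Linear SVM via stochastic subgradient descent on the regularised
    hinge loss (Pegasos-style step size). y holds +1/-1 labels.
    A constant 1.0 is appended to each row so the last weight is the bias."""
    rng = random.Random(seed)
    rows = [list(x) + [1.0] for x in X]
    w = [0.0] * len(rows[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(rows)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, rows[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]   # shrink step (regulariser)
            if margin < 1.0:                           # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, rows[i])]
    return w

def gate_score(w, x):
    """Signed score of the gate for covariate vector x."""
    return sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))

def combine(w, x, pred_new, pred_old):
    """Use the new-data model where the gate score is non-negative."""
    return pred_new if gate_score(w, x) >= 0.0 else pred_old

# Hypothetical toy data: (snp1, snp2, clinical_risk) with a binary outcome.
X = [(1, 1, 0.2), (1, 0, 0.1), (0, 1, 0.3), (1, 1, 0.4),
     (0, 0, 0.9), (1, 0, 0.8), (0, 1, 0.7), (0, 0, 0.6)]
y = [1, 0, 0, 1, 1, 1, 1, 1]

def model_new(x):
    """Stand-in for a model fitted to the new binary (SNP) data."""
    return int(x[0] == 1 and x[1] == 1)

def model_old(x):
    """Stand-in for the model based on existing (clinical) data."""
    return int(x[2] > 0.5)

# Assumption: the gate learns only from cases where exactly one model is right.
gate_X, gate_y = [], []
for x, target in zip(X, y):
    ok_new, ok_old = model_new(x) == target, model_old(x) == target
    if ok_new != ok_old:
        gate_X.append(x)
        gate_y.append(1 if ok_new else -1)

w = train_linear_svm(gate_X, gate_y)
combined = [combine(w, x, model_new(x), model_old(x)) for x in X]
print(combined)
```

In this sketch the gate partitions the covariate space into a region where the new-data model is trusted and a region where the existing-data model is trusted. A kernel SVM would allow nonlinear boundaries between those regions.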

Chapter information

Source
Bertrand Clarke and Subhashis Ghosal, eds., Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh (Beachwood, Ohio, USA: Institute of Mathematical Statistics, 2008), 302-317

Dates
First available in Project Euclid: 28 April 2008

Permanent link to this document
https://projecteuclid.org/euclid.imsc/1209398476

Digital Object Identifier
doi:10.1214/074921708000000219

Mathematical Reviews number (MathSciNet)
MR2459232

Subjects
Primary: 62M20: Prediction [See also 60G25]; filtering [See also 60G35, 93E10, 93E11]
         62H30: Classification and discrimination; cluster analysis [See also 68T10, 91C20]
Secondary: 62P10: Applications to biology and medical sciences

Keywords
model ensembles; prediction; single nucleotide polymorphism (SNP); support vector machines; variable selection

Rights
Copyright © 2008, Institute of Mathematical Statistics

Citation

Clarke, Jennifer; Seo, David. An ensemble approach to improved prediction from multitype data. Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, 302-317, Institute of Mathematical Statistics, Beachwood, Ohio, USA, 2008. doi:10.1214/074921708000000219. https://projecteuclid.org/euclid.imsc/1209398476


