The Annals of Applied Statistics

Integrative Model-based clustering of microarray methylation and expression data

Matthias Kormaksson, James G. Booth, Maria E. Figueroa, and Ari Melnick

Full-text: Open access


In many fields, researchers study large and complex biological processes; in genetics, two important examples are gene expression and DNA methylation. A key problem is to identify aberrant patterns in these processes and to discover biologically distinct groups. In this article we develop a model-based method for clustering such data. The method rests on constructing a likelihood for any given partition of the subjects. We introduce cluster-specific latent indicators that, together with some standard assumptions, impose a specific mixture distribution on each cluster. Estimation is carried out using the EM algorithm. The methods extend naturally to multiple data types of a similar nature, which allows an integrated analysis over multiple data platforms and results in higher discriminating power.
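
To make the EM step concrete, here is a minimal sketch of EM estimation for a generic two-component univariate Gaussian mixture. It is not the authors' model: the cluster-specific latent indicators and the partition likelihood described in the abstract are not reproduced here. The function names (`log_normal_pdf`, `em_two_component`), the initialisation at the data extremes, and the toy values are all assumptions made for illustration.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def em_two_component(values, n_iter=50):
    """EM for a two-component Gaussian mixture (illustrative sketch only)."""
    n = len(values)
    # Assumed initialisation: means at the data extremes, common pooled sd.
    mean = [min(values), max(values)]
    centre = sum(values) / n
    pooled_sd = math.sqrt(sum((v - centre) ** 2 for v in values) / n)
    sd = [pooled_sd, pooled_sd]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability that each value belongs to each
        # component, computed in log space to avoid underflow.
        resp = []
        for v in values:
            logs = [math.log(weight[k]) + log_normal_pdf(v, mean[k], sd[k])
                    for k in range(2)]
            top = max(logs)
            unnorm = [math.exp(l - top) for l in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: responsibility-weighted updates of weights, means and sds.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)  # guard an empty component
            weight[k] = nk / n
            mean[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - mean[k]) ** 2
                      for r, v in zip(resp, values)) / nk
            sd[k] = max(math.sqrt(var), 1e-3)  # floor keeps the sd positive
    return weight, mean, sd, resp


if __name__ == "__main__":
    # Hypothetical toy values, e.g. methylation-like proportions for one probe.
    toy = [0.10, 0.20, 0.15, 0.05, 0.90, 0.95, 0.85, 1.00]
    weight, mean, sd, resp = em_two_component(toy)
    print("weights:", weight)
    print("means:", mean)
```

The responsibilities returned in `resp` play the role of soft latent indicators: each value is assigned to the component with the larger posterior probability. The paper's method differs in that it scores whole partitions of subjects rather than individual observations.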

Article information

Ann. Appl. Stat., Volume 6, Number 3 (2012), 1327-1347.

First available in Project Euclid: 31 August 2012

Keywords

Integrative model-based clustering; microarray data; mixture models; EM algorithm; methylation; expression; AML


Kormaksson, Matthias; Booth, James G.; Figueroa, Maria E.; Melnick, Ari. Integrative Model-based clustering of microarray methylation and expression data. Ann. Appl. Stat. 6 (2012), no. 3, 1327–1347. doi:10.1214/11-AOAS533.


Supplemental materials

  • Supplementary material: Simulation and details of EM algorithms. We perform a simulation study to assess the performance of our clustering algorithm in the presence of sparse correlation structure. We also derive the steps involved in maximizing the likelihoods of the several models presented in this paper.