The Annals of Applied Statistics

A Bayesian nonparametric mixture model for selecting genes and gene subnetworks

Yize Zhao, Jian Kang, and Tianwei Yu

Selecting informative features from tens of thousands of measured features is a major challenge in high-throughput data analysis. Recently, several parametric/regression models have been developed that use gene network information to select genes or pathways strongly associated with a clinical/biological outcome. As an alternative, we propose a nonparametric Bayesian model for gene selection that incorporates network information. In addition to identifying genes that have a strong association with a clinical outcome, our model can select genes with particular expression behavior, a setting in which regression models are not directly applicable. We show that our proposed model is equivalent to an infinite mixture model, for which we develop a posterior computation algorithm based on Markov chain Monte Carlo (MCMC) methods. We also propose two fast computing algorithms that approximate the posterior simulation with good accuracy at relatively low computational cost. We illustrate our methods with simulation studies and an analysis of the Spellman yeast cell cycle microarray data.
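To make the phrase "infinite mixture model" concrete, the following is a minimal sketch of the standard truncated stick-breaking construction of a Dirichlet process mixture. It is not the authors' algorithm: it omits the Ising prior and the network information entirely, and the function names, the truncation level, the concentration parameter `alpha`, and the normal base measure are all illustrative assumptions.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process (illustrative)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component absorbs the leftover mass
    return weights


def sample_mixture(n, alpha, truncation, seed=0):
    """Draw n observations from a truncated DP mixture of unit-variance normals."""
    rng = random.Random(seed)
    weights = stick_breaking_weights(alpha, truncation, rng)
    # Component means from a N(0, 3^2) base measure (an assumption for this sketch).
    means = [rng.gauss(0.0, 3.0) for _ in range(truncation)]
    labels = rng.choices(range(truncation), weights=weights, k=n)
    data = [rng.gauss(means[k], 1.0) for k in labels]
    return weights, labels, data


weights, labels, data = sample_mixture(n=200, alpha=1.0, truncation=25)
print(f"{len(set(labels))} of {len(weights)} components are occupied")
```

Although the model nominally has infinitely many components, the weights decay quickly, so only a few components are typically occupied in a finite sample. This is what makes truncation-based approximations workable in practice.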

Article information

Ann. Appl. Stat., Volume 8, Number 2 (2014), 999-1021.

First available in Project Euclid: 1 July 2014

Keywords

Dirichlet process mixture; Ising priors; density estimation; feature selection; microarray data


Zhao, Yize; Kang, Jian; Yu, Tianwei. A Bayesian nonparametric mixture model for selecting genes and gene subnetworks. Ann. Appl. Stat. 8 (2014), no. 2, 999–1021. doi:10.1214/14-AOAS719.


  • Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist. 2 1152–1174.
  • Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. and Sherlock, G. (2000). Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25 25–29.
  • Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science 286 509–512.
  • Breeze, E., Harrison, E., McHattie, S., Hughes, L., Hickman, R., Hill, C., Kiddle, S., Kim, Y., Penfold, C., Jenkins, D. et al. (2011). High-resolution temporal profiling of transcripts during Arabidopsis leaf senescence reveals a distinct chronology of processes and regulation. The Plant Cell Online 23 873–894.
  • Cerami, E., Gross, B., Demir, E., Rodchenkov, I., Babur, Ö., Anwar, N., Schultz, N., Bader, G. and Sander, C. (2011). Pathway commons, a web resource for biological pathway data. Nucleic Acids Res. 39 D685–D690.
  • Chenevert, J., Valtz, N. and Herskowitz, I. (1994). Identification of genes required for normal pheromone-induced cell polarization in Saccharomyces cerevisiae. Genetics 136 1287–1297.
  • Cherry, J. M., Hong, E. L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E. T., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hirschman, J. E., Hitz, B. C., Karra, K., Krieger, C. J., Miyasato, S. R., Nash, R. S., Park, J., Skrzypek, M. S., Simison, M., Weng, S. and Wong, E. D. (2012). Saccharomyces Genome Database: The genomics resource of budding yeast. Nucleic Acids Res. 40 D700–D705.
  • Do, K.-A., Müller, P. and Tang, F. (2005). A Bayesian mixture model for differential gene expression. J. Roy. Statist. Soc. Ser. C 54 627–644.
  • Dunson, D. B. (2010). Nonparametric Bayes applications to biostatistics. In Bayesian Nonparametrics (N. L. Hjort, C. Holmes, P. Müller and S. G. Walker, eds.) 223–273. Cambridge Univ. Press, Cambridge.
  • Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96–104.
  • Efron, B. (2010). Correlated $z$-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. 105 1042–1055.
  • Einmahl, J. H. J. and Van Keilegom, I. (2008). Tests for independence in nonparametric regression. Statist. Sinica 18 601–615.
  • Escobar, M. D. (1994). Estimating normal means with a Dirichlet process prior. J. Amer. Statist. Assoc. 89 268–277.
  • Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. 90 577–588.
  • Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. 96 161–173.
  • Ishwaran, H. and James, L. F. (2002). Approximate Dirichlet process computing in finite normal mixtures: Smoothing and prior information. J. Comput. Graph. Statist. 11 508–532.
  • Jacob, L., Neuvial, P. and Dudoit, S. (2012). More power via graph-structured tests for differential expression of gene networks. Ann. Appl. Stat. 6 561–600.
  • Leng, X. and Müller, H.-G. (2006). Classification using functional data analysis for temporal gene expression data. Bioinformatics 22 68–76.
  • Li, C. and Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 24 1175–1182.
  • Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J. Amer. Statist. Assoc. 105 1202–1214.
  • Ma, S., Shi, M., Li, Y., Yi, D. and Shia, B.-C. (2010). Incorporating gene co-expression network in identification of cancer prognosis markers. BMC Bioinformatics 11 271.
  • Müller, P. and Quintana, F. A. (2004). Nonparametric Bayesian data analysis. Statist. Sci. 19 95–110.
  • Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Statist. 9 249–265.
  • Pan, W., Xie, B. and Shen, X. (2010). Incorporating predictor network in penalized regression with application to microarray data. Biometrics 66 474–484.
  • Peng, H., Long, F. and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Analysis and Machine Intelligence 27 1226–1238.
  • Qu, L., Nettleton, D. and Dekkers, J. C. M. (2012). A hierarchical semiparametric model for incorporating intergene information for analysis of genomic data. Biometrics 68 1168–1177.
  • Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M. and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science 334 1518–1524.
  • Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9 3273–3297.
  • Stingo, F. C., Chen, Y. A., Tadesse, M. G. and Vannucci, M. (2011). Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes. Ann. Appl. Stat. 5 1978–2002.
  • Strunnikov, A. V. and Jessberger, R. (1999). Structural maintenance of chromosomes (SMC) proteins: Conserved molecular properties for multiple biological functions. Eur. J. Biochem. 263 6–13.
  • Tang, Y., Ghosal, S. and Roy, A. (2007). Nonparametric Bayesian estimation of positive false discovery rates. Biometrics 63 1126–1134, 1312.
  • Wang, L. and Dunson, D. B. (2011). Fast Bayesian inference in Dirichlet process mixture models. J. Comput. Graph. Statist. 20 196–216. Supplementary material available online.
  • Wei, Z. and Li, H. (2007). A Markov random field model for network-based analysis of genomic data. Bioinformatics 23 1537–1544.
  • Wei, Z. and Li, H. (2008). A hidden spatial-temporal Markov random field model for network-based analysis of time course gene expression data. Ann. Appl. Stat. 2 408–429.
  • Wei, P. and Pan, W. (2010). Network-based genomic discovery: Application and comparison of Markov random-field models. J. R. Stat. Soc. Ser. C. Appl. Stat. 59 105–125.
  • Wei, P. and Pan, W. (2012). Bayesian joint modeling of multiple gene networks and diverse genomic data to identify target genes of a transcription factor. Ann. Appl. Stat. 6 334–355.
  • Wichert, S., Fokianos, K. and Strimmer, K. (2004). Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 20 5–20.
  • Xenarios, I., Salwínski, L., Duan, X. J., Higney, P., Kim, S.-M. and Eisenberg, D. (2002). DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30 303–305.
  • Yu, T. (2010). An exploratory data analysis method to reveal modular latent structures in high-throughput data. BMC Bioinformatics 11 440.
  • Zhao, Y., Kang, J. and Yu, T. (2014). Supplement to “A Bayesian nonparametric mixture model for selecting genes and gene subnetworks.” DOI:10.1214/14-AOAS719SUPP.
  • Zhou, B., Xu, W., Herndon, D., Tompkins, R., Davis, R., Xiao, W., Wong, W., Toner, M., Warren, H., Schoenfeld, D. et al. (2010). Analysis of factorial time-course microarrays with application to a clinical study of burn injury. Proc. Natl. Acad. Sci. USA 107 9923.

Supplemental materials

  • Supplementary material: Supplement to “A Bayesian nonparametric mixture model for selecting genes and gene subnetworks”. In this online supplemental article we provide (A) derivations of the proposed methods, (B) details of the main algorithms for posterior computations, (C) details of posterior inference for hyperparameters, (D) additional simulation studies and (E) sensitivity analysis.