The Annals of Applied Statistics

Bayesian hidden Markov tree models for clustering genes with shared evolutionary history

Yang Li, Shaoyang Ning, Sarah E. Calvo, Vamsi K. Mootha, and Jun S. Liu


Determining the functions of poorly characterized genes is crucial for understanding biological processes and studying human diseases. Functionally associated genes are often gained and lost together through evolution. Therefore, identifying co-evolution of genes can predict functional gene-gene associations. We describe here the full statistical model and computational strategies underlying the original algorithm CLustering by Inferred Models of Evolution (CLIME 1.0), which we recently reported (Cell 158 (2014) 213–225). CLIME 1.0 employs a mixture of tree-structured hidden Markov models for the gene evolution process and a Bayesian model-based clustering algorithm to detect gene modules with shared evolutionary histories (termed evolutionary conserved modules, or ECMs). A Dirichlet process prior was adopted for estimating the number of gene clusters, and a Gibbs sampler was developed for posterior sampling. We further developed an extended version, CLIME 1.1, to incorporate uncertainty in the evolutionary tree structure. In simulation studies and benchmarks on real data sets, we show that CLIME 1.0 and CLIME 1.1 outperform traditional methods that use simple metrics (e.g., the Hamming distance or Pearson correlation) to measure co-evolution between pairs of genes.
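For orientation, the simple pairwise metrics named in the abstract can be sketched as follows. This is a minimal illustration of the baseline measures only, not of CLIME itself. The function names, the gene labels, and the toy presence/absence profiles are all hypothetical and are not taken from the article. Each profile is assumed to be a binary vector with one entry per species (1 = gene present, 0 = absent).

```python
from math import sqrt


def hamming_distance(a, b):
    """Number of species in which two presence/absence profiles disagree."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same species")
    return sum(x != y for x, y in zip(a, b))


def pearson_correlation(a, b):
    """Pearson correlation between two profiles; NaN if either is constant."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same species")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / sqrt(var_a * var_b)


# Hypothetical toy profiles over six species (invented for illustration).
profiles = {
    "geneA": [1, 1, 0, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1, 1],
    "geneC": [0, 0, 1, 1, 0, 1],
}

d_ab = hamming_distance(profiles["geneA"], profiles["geneB"])
d_ac = hamming_distance(profiles["geneA"], profiles["geneC"])
r_ab = pearson_correlation(profiles["geneA"], profiles["geneB"])
r_ac = pearson_correlation(profiles["geneA"], profiles["geneC"])
```

Both measures treat species as interchangeable coordinates and make no use of the evolutionary tree relating them, which is the information a tree-structured model is designed to exploit.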

Article information

Ann. Appl. Stat., Volume 13, Number 1 (2019), 606-637.

Received: June 2018
Revised: August 2018
First available in Project Euclid: 10 April 2019

Keywords: Co-evolution; Dirichlet process mixture model; evolutionary history; gene function prediction; tree-structured hidden Markov model


Li, Yang; Ning, Shaoyang; Calvo, Sarah E.; Mootha, Vamsi K.; Liu, Jun S. Bayesian hidden Markov tree models for clustering genes with shared evolutionary history. Ann. Appl. Stat. 13 (2019), no. 1, 606–637. doi:10.1214/18-AOAS1208.


  • Aldous, D. J. (1985). Exchangeability and related topics. In École D’été de Probabilités de Saint-Flour, XIII—1983. Lecture Notes in Math. 1117 1–198. Springer, Berlin.
  • Balsa, E., Marco, R., Perales-Clemente, E., Szklarczyk, R., Calvo, E., Landázuri, M. O. and Enríquez, J. A. (2012). NDUFA4 is a subunit of complex IV of the mammalian electron transport chain. Cell Metab. 16 378–386.
  • Barker, D., Meade, A. and Pagel, M. (2007). Constrained models of evolution lead to improved prediction of functional linkage from correlated gain and loss of genes. Bioinformatics 23 14–20.
  • Barker, D. and Pagel, M. (2005). Predicting functional gene links from phylogenetic-statistical analyses of whole genomes. PLoS Comput. Biol. 1 e3.
  • Bick, A. G., Calvo, S. E. and Mootha, V. K. (2012). Evolutionary diversity of the mitochondrial calcium uniporter. Science 336 886.
  • Chen, R. and Liu, J. S. (1996). Predictive updating methods with application to Bayesian classification. J. Roy. Statist. Soc. Ser. B 58 397–415.
  • Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90 1313–1321.
  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
  • Galperin, M. Y. and Koonin, E. V. (2010). From complete genome sequence to “complete” understanding? Trends Biotechnol. 28 398–406.
  • Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85 398–409.
  • Glazko, G. V. and Mushegian, A. R. (2004). Detection of evolutionarily stable fragments of cellular pathways by hierarchical clustering of phyletic patterns. Genome Biol. 5 R32.
  • Guindon, S., Dufayard, J.-F., Lefort, V., Anisimova, M., Hordijk, W. and Gascuel, O. (2010). New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0. Syst. Biol. 59 307–321.
  • Hamming, R. W. (1950). Error detecting and error correcting codes. Bell Syst. Tech. J. 29 147–160.
  • Hamosh, A., Scott, A. F., Amberger, J. S., Bocchini, C. A. and McKusick, V. A. (2005). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. 33 D514–D517.
  • Horani, A., Ferkol, T. W., Dutcher, S. K. and Brody, S. L. (2016). Genetics and biology of primary ciliary dyskinesia. Paediatr. Respir. Rev. 18 18–24.
  • Hubert, L. and Arabie, P. (1985). Comparing partitions. J. Classification 2 193–218.
  • Inglis, P. N., Boroevich, K. A. and Leroux, M. R. (2006). Piecing together a ciliome. Trends Genet. 22 491–500.
  • Jim, K., Parmar, K., Singh, M. and Tavazoie, S. (2004). A cross-genomic approach for systematic mapping of phenotypic traits to genes. Genome Res. 14 109–115.
  • Kensche, P. R., van Noort, V., Dutilh, B. E. and Huynen, M. A. (2008). Practical and theoretical advances in predicting the function of a protein by its phylogenetic distribution. J. R. Soc. Interface 5 151–170.
  • Li, J. B., Gerdes, J. M., Haycraft, C. J., Fan, Y., Teslovich, T. M., May-Simera, H., Li, H., Blacque, O. E., Li, L., Leitch, C. C. et al. (2004). Comparative genomics identifies a flagellar and basal body proteome that includes the BBS5 human disease gene. Cell 117 541–552.
  • Li, Y., Calvo, S. E., Gutman, R., Liu, J. S. and Mootha, V. K. (2014). Expansion of biological pathways based on evolutionary inference. Cell 158 213–225.
  • Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Amer. Statist. Assoc. 89 958–966.
  • Liu, J. S. (2008). Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics. Springer, New York.
  • Liu, J. S., Wong, W. H. and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika 81 27–40.
  • Mimaki, M., Wang, X., McKenzie, M., Thorburn, D. R. and Ryan, M. T. (2012). Understanding mitochondrial complex I assembly in health and disease. Biochim. Biophys. Acta, Bioenerg. 1817 851–862.
  • Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Statist. 9 249–265.
  • Ogilvie, I., Kennaway, N. G. and Shoubridge, E. A. (2005). A molecular chaperone for mitochondrial complex I assembly is mutated in a progressive encephalopathy. J. Clin. Invest. 115 2784–2792.
  • Pagel, M. and Meade, A. (2007). BayesTraits. Computer program and documentation. Available at
  • Pagliarini, D. J., Calvo, S. E., Chang, B., Sheth, S. A., Vafai, S. B., Ong, S. E., Walford, G. A. et al. (2008). A mitochondrial protein compendium elucidates complex I disease biology. Cell 134 112–123.
  • Pellegrini, M., Marcotte, E. M., Thompson, M. J., Eisenberg, D. and Yeates, T. O. (1999). Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 96 4285–4288.
  • Pitman, J. (1996). Some developments of the Blackwell–MacQueen urn scheme. In Statistics, Probability and Game Theory. Institute of Mathematical Statistics Lecture Notes—Monograph Series 30 245–267. IMS, Hayward, CA.
  • Ronquist, F. and Huelsenbeck, J. P. (2003). MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19 1572–1574.
  • Tabach, Y., Billi, A. C., Hayes, G. D., Newman, M. A., Zuk, O., Gabel, H., Kamath, R., Yacoby, K., Chapman, B., Garcia, S. M. et al. (2013). Identification of small RNA pathway genes using patterns of phylogenetic conservation and divergence. Nature 493 694–698.
  • Trachana, K., Larsson, T. A., Powell, S., Chen, W.-H., Doerks, T., Muller, J. and Bork, P. (2011). Orthology prediction methods: A quality assessment using curated protein families. BioEssays 33 769–780.
  • Vert, J.-P. (2002). A tree kernel to analyse phylogenetic profiles. Bioinformatics 18 S276–S284.
  • Von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P. and Snel, B. (2003). STRING: A database of predicted functional associations between proteins. Nucleic Acids Res. 31 258–261.
  • Zhou, Y., Wang, R., Li, L., Xia, X. and Sun, Z. (2006). Inferring functional linkages between proteins from evolutionary scenarios. J. Mol. Biol. 359 1150–1159.