The Annals of Applied Statistics

Multi-way blockmodels for analyzing coordinated high-dimensional responses

Edoardo M. Airoldi, Xiaopei Wang, and Xiaodong Lin

Abstract

We consider the problem of quantifying temporal coordination between multiple high-dimensional responses. We introduce a family of multi-way stochastic blockmodels suited for this problem, which avoids preprocessing steps, such as binning and thresholding, that are commonly adopted for this type of data in biology. We develop two inference procedures, one based on collapsed Gibbs sampling and one on variational methods. We provide a thorough evaluation of the proposed methods on simulated data, in terms of membership and blockmodel estimation, out-of-sample prediction and run time. We also quantify the effects of censoring procedures such as binning and thresholding on the estimation tasks. We use these models to carry out an empirical analysis of the functional mechanisms driving the coordination between gene expression and metabolite concentrations during carbon and nitrogen starvation in S. cerevisiae.

Article information

Source
Ann. Appl. Stat., Volume 7, Number 4 (2013), 2431-2457.

Dates
First available in Project Euclid: 23 December 2013

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1387823326

Digital Object Identifier
doi:10.1214/13-AOAS643

Mathematical Reviews number (MathSciNet)
MR3161729

Zentralblatt MATH identifier
1283.62215

Keywords
High dimensional data; variational inference; molecular biology; yeast

Citation

Airoldi, Edoardo M.; Wang, Xiaopei; Lin, Xiaodong. Multi-way blockmodels for analyzing coordinated high-dimensional responses. Ann. Appl. Stat. 7 (2013), no. 4, 2431–2457. doi:10.1214/13-AOAS643. https://projecteuclid.org/euclid.aoas/1387823326


Supplemental materials

  • Supplementary material: Supplement to “Multi-way blockmodels for analyzing coordinated high-dimensional responses”. DOI:10.1214/13-AOAS643SUPP.