The Annals of Applied Statistics

Modeling association in microbial communities with clique loglinear models

Adrian Dobra, Camilo Valdes, Dragana Ajdic, Bertrand Clarke, and Jennifer Clarke

There is a growing awareness of the important roles that microbial communities play in complex biological processes. Modern investigation of these communities often uses next generation sequencing of metagenomic samples to determine community composition. We propose a statistical technique based on clique loglinear models and Bayes model averaging to identify, at various taxonomic levels, microbial components of a metagenomic sample that have significant associations. We describe the model class, a stochastic search technique for model selection, and the calculation of estimates of posterior probabilities of interest. We demonstrate our approach using data from the Human Microbiome Project and from a study of the skin microbiome in chronic wound healing. Our technique also identifies significant dependencies among microbial components as evidence of possible microbial syntrophy.
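
As a rough illustration of how model averaging can turn the output of a stochastic model search into posterior probabilities of association, the sketch below computes edge inclusion probabilities from a list of visited graphical models. It is a generic Bayes model averaging calculation under stated assumptions, not the authors' implementation: the function name, the taxon labels, and the toy log scores are all hypothetical, and each model is assumed to come with an unnormalized log posterior score.

```python
import math


def edge_inclusion_probabilities(models):
    """Average edge indicators over models, weighted by posterior probability.

    `models` is a list of (edges, log_score) pairs, where `edges` is a set of
    (taxon, taxon) tuples and `log_score` is an unnormalized log posterior
    score for that model. Both the representation and the scores are
    assumptions made for this illustration.
    """
    # Subtract the largest log score before exponentiating for stability.
    max_log = max(log_score for _, log_score in models)
    weights = [math.exp(log_score - max_log) for _, log_score in models]
    total = sum(weights)

    probabilities = {}
    for (edges, _), weight in zip(models, weights):
        for edge in edges:
            probabilities[edge] = probabilities.get(edge, 0.0) + weight / total
    return probabilities


# Hypothetical toy input: three visited models over taxa A, B, C whose
# normalized posterior weights work out to 0.6, 0.3, and 0.1.
visited = [
    ({("A", "B"), ("B", "C")}, math.log(6.0)),
    ({("A", "B")}, math.log(3.0)),
    (set(), math.log(1.0)),
]

inclusion = edge_inclusion_probabilities(visited)
for edge, prob in sorted(inclusion.items()):
    print(edge, round(prob, 3))
```

In this toy input the association between A and B appears in the two highest-weight models, so its averaged inclusion probability exceeds that of the association between B and C, which appears in only one.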

Article information

Ann. Appl. Stat., Volume 13, Number 2 (2019), 931-957.

Received: January 2018
Revised: November 2018
First available in Project Euclid: 17 June 2019

Keywords: contingency tables; graphical models; model selection; microbiome; next generation sequencing


Dobra, Adrian; Valdes, Camilo; Ajdic, Dragana; Clarke, Bertrand; Clarke, Jennifer. Modeling association in microbial communities with clique loglinear models. Ann. Appl. Stat. 13 (2019), no. 2, 931–957. doi:10.1214/18-AOAS1229.




Supplemental materials

  • Additional proofs, maps, figures and tables. In this online supplementary material, we describe the data that were used. We also present the computational experiments performed, the details of the simulations, and further details on the software developed for this article.