The Annals of Applied Statistics

Structured subcomposition selection in regression and its application to microbiome data analysis

Tao Wang and Hongyu Zhao


Compositional data arise naturally in many practical problems and the analysis of such data presents many statistical challenges, especially in high dimensions. In this article, we consider the problem of subcomposition selection in regression with compositional covariates, where the relationships among the covariates can be represented by a tree with leaf nodes corresponding to covariates. Assuming that the tree structure is available as prior knowledge, we adopt a symmetric version of the linear log contrast model, and propose a tree-guided regularization method for this structured subcomposition selection. Our method is based on a novel penalty function that incorporates the tree structure information node-by-node, encouraging the selection of subcompositions at subtree levels. We show that this optimization problem can be formulated as a generalized lasso problem, the solution of which can be computed efficiently using existing algorithms. An application to a human gut microbiome study and simulations are presented to compare the performance of the proposed method with an $l_{1}$ regularization method where the tree structure information is not utilized.
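To illustrate why a log contrast model lends itself to subcomposition selection, here is a minimal Python sketch. It assumes the symmetric form is the usual log contrast predictor, $\sum_j \beta_j \log x_j$ with coefficients constrained to sum to zero. The 4-part composition, the coefficient values, and the helper names `log_contrast` and `closure` are hypothetical, and the tree-guided penalty proposed in the article is not reproduced here.

```python
import math

def log_contrast(x, beta, tol=1e-12):
    """Linear predictor sum_j beta_j * log(x_j) under a zero-sum constraint."""
    if abs(sum(beta)) > tol:
        raise ValueError("coefficients must sum to zero")
    return sum(b * math.log(v) for b, v in zip(beta, x))

def closure(x):
    """Rescale positive parts so that they sum to one (a composition)."""
    total = sum(x)
    return [v / total for v in x]

# Hypothetical 4-part composition; only parts 0 and 1 carry nonzero coefficients.
x = closure([12.0, 30.0, 5.0, 53.0])
beta = [0.8, -0.8, 0.0, 0.0]

full = log_contrast(x, beta)

# Scale invariance: multiplying every part by a constant changes nothing,
# because the zero-sum constraint cancels the added log(constant) terms.
scaled = log_contrast([1000.0 * v for v in x], beta)

# Subcomposition coherence: renormalizing only the selected parts gives the
# same predictor, so a sparse zero-sum beta picks out a subcomposition.
sub = log_contrast(closure(x[:2]), beta[:2])

print(full, scaled, sub)
```

Under these assumptions the three values agree up to floating-point error, which is the sense in which a sparse coefficient vector selects a subcomposition rather than individual covariates.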

Article information

Ann. Appl. Stat., Volume 11, Number 2 (2017), 771-791.

Received: November 2015
Revised: December 2016
First available in Project Euclid: 20 July 2017

Keywords: Compositional data analysis; feature selection; homogeneity; log ratio transformations; penalized regression; phylogenetic tree; the lasso


Wang, Tao; Zhao, Hongyu. Structured subcomposition selection in regression and its application to microbiome data analysis. Ann. Appl. Stat. 11 (2017), no. 2, 771–791. doi:10.1214/16-AOAS1017.
