The Annals of Applied Statistics

Structured subcomposition selection in regression and its application to microbiome data analysis

Tao Wang and Hongyu Zhao

Compositional data arise naturally in many practical problems and the analysis of such data presents many statistical challenges, especially in high dimensions. In this article, we consider the problem of subcomposition selection in regression with compositional covariates, where the relationships among the covariates can be represented by a tree with leaf nodes corresponding to covariates. Assuming that the tree structure is available as prior knowledge, we adopt a symmetric version of the linear log contrast model, and propose a tree-guided regularization method for this structured subcomposition selection. Our method is based on a novel penalty function that incorporates the tree structure information node-by-node, encouraging the selection of subcompositions at subtree levels. We show that this optimization problem can be formulated as a generalized lasso problem, the solution of which can be computed efficiently using existing algorithms. An application to a human gut microbiome study and simulations are presented to compare the performance of the proposed method with an $l_{1}$ regularization method where the tree structure information is not utilized.
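As a rough illustration of two ideas in the abstract, the sketch below uses a toy example. It assumes the symmetric log contrast predictor has the usual form $\sum_j \beta_j \log x_j$ with $\sum_j \beta_j = 0$, so that predictions depend only on relative abundances and, when the nonzero coefficients sit on a subset of taxa and sum to zero there, only on that subcomposition. The abstract does not state the exact tree-guided penalty. The node-indicator matrix `D`, the toy tree, and the helper names are therefore hypothetical, chosen only to show what a penalty in generalized lasso form $\lVert D\beta \rVert_1$ can look like.

```python
import numpy as np

def log_contrast_predict(x, beta):
    """Linear log contrast predictor sum_j beta_j * log(x_j) for positive x."""
    return float(np.log(x) @ beta)

def node_indicator_matrix(subtrees, p):
    """One row per tree node: 1 for leaves under that node, 0 elsewhere.

    Assumed illustrative construction, not the paper's stated penalty.
    """
    D = np.zeros((len(subtrees), p))
    for r, leaves in enumerate(subtrees):
        D[r, leaves] = 1.0
    return D

# Hypothetical toy tree on 4 taxa: ((0, 1), (2, 3)), listed node by node.
p = 4
subtrees = [[0], [1], [2], [3], [0, 1], [2, 3], [0, 1, 2, 3]]
D = node_indicator_matrix(subtrees, p)

# Coefficients supported on the subtree {0, 1} and summing to zero.
beta = np.array([1.0, -1.0, 0.0, 0.0])

counts = np.array([10.0, 20.0, 30.0, 40.0])
composition = counts / counts.sum()           # full composition
sub = counts[:2] / counts[:2].sum()           # subcomposition on taxa 0 and 1

pred_counts = log_contrast_predict(counts, beta)
pred_comp = log_contrast_predict(composition, beta)
pred_sub = log_contrast_predict(sub, beta[:2])

# Penalty in generalized lasso form: the l1 norm of D @ beta.
penalty = float(np.abs(D @ beta).sum())
```

In this toy case the three predictions coincide, because the zero-sum constraint cancels any rescaling of the inputs, and only the two leaf rows of `D` contribute to the penalty, since every internal-node sum of `beta` is zero.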

Ann. Appl. Stat., Volume 11, Number 2 (2017), 771-791.

Received: November 2015
Revised: December 2016
First available in Project Euclid: 20 July 2017

Keywords: Compositional data analysis; feature selection; homogeneity; log ratio transformations; penalized regression; phylogenetic tree; the lasso


Wang, Tao; Zhao, Hongyu. Structured subcomposition selection in regression and its application to microbiome data analysis. Ann. Appl. Stat. 11 (2017), no. 2, 771--791. doi:10.1214/16-AOAS1017.

