## The Annals of Applied Statistics

### A penalized regression model for the joint estimation of eQTL associations and gene network structure

#### Abstract

In this work, we present a new approach for jointly performing eQTL mapping and gene network inference while encouraging a transfer of information between the two tasks. We address this problem by formulating it as a multiple-output regression task in which we aim to learn the regression coefficients while simultaneously estimating the conditional independence relationships among the set of response variables. The approach we develop uses structured sparsity penalties to encourage the sharing of information between the regression coefficients and the output network in a mutually beneficial way. Our model, inverse-covariance-fused lasso, is formulated as a biconvex optimization problem that we solve via alternating minimization. We derive new, efficient optimization routines to solve each convex sub-problem that are based on extensions of state-of-the-art methods. Experiments on both simulated data and a yeast eQTL dataset demonstrate that our approach outperforms a large number of existing methods on the recovery of the true sparse structure of both the eQTL associations and the gene network. We also apply our method to a human Alzheimer’s disease dataset and highlight some results that support previous discoveries about the disease.
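The alternating-minimization strategy mentioned in the abstract can be illustrated on a much simpler biconvex problem. The sketch below is not the inverse-covariance-fused lasso objective; it assumes a toy rank-one factorization loss, $f(u, v) = \sum_{ij} (M_{ij} - u_i v_j)^2$, which is convex in `u` for fixed `v` and convex in `v` for fixed `u`. The names `objective` and `alternating_minimization` are hypothetical, and each sub-problem here has a closed-form solution, whereas the sub-problems in the paper carry structured sparsity penalties and require dedicated solvers.

```python
def objective(M, u, v):
    """Squared reconstruction error of the rank-one approximation u v^T."""
    return sum((M[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))


def alternating_minimization(M, iters=50):
    """Alternate exact minimization over u (v fixed) and over v (u fixed)."""
    n, m = len(M), len(M[0])
    u = [1.0] * n
    v = [1.0] * m
    history = []
    for _ in range(iters):
        # u-step: with v fixed, each u[i] is a one-dimensional least-squares fit.
        vv = sum(x * x for x in v)
        if vv == 0.0:
            break  # degenerate case: nothing left to fit
        u = [sum(M[i][j] * v[j] for j in range(m)) / vv for i in range(n)]

        # v-step: with u fixed, each v[j] is a one-dimensional least-squares fit.
        uu = sum(x * x for x in u)
        if uu == 0.0:
            break
        v = [sum(M[i][j] * u[i] for i in range(n)) / uu for j in range(m)]

        history.append(objective(M, u, v))
    return u, v, history


# Toy input (assumed for illustration): an exactly rank-one matrix.
M = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
u, v, history = alternating_minimization(M)
```

Because each step solves its convex sub-problem exactly, the recorded objective values never increase from one iteration to the next. In general this guarantees only convergence of the objective values, not recovery of a global minimum of the biconvex problem.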

#### Article information

Source
Ann. Appl. Stat., Volume 13, Number 1 (2019), 248–270.

Dates
Revised: May 2018
First available in Project Euclid: 10 April 2019

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1554861648

Digital Object Identifier
doi:10.1214/18-AOAS1186

Mathematical Reviews number (MathSciNet)
MR3937428

Zentralblatt MATH identifier
07057427

#### Citation

Marchetti-Bowick, Micol; Yu, Yaoliang; Wu, Wei; Xing, Eric P. A penalized regression model for the joint estimation of eQTL associations and gene network structure. Ann. Appl. Stat. 13 (2019), no. 1, 248–270. doi:10.1214/18-AOAS1186. https://projecteuclid.org/euclid.aoas/1554861648

#### References

• Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T. et al. (2000). Gene ontology: Tool for the unification of biology. Nat. Genet. 25 25–29.
• Banerjee, O., El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516.
• Banerjee, S., Yandell, B. S. and Yi, N. (2008). Bayesian quantitative trait loci mapping for multiple traits. Genetics 179 2275–2289.
• Barabasi, A.-L. and Oltvai, Z. N. (2004). Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 5 101–113.
• Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, New York.
• Brem, R. B. and Kruglyak, L. (2005). The landscape of genetic complexity across 5700 gene expression traits in yeast. Proc. Natl. Acad. Sci. USA 102 1572–1577.
• Breunig, J. S., Hackett, S. R., Rabinowitz, J. D. and Kruglyak, L. (2014). Genetic basis of metabolome variation in yeast. PLoS Genet. 10 e1004142.
• Chen, X., Kim, S., Lin, Q., Carbonell, J. G. and Xing, E. P. (2010). Graph-structured multi-task regression and an efficient optimization method for general fused lasso. ArXiv preprint. Available at arXiv:1005.3579.
• Danaher, P., Wang, P. and Witten, D. M. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. J. R. Stat. Soc. Ser. B. Stat. Methodol. 76 373–397.
• Dempster, A. P. (1972). Covariance selection. Biometrics 28 157–175.
• Flutre, T., Wen, X., Pritchard, J. and Stephens, M. (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. 9 e1003486.
• Friedman, J., Hastie, T. and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 432–441.
• Gardner, T. S. and Faith, J. J. (2005). Reverse-engineering transcription control networks. Phys. Life Rev. 2 65–88.
• Kanehisa, M. and Goto, S. (2000). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28 27–30.
• Kim, S., Sohn, K.-A. and Xing, E. P. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25 i204–i212.
• Kim, S. and Xing, E. P. (2009). Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 5 e1000587.
• Kim, S. and Xing, E. P. (2012). Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping. Ann. Appl. Stat. 6 1095–1117.
• Lee, W. and Liu, Y. (2012). Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. J. Multivariate Anal. 111 241–255.
• Malik, M., Chiles III, J., Xi, H. S., Medway, C., Simpson, J., Potluri, S., Howard, D., Liang, Y., Paumi, C. M., Mukherjee, S. et al. (2015). Genetics of CD33 in Alzheimer’s disease and acute myeloid leukemia. Hum. Mol. Genet. 24 3557–3570.
• Marchetti-Bowick, M., Yu, Y., Wu, W. and Xing, E. P. (2019). Supplement to “A penalized regression model for the joint estimation of eQTL associations and gene network structure.” DOI:10.1214/18-AOAS1186SUPP.
• Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462.
• Michaelson, J. J., Alberts, R., Schughart, K. and Beyer, A. (2010). Data-driven assessment of eQTL mapping methods. BMC Genomics 11 502.
• Nica, A. C., Montgomery, S. B., Dimas, A. S., Stranger, B. E., Beazley, C., Barroso, I. and Dermitzakis, E. T. (2010). Candidate causal regulatory effects by integration of expression QTLs with complex trait genetic associations. PLoS Genet. 6 e1000895.
• Peng, J., Zhou, N. and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. J. Amer. Statist. Assoc. 104 735–746.
• Rai, P., Kumar, A. and Daume, H. (2012). Simultaneously leveraging output and task structures for multiple-output regression. In Advances in Neural Information Processing Systems (NIPS) 3185–3193.
• Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing $\ell_{1}$-penalized log-determinant divergence. Electron. J. Stat. 5 935–980.
• Roberts, C. J. and Selker, E. U. (1995). Mutations affecting the biosynthesis of S-adenosylmethionine cause reduction of DNA methylation in Neurospora crassa. Nucleic Acids Res. 23 4818–4826.
• Rockman, M. V. and Kruglyak, L. (2006). Genetics of global gene expression. Nat. Rev. Genet. 7 862–872.
• Rothman, A. J., Levina, E. and Zhu, J. (2010). Sparse multivariate regression with covariance estimation. J. Comput. Graph. Statist. 19 947–962.
• Sohn, K.-A. and Kim, S. (2012). Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. In International Conference on Artificial Intelligence and Statistics (AISTATS) 1081–1089.
• Stephens, M. (2013). A unified framework for association analysis with multiple related phenotypes. PLoS ONE 8 e65245.
• Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102 15545–15550.
• Tan, K. M., London, P., Mohan, K., Lee, S.-I., Fazel, M. and Witten, D. (2014). Learning graphical models with hubs. J. Mach. Learn. Res. 15 3297–3331.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Wytock, M. and Kolter, Z. (2013). Sparse Gaussian conditional random fields: Algorithms, theory, and application to energy forecasting. In International Conference on Machine Learning (ICML) 1265–1273.
• Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 68 49–67.
• Yuan, X.-T. and Zhang, T. (2014). Partial Gaussian graphical model estimation. IEEE Trans. Inform. Theory 60 1673–1687.
• Zhang, Y. and Yeung, D.-Y. (2010). A convex formulation for learning task relationships in multi-task learning. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI’10) 733–742. AUAI Press, Arlington, VA.
• Zhang, B., Gaiteri, C., Bodea, L.-G., Wang, Z., McElwee, J., Podtelezhnikov, A. A., Zhang, C., Xie, T., Tran, L., Dobrin, R. et al. (2013). Integrated systems approach identifies genetic nodes and networks in late-onset Alzheimer’s disease. Cell 153 707–720.

#### Supplemental materials

• Supplement to “A penalized regression model for the joint estimation of eQTL associations and gene network structure.” We provide a supplementary document [Marchetti-Bowick et al. (2019)] that contains additional details about the optimization algorithm and additional results for both the synthetic and real data experiments.