## The Annals of Applied Statistics

### Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates

#### Abstract

The advent of modern technology has led to a surge of high-dimensional data in biology and health sciences such as genomics, epigenomics and medicine. The high-grade serous ovarian cancer (HGS-OvCa) data reported by The Cancer Genome Atlas (TCGA) Research Network is one example. The TCGA and other research groups have analyzed several aspects of these data. Here we study the relationship between Disease Free Time (DFT) after surgery among ovarian cancer patients and their DNA methylation profiles of genomic features. Such studies pose additional challenges beyond the typical big data problem due to population substructure and censoring. Despite the availability of several methods for analyzing time-to-event data with a large number of covariates but a small sample size, there is no method available to date that accommodates the additional feature of heterogeneity. To this end, we propose a regularized framework based on the finite mixture of accelerated failure time model to capture intangible heterogeneity due to population substructure and to account for censoring simultaneously. We study the properties of the proposed framework both theoretically and numerically. Our data analysis indicates the existence of heterogeneity in the HGS-OvCa data, with one component of the mixture capturing a more aggressive form of the disease, and the second component capturing a less aggressive form. In particular, the second component portrays a significant positive relationship between methylation and DFT for BRCA1. By further unearthing the negative relationship between expression and methylation for this gene, one may provide a biologically reasonable explanation that sheds light on the relationship between DNA methylation, gene expression and mutation.
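The objective described in the abstract, a mixture of accelerated failure time (AFT) models fitted under right-censoring with a sparsity penalty, can be illustrated with a minimal sketch. The abstract does not state the error distribution, the penalty, or any notation, so everything below is an assumption made for illustration: log-normal AFT components, a lasso penalty on the slopes weighted by the mixing proportions, and the hypothetical function name `penalized_loglik`. It is not the authors' implementation.

```python
import math


def norm_logpdf(z):
    """Log density of the standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_sf(z):
    """Survival function 1 - Phi(z) of the standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def penalized_loglik(log_t, delta, X, pis, betas, sigmas, lam):
    """Penalized censored log-likelihood of a K-component AFT mixture.

    Assumptions (not from the source): log-normal errors and an L1
    penalty on the slopes, weighted by the mixing proportions.

    log_t  : observed log times (log of min(event time, censoring time))
    delta  : 1 if the event was observed, 0 if right-censored
    X      : covariate rows, without the intercept
    pis    : mixing proportions, one per component
    betas  : per-component coefficients [intercept, slope_1, ...]
    sigmas : per-component scale parameters
    lam    : non-negative tuning parameter
    """
    total = 0.0
    for y, d, x in zip(log_t, delta, X):
        mix = 0.0
        for pi_k, beta_k, s_k in zip(pis, betas, sigmas):
            mu = beta_k[0] + sum(b * v for b, v in zip(beta_k[1:], x))
            z = (y - mu) / s_k
            if d == 1:
                # Observed event: contributes the component density.
                contrib = math.exp(norm_logpdf(z)) / s_k
            else:
                # Censored: contributes the component survival probability.
                contrib = norm_sf(z)
            mix += pi_k * contrib
        total += math.log(mix)
    # Intercepts are left unpenalized; only slopes are shrunk.
    penalty = lam * sum(
        pi_k * sum(abs(b) for b in beta_k[1:])
        for pi_k, beta_k in zip(pis, betas)
    )
    return total - len(log_t) * penalty


# Toy call: two components, three subjects, one covariate.
value = penalized_loglik(
    log_t=[0.2, 1.5, 0.9],
    delta=[1, 0, 1],
    X=[[0.1], [0.8], [0.4]],
    pis=[0.6, 0.4],
    betas=[[0.0, 1.0], [1.0, -0.5]],
    sigmas=[1.0, 0.7],
    lam=0.1,
)
```

In such a formulation, maximizing over the mixing proportions, coefficients, and scales (for example with an EM-type algorithm) would estimate the subpopulation structure and shrink some slopes to zero in the same fit, which is the sense in which heterogeneity and censoring are handled simultaneously.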

#### Article information

Source
Ann. Appl. Stat., Volume 13, Number 1 (2019), 444-465.

Dates
Revised: March 2018
First available in Project Euclid: 10 April 2019

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1554861656

Digital Object Identifier
doi:10.1214/18-AOAS1198

Mathematical Reviews number (MathSciNet)
MR3937436

Zentralblatt MATH identifier
07057435

#### Citation

Shokoohi, Farhad; Khalili, Abbas; Asgharian, Masoud; Lin, Shili. Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates. Ann. Appl. Stat. 13 (2019), no. 1, 444–465. doi:10.1214/18-AOAS1198. https://projecteuclid.org/euclid.aoas/1554861656

#### References

• Bolton, K. L., Chenevix-Trench, G., Goh, C., Sadetzki, S., Ramus, S. J., Karlan, B. Y. et al. (2012). Association between BRCA1 and BRCA2 mutations and survival in women with invasive epithelial ovarian cancer. JAMA 307 382–389.
• Cai, J., Fan, J., Li, R. and Zhou, H. (2005). Variable selection for multivariate failure time data. Biometrika 92 303–316.
• Cerami, E., Gao, J., Dogrusoz, U., Gross, B. E., Sumer, S. O., Aksoy, B. A. et al. (2012). The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discov. 2 401.
• Chen, J. and Tan, X. (2009). Inference for multivariate normal mixtures. J. Multivariate Anal. 100 1367–1383.
• Earp, M. A. and Cunningham, J. M. (2015). DNA methylation changes in epithelial ovarian cancer histotypes. Genomics 106 311–321.
• Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360.
• Fan, J. and Li, R. (2002). Variable selection for Cox’s proportional hazards model and frailty model. Ann. Statist. 30 74–99.
• Faraggi, D. and Simon, R. (1998). Bayesian variable selection method for censored survival data. Biometrics 54 1475–1485.
• Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O. et al. (2013). Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Sci. Signal. 6 pl1.
• Gilbride, T. J., Allenby, G. M. and Brazell, J. D. (2006). Models for heterogeneous variable selection. J. Mark. Res. 43 420–430.
• Han, S. W., Chen, G., Cheon, M.-S. and Zhong, H. (2016). Estimation of directed acyclic graphs through two-stage adaptive lasso for gene network inference. J. Amer. Statist. Assoc. 111 1004–1019.
• Hunter, D. R., Wang, S. and Hettmansperger, T. P. (2007). Inference for mixtures of symmetric distributions. Ann. Statist. 35 224–251.
• Kasahara, H. and Shimotsu, K. (2015). Testing the number of components in normal mixture regression models. J. Amer. Statist. Assoc. 110 1632–1645.
• Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. J. Amer. Statist. Assoc. 102 1025–1038.
• Kong, B., Lv, Z.-D., Wang, Y., Jin, L.-Y., Ding, L. and Yang, Z.-C. (2015). Down-regulation of BRMS1 by DNA hypermethylation and its association with metastatic progression in triple-negative breast cancer. Int. J. Clin. Exp. Pathol. 8 11076–11083.
• Koukoura, O., Spandidos, D. A., Daponte, A. and Sifakis, S. (2014). DNA methylation profiles in ovarian cancer: Implication in diagnosis and therapy (review). Mol. Med. Rep. 10 3–9.
• Kwon, M. J. and Shin, Y. K. (2011). Epigenetic regulation of cancer-associated genes in ovarian cancer. Int. J. Mol. Sci. 12 983–1008.
• Kwong, J., Lee, J.-Y., Wong, K.-K., Zhou, X., Wong, D. T. W., Lo, K.-W. et al. (2006). Candidate tumor-suppressor gene DLEC1 is frequently downregulated by promoter hypermethylation and histone hypoacetylation in human epithelial ovarian cancer. Neoplasia 8 268–278.
• Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data, 2nd ed. Wiley-Interscience, Hoboken, NJ.
• Lee, K. H., Chakraborty, S. and Sun, J. (2011). Bayesian variable selection in semiparametric proportional hazards model for high dimensional survival data. Int. J. Biostat. 7 Art. 21.
• Liu, J., Zhang, R., Zhao, W. and Lv, Y. (2015). Variable selection in semiparametric hazard regression for multivariate survival data. J. Multivariate Anal. 142 26–40.
• Lu, Z.-H. (2009). Covariate selection in mixture models with the censored response variable. Comput. Statist. Data Anal. 53 2710–2723.
• Lv, J. and Fan, Y. (2009). A unified approach to model selection and sparse recovery using regularized least squares. Ann. Statist. 37 3498–3528.
• McLachlan, G. J. and McGiffin, D. C. (1994). On the role of finite mixture models in survival analysis. Stat. Methods Med. Res. 3 211–226.
• McLachlan, G. and Peel, D. (2004). Finite Mixture Models. Wiley, New York.
• Schöndorf, T., Ebert, M. P., Hoffmann, J., Becker, M., Moser, N., Pur, Ş. et al. (2004). Hypermethylation of the PTEN gene in ovarian cancer cell lines. Cancer Lett. 207 215–220.
• Sha, N., Tadesse, M. G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics 22 2262–2268.
• Shokoohi, F., Khalili, A., Asgharian, M. and Lin, S. (2019). Supplement to “Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates.” DOI:10.1214/18-AOAS1198SUPP.
• Sohn, I., Kim, J., Jung, S.-H. and Park, C. (2009). Gradient Lasso for Cox proportional hazards model. Bioinformatics 25 1775–1781.
• Städler, N., Bühlmann, P. and van de Geer, S. (2010). $L_{1}$-penalization for mixture regression models. TEST 19 209–256.
• The Cancer Genome Atlas Research Network (2011). Integrated genomic analyses of ovarian carcinoma. Nature 474 609–615.
• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288.
• Tibshirani, R. (1997). The Lasso method for variable selection in the Cox model. Stat. Med. 16 385–395.
• Volinsky, C. T. and Raftery, A. E. (2000). Bayesian information criterion for censored survival models. Biometrics 56 256–262.
• Yang, D., Khan, S., Sun, Y., Hess, K., Shmulevich, I., Sood, A. K. et al. (2011). Association of BRCA1 and BRCA2 mutations with survival, chemotherapy sensitivity, and gene mutator phenotype in patients with ovarian cancer. JAMA 306 1557–1565.
• Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. Ann. Statist. 38 894–942.
• Zhang, F., Ding, J. and Lin, S. (2017). Testing for associations of opposite directionality in a heterogeneous population. Statist. Biosci. 9 137–159.
• Zhu, H.-T. and Zhang, H. (2004). Hypothesis testing in mixture regression models. J. R. Stat. Soc. Ser. B. Stat. Methodol. 66 3–16.
• Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429.

#### Supplemental materials

• Supplement to “Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates.” Supplementary materials referenced in Sections 2–4, including regularity conditions, proofs, numerical approaches, supplementary tables and figures, and the fmrs output, are available with this paper at the Annals of Applied Statistics website.