The Annals of Applied Statistics

Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates

Farhad Shokoohi, Abbas Khalili, Masoud Asgharian, and Shili Lin



The advent of modern technology has led to a surge of high-dimensional data in biology and health sciences such as genomics, epigenomics and medicine. The high-grade serous ovarian cancer (HGS-OvCa) data reported by The Cancer Genome Atlas (TCGA) Research Network is one example. The TCGA and other research groups have analyzed several aspects of these data. Here we study the relationship between Disease Free Time (DFT) after surgery among ovarian cancer patients and their DNA methylation profiles of genomic features. Such studies pose additional challenges beyond the typical big data problem due to population substructure and censoring. Despite the availability of several methods for analyzing time-to-event data with a large number of covariates but a small sample size, there is no method available to date that accommodates the additional feature of heterogeneity. To this end, we propose a regularized framework based on the finite mixture of accelerated failure time model to capture intangible heterogeneity due to population substructure and to account for censoring simultaneously. We study the properties of the proposed framework both theoretically and numerically. Our data analysis indicates the existence of heterogeneity in the HGS-OvCa data, with one component of the mixture capturing a more aggressive form of the disease, and the second component capturing a less aggressive form. In particular, the second component portrays a significant positive relationship between methylation and DFT for BRCA1. By further unearthing the negative relationship between expression and methylation for this gene, one may provide a biologically reasonable explanation that sheds light on the relationship between DNA methylation, gene expression and mutation.
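
To illustrate the kind of objective a regularized finite mixture of AFT models optimizes, the following is a minimal sketch and not the authors' implementation. It assumes log-normal errors in each component and a lasso-type penalty weighted by the mixing proportions; the abstract specifies neither choice. The function name `penalized_loglik` and all of its arguments are hypothetical. Observed events contribute the component density of the log-time, and right-censored observations contribute the component survival probability.

```python
import math

def norm_pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_sf(z):
    # Standard normal survival function, 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def penalized_loglik(y, delta, X, pis, betas, sigmas, lam):
    """Penalized log-likelihood of a K-component log-normal AFT mixture.

    y      : observed log-times, min(log T, log C)
    delta  : 1 if the event was observed, 0 if right-censored
    X      : list of covariate vectors (one per subject)
    pis    : mixing proportions (assumed to sum to 1)
    betas  : one coefficient vector per component
    sigmas : one scale parameter per component
    lam    : lasso tuning parameter (assumed penalty; every coefficient
             passed in is penalized, so exclude the intercept if desired)
    """
    n = len(y)
    loglik = 0.0
    for yi, di, xi in zip(y, delta, X):
        mix = 0.0
        for pi_k, beta_k, sigma_k in zip(pis, betas, sigmas):
            z = (yi - sum(b * x for b, x in zip(beta_k, xi))) / sigma_k
            # Density for an observed event, survival for a censored one.
            contrib = norm_pdf(z) / sigma_k if di == 1 else norm_sf(z)
            mix += pi_k * contrib
        loglik += math.log(mix)
    # L1 penalty on each component, weighted by its mixing proportion.
    penalty = n * lam * sum(
        pi_k * sum(abs(b) for b in beta_k) for pi_k, beta_k in zip(pis, betas)
    )
    return loglik - penalty
```

In practice such an objective would be maximized with an EM-type algorithm over a grid of `lam` values, and coefficients shrunk to zero within a component would indicate covariates with no effect in that subpopulation.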

Article information

Ann. Appl. Stat., Volume 13, Number 1 (2019), 444-465.

Received: May 2017
Revised: March 2018
First available in Project Euclid: 10 April 2019


Keywords: DNA methylation; ovarian cancer; finite mixture of AFT model; penalized regression; right censoring


Shokoohi, Farhad; Khalili, Abbas; Asgharian, Masoud; Lin, Shili. Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates. Ann. Appl. Stat. 13 (2019), no. 1, 444–465. doi:10.1214/18-AOAS1198.




Supplemental materials

  • Supplement to “Capturing heterogeneity of covariate effects in hidden subpopulations in the presence of censoring and large number of covariates.” The supplementary materials referenced in Sections 2–4, including regularity conditions, proofs, numerical approaches, supplementary tables and figures, and the fmrs output, are available with this paper at the Annals of Applied Statistics website.