The Annals of Applied Statistics

Flexible risk prediction models for left or interval-censored data from electronic health records

Noorie Hyun, Li C. Cheung, Qing Pan, Mark Schiffman, and Hormuzd A. Katki



Electronic health records are a large and cost-effective data source for developing risk-prediction models. However, for screen-detected diseases, standard risk models (such as Kaplan–Meier or Cox models) do not account for key issues encountered with electronic health record data: left-censoring of pre-existing (prevalent) disease, interval-censoring of incident disease, and ambiguity about whether disease is prevalent or incident when definitive disease ascertainment is not conducted at baseline. Furthermore, researchers might conduct novel screening tests only on a complex two-phase subsample. We propose a family of weighted mixture models that account for left/interval-censoring and complex sampling via inverse-probability weighting to estimate current and future absolute risk. Specifically, we propose a weakly-parametric model for general use and a semiparametric model for checking the goodness of fit of the weakly-parametric model. We demonstrate asymptotic properties analytically and by simulation. We used electronic health records to assemble a cohort of 33,295 human papillomavirus (HPV) positive women undergoing cervical cancer screening at Kaiser Permanente Northern California (KPNC) that underlies current screening guidelines. The next guidelines would focus on HPV typing tests, but reporting 14 HPV types is too complex for clinical use. The National Cancer Institute, along with KPNC, conducted an HPV typing test on a complex subsample of 9258 women in the cohort. We used our model to estimate the risk due to each type and grouped the 14 types (3-year risks range from 21.9% to 1.5%) into 4 risk-bands to simplify reporting to clinicians and guidelines. These risk-bands could be adopted by future HPV typing tests and future screening guidelines.
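As a rough illustration of the idea in the abstract, the sketch below evaluates an inverse-probability-weighted log-likelihood for a prevalence-incidence mixture. It is not the authors' implementation: the covariate-free form, the Weibull choice for incident disease, the record layout `(left, right, weight)`, and all function names and numeric values are assumptions made for this example. A record with `left == 0` is taken to mean that no negative screen preceded detection, so the disease may be prevalent (probability `p`) or incident by time `right`.

```python
import math


def weibull_cdf(t, shape, scale):
    """Weibull CDF F(t) = 1 - exp(-(t / scale) ** shape); F(0) = 0, F(inf) = 1."""
    return 1.0 - math.exp(-((t / scale) ** shape))


def weighted_loglik(p, shape, scale, records):
    """Weighted log-likelihood of a prevalence-incidence mixture (illustrative only).

    p       : probability of prevalent disease at baseline, assumed in (0, 1)
    records : iterable of (left, right, weight), where disease onset lies in
              (left, right] and weight is the inverse sampling probability.
              right = math.inf encodes right-censoring at `left`;
              left = right = 0 encodes disease found at baseline.
    """
    total = 0.0
    for left, right, weight in records:
        # Incident disease arising between the two screening times.
        incident = (1.0 - p) * (
            weibull_cdf(right, shape, scale) - weibull_cdf(left, shape, scale)
        )
        # Prevalent disease is only possible if no negative screen precedes detection.
        prevalent = p if left == 0 else 0.0
        total += weight * math.log(prevalent + incident)
    return total


# Hypothetical records: times in years, weights are made-up inverse sampling probabilities.
records = [
    (0.0, 0.0, 1.0),       # disease found at baseline (prevalent)
    (0.0, 1.5, 1.0),       # found at first screen, prevalent or incident unknown
    (1.0, 3.0, 3.6),       # negative at year 1, positive at year 3 (interval-censored)
    (5.0, math.inf, 3.6),  # negative at year 5, then lost to follow-up (right-censored)
]
print(weighted_loglik(0.1, 1.2, 40.0, records))
```

In practice one would maximize this function over `p`, `shape`, and `scale` with a numerical optimizer and use a variance estimator that respects the two-phase design; both steps are omitted here.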

Article information

Ann. Appl. Stat., Volume 11, Number 2 (2017), 1063-1084.

Received: July 2016
Revised: February 2017
First available in Project Euclid: 20 July 2017


Keywords: Mixture model; interval censoring; two-phase sampling; B-splines; weighted likelihood; HIV


Hyun, Noorie; Cheung, Li C.; Pan, Qing; Schiffman, Mark; Katki, Hormuzd A. Flexible risk prediction models for left or interval-censored data from electronic health records. Ann. Appl. Stat. 11 (2017), no. 2, 1063–1084. doi:10.1214/17-AOAS1036.




Supplemental materials

  • Supplement to “Flexible risk prediction models for left or interval-censored data from electronic health records”. The supplementary materials in the attached file include proofs of model identifiability and of asymptotic properties of the estimates, such as consistency and weak convergence to a normal distribution, under certain regularity conditions. They also summarize the simulation studies and their results.