The Annals of Applied Statistics
- Ann. Appl. Stat.
- Volume 11, Number 2 (2017), 1063-1084.
Flexible risk prediction models for left or interval-censored data from electronic health records
Electronic health records are a large and cost-effective data source for developing risk-prediction models. However, for screen-detected diseases, standard risk models (such as Kaplan–Meier or Cox models) do not account for key issues encountered with electronic health record data: left-censoring of pre-existing (prevalent) disease, interval-censoring of incident disease, and ambiguity about whether disease is prevalent or incident when definitive disease ascertainment is not conducted at baseline. Furthermore, researchers might conduct novel screening tests only on a complex two-phase subsample. We propose a family of weighted mixture models that account for left/interval-censoring and for complex sampling, via inverse-probability weighting, in order to estimate current and future absolute risk: a weakly-parametric model for general use and a semiparametric model for checking the goodness of fit of the weakly-parametric model. We demonstrate asymptotic properties analytically and by simulation. We used electronic health records to assemble a cohort of 33,295 human papillomavirus (HPV) positive women undergoing cervical cancer screening at Kaiser Permanente Northern California (KPNC), a cohort that underlies current screening guidelines. The next guidelines would focus on HPV typing tests, but reporting 14 HPV types is too complex for clinical use. The National Cancer Institute, along with KPNC, conducted an HPV typing test on a complex subsample of 9,258 women in the cohort. We used our model to estimate the risk due to each type and grouped the 14 types (whose 3-year risks range from 21.9% to 1.5%) into 4 risk-bands to simplify reporting to clinicians and guidelines. These risk-bands could be adopted by future HPV typing tests and future screening guidelines.
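To make the structure of such a model concrete, the following is a minimal sketch of an inverse-probability-weighted log-likelihood for a prevalence-incidence mixture with left-, interval-, and right-censored observations. It is not the authors' weakly-parametric or semiparametric model: the exponential incidence distribution, the function names `cumulative_risk` and `weighted_loglik`, the record layout `(left, right, sampling_prob)`, and the toy data are all assumptions made for illustration.

```python
import math


def cumulative_risk(t, prevalence, rate):
    """Absolute risk of disease by time t under the assumed mixture:
    disease is prevalent at baseline with probability `prevalence`;
    otherwise the time to incident disease is exponential with `rate`."""
    incident_by_t = 1.0 - math.exp(-rate * t)
    return prevalence + (1.0 - prevalence) * incident_by_t


def weighted_loglik(prevalence, rate, records):
    """Inverse-probability-weighted log-likelihood.

    Each record is (left, right, sampling_prob), a hypothetical layout:
      left=None  -> left-censored: disease found by `right`, and it is
                    unknown whether it was prevalent or incident
                    (right=0 means disease found at baseline)
      right=None -> right-censored: still disease-free at `left`
      otherwise  -> incident disease arose in the interval (left, right]
    `sampling_prob` is the probability that the subject was sampled.
    """
    total = 0.0
    for left, right, sampling_prob in records:
        if right is None:
            contrib = 1.0 - cumulative_risk(left, prevalence, rate)
        elif left is None:
            contrib = cumulative_risk(right, prevalence, rate)
        else:
            contrib = (cumulative_risk(right, prevalence, rate)
                       - cumulative_risk(left, prevalence, rate))
        # Weight each subject by the inverse of the sampling probability.
        total += math.log(contrib) / sampling_prob
    return total


# Toy records (made up for illustration, times in years).
toy_records = [
    (None, 0.0, 1.0),  # disease found at baseline
    (None, 1.0, 0.5),  # no baseline ascertainment, disease found at year 1
    (1.0, 3.0, 0.5),   # disease-free at year 1, disease found at year 3
    (3.0, None, 1.0),  # disease-free at last visit, year 3
]

if __name__ == "__main__":
    print(weighted_loglik(0.10, 0.05, toy_records))
    print(cumulative_risk(3.0, 0.10, 0.05))  # 3-year absolute risk
```

In practice the two parameters would be estimated by maximizing `weighted_loglik` with a numerical optimizer, and covariates such as HPV type would enter through the prevalence and incidence components. Both steps are omitted here to keep the sketch short.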
Received: July 2016
Revised: February 2017
First available in Project Euclid: 20 July 2017
Hyun, Noorie; Cheung, Li C.; Pan, Qing; Schiffman, Mark; Katki, Hormuzd A. Flexible risk prediction models for left or interval-censored data from electronic health records. Ann. Appl. Stat. 11 (2017), no. 2, 1063--1084. doi:10.1214/17-AOAS1036. https://projecteuclid.org/euclid.aoas/1500537735
- Supplement to “Flexible risk prediction models for left or interval-censored data from electronic health records”. The supplementary materials in the attached file include proofs of model identifiability and of asymptotic results for the estimates, such as consistency and weak convergence to a normal distribution under certain regularity conditions. The simulation studies and their results are also summarized in the supplementary materials.