The Annals of Applied Statistics

Risk prediction for prostate cancer recurrence through regularized estimation with simultaneous adjustment for nonlinear clinical effects

Qi Long, Matthias Chung, Carlos S. Moreno, and Brent A. Johnson

Full-text: Open access


In biomedical studies it is of substantial interest to develop risk prediction scores using high-dimensional data such as gene expression data for clinical endpoints that are subject to censoring. In the presence of well-established clinical risk factors, investigators often prefer a procedure that also adjusts for these clinical variables. While accelerated failure time (AFT) models are a useful tool for the analysis of censored outcome data, they assume that covariate effects on the logarithm of time-to-event are linear, which is often unrealistic in practice. We propose to build risk prediction scores through regularized rank estimation in partly linear AFT models, where high-dimensional data such as gene expression data are modeled linearly and important clinical variables are modeled nonlinearly using penalized regression splines. We show through simulation studies that our model has better operating characteristics than several existing models. In particular, we show that there is a nonnegligible effect on prediction as well as feature selection when nonlinear clinical effects are misspecified as linear. This work is motivated by a recent prostate cancer study, where investigators collected gene expression data along with established prognostic clinical variables and the primary endpoint is time to prostate cancer recurrence.
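For orientation, the model class described above can be written out as follows. This is a sketch in illustrative notation, not necessarily the notation or exact penalty used in the article: $T_i$ is the time to recurrence, $\tilde T_i = \min(T_i, C_i)$ the observed follow-up time, $\delta_i$ the event indicator, $X_i$ the gene expression vector, $Z_i$ a clinical variable, and the truncated power basis of degree $p$ with knots $\kappa_1 < \dots < \kappa_K$ is an assumed choice of spline basis. No intercept appears because a rank-based loss depends only on differences of residuals.

```latex
% Partly linear AFT model: genes enter linearly, the clinical variable through phi
\[
  \log T_i = X_i^\top \beta + \phi(Z_i) + \varepsilon_i, \qquad i = 1, \dots, n .
\]

% Regression spline approximation of phi (assumed truncated power basis)
\[
  \phi(z) \approx B(z)^\top \alpha, \qquad
  B(z) = \bigl(z, \dots, z^p, (z-\kappa_1)_+^p, \dots, (z-\kappa_K)_+^p\bigr)^\top .
\]

% Gehan-type rank loss on residuals e_i = log(T~_i) - X_i' beta - B(Z_i)' alpha
\[
  L_n(\beta, \alpha) = n^{-2} \sum_{i=1}^n \sum_{j=1}^n \delta_i \,(e_i - e_j)^-,
  \qquad a^- = \max(-a, 0) .
\]

% Regularized objective: roughness penalty on knot coefficients, lasso on gene effects
\[
  (\hat\beta, \hat\alpha) = \arg\min_{\beta, \alpha}
  \Bigl\{ L_n(\beta, \alpha)
        + \lambda \sum_{k=1}^{K} \alpha_{p+k}^2
        + \gamma \sum_{j=1}^{d} |\beta_j| \Bigr\} .
\]
```

Here $\lambda$ controls the smoothness of the fitted clinical effect and $\gamma$ controls how many gene coefficients are shrunk to zero; both would typically be chosen by cross-validation or a similar criterion.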
We analyzed the prostate cancer data and evaluated prediction performance of several models based on the extended c statistic for censored data, showing that (1) the relationship between the clinical variable prostate-specific antigen (PSA) and prostate cancer recurrence is likely nonlinear, that is, the time to recurrence decreases as PSA increases and it starts to level off when PSA becomes greater than 11; (2) correct specification of this nonlinear effect improves performance in prediction and feature selection; and (3) addition of gene expression data does not seem to further improve the performance of the resultant risk prediction scores.
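To make the evaluation criterion concrete, the following is a minimal sketch of a Harrell-type concordance (c) statistic for right-censored data. It is an illustration under stated assumptions, not necessarily the extended c statistic used in the article; the function name `concordance_index` and the toy data are hypothetical.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell-type c statistic for right-censored data (illustrative sketch).

    times       -- observed follow-up times (event or censoring time)
    events      -- 1 if recurrence was observed, 0 if censored
    risk_scores -- higher score means higher predicted risk (shorter time)
    """
    concordant = 0.0
    usable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is usable only if the two times differ and the shorter
        # time is an observed event; otherwise the ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        usable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # higher risk recurred earlier
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied scores count as half
    if usable == 0:
        raise ValueError("no usable pairs: c statistic is undefined")
    return concordant / usable


# Hypothetical toy data: four patients, the third one censored.
toy_times = [2.0, 4.0, 6.0, 8.0]
toy_events = [1, 1, 0, 1]
toy_scores = [4.0, 3.0, 2.0, 1.0]   # risk decreases as follow-up time grows
c_perfect = concordance_index(toy_times, toy_events, toy_scores)
```

A value of 0.5 corresponds to a score with no discriminatory ability and 1 to perfect ranking of recurrence times among usable pairs. Comparing this quantity across models, with and without gene expression data, is one way to read finding (3).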

Article information

Ann. Appl. Stat., Volume 5, Number 3 (2011), 2003-2023.

First available in Project Euclid: 13 October 2011

Keywords: Accelerated failure time model; feature selection; Lasso; partly linear model; penalized splines; rank estimation; risk prediction


Long, Qi; Chung, Matthias; Moreno, Carlos S.; Johnson, Brent A. Risk prediction for prostate cancer recurrence through regularized estimation with simultaneous adjustment for nonlinear clinical effects. Ann. Appl. Stat. 5 (2011), no. 3, 2003--2023. doi:10.1214/11-AOAS458.
