The Annals of Applied Statistics

Oblique random survival forests

Byron C. Jaeger, D. Leann Long, Dustin M. Long, Mario Sims, Jeff M. Szychowski, Yuan-I Min, Leslie A. Mcclure, George Howard, and Noah Simon

Abstract

We introduce and evaluate the oblique random survival forest (ORSF). The ORSF is an ensemble method for right-censored survival data that uses linear combinations of input variables to recursively partition a set of training data. Regularized Cox proportional hazards models are used to identify linear combinations of input variables in each recursive partitioning step. Benchmark results using simulated and real data indicate that the ORSF’s predicted risk function has high prognostic value in comparison to random survival forests, conditional inference forests, regression and boosting. In an application to data from the Jackson Heart Study, we demonstrate variable and partial dependence using the ORSF and highlight characteristics of its ten-year predicted risk function for atherosclerotic cardiovascular disease events (ASCVD; stroke, coronary heart disease). We present visualizations comparing variable and partial effect estimation according to the ORSF, the conditional inference forest, and the Pooled Cohort Risk equations. The obliqueRSF R package, which provides functions to fit the ORSF and create variable and partial dependence plots, is available on the Comprehensive R Archive Network (CRAN).
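The idea of an oblique split can be illustrated with a minimal sketch. The code below is not the obliqueRSF implementation: the coefficient vector `beta` is fixed by hand (in the ORSF it would come from a regularized Cox fit), the toy data are invented, and scoring a candidate split with a two-group log-rank statistic is an assumption made here for illustration. All function names are hypothetical.

```python
from typing import List, Sequence


def linear_predictor(row: Sequence[float], beta: Sequence[float]) -> float:
    """Linear combination of the input variables for one observation."""
    return sum(b * x for b, x in zip(beta, row))


def oblique_split(X, beta, cutpoint) -> List[bool]:
    """True sends an observation to the right child, False to the left."""
    return [linear_predictor(row, beta) > cutpoint for row in X]


def log_rank_chi2(times, events, in_right) -> float:
    """Two-group log-rank chi-square statistic (assumed split score)."""
    obs_minus_exp = 0.0
    variance = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in sorted({s for s, e in zip(times, events) if e == 1}):
        at_risk = [i for i, s in enumerate(times) if s >= t]
        n = len(at_risk)
        n_right = sum(1 for i in at_risk if in_right[i])
        died = [i for i in at_risk if times[i] == t and events[i] == 1]
        d = len(died)
        d_right = sum(1 for i in died if in_right[i])
        obs_minus_exp += d_right - d * n_right / n
        if n > 1:
            p = n_right / n
            variance += d * p * (1 - p) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / variance if variance > 0 else 0.0


# Invented toy data: risk grows with x1 + x2, so time = 10 - 2 * (x1 + x2).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (0, 2), (2, 2), (1, 2)]
times = [10, 8, 8, 6, 6, 6, 2, 4]
events = [0, 1, 1, 1, 1, 1, 1, 1]  # 0 = right-censored, 1 = event observed

# Oblique split: uses both variables through a hand-picked beta (assumption).
oblique_chi2 = log_rank_chi2(times, events, oblique_split(X, (1.0, 1.0), 1.5))
# Axis-aligned split: a special case whose beta has one nonzero entry.
axis_chi2 = log_rank_chi2(times, events, oblique_split(X, (1.0, 0.0), 0.5))

print(f"oblique split: {oblique_chi2:.2f}, axis-aligned split: {axis_chi2:.2f}")
```

On these toy data the oblique split separates the two child nodes more strongly than the split on a single variable, because risk depends on the sum of both inputs. An axis-aligned split is recovered by setting all but one coefficient to zero.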

Article information

Source
Ann. Appl. Stat., Volume 13, Number 3 (2019), 1847-1883.

Dates
Received: November 2018
Revised: April 2019
First available in Project Euclid: 17 October 2019

Permanent link to this document
https://projecteuclid.org/euclid.aoas/1571277776

Digital Object Identifier
doi:10.1214/19-AOAS1261

Mathematical Reviews number (MathSciNet)
MR4019160

Keywords
Random forest; survival; machine learning; penalized regression; cardiovascular disease

Citation

Jaeger, Byron C.; Long, D. Leann; Long, Dustin M.; Sims, Mario; Szychowski, Jeff M.; Min, Yuan-I; Mcclure, Leslie A.; Howard, George; Simon, Noah. Oblique random survival forests. Ann. Appl. Stat. 13 (2019), no. 3, 1847–1883. doi:10.1214/19-AOAS1261. https://projecteuclid.org/euclid.aoas/1571277776


Supplemental materials

  • Source code for analyses presented in the oblique random survival forest manuscript. Provides the R scripts used to generate the results presented in the manuscript. In particular, the scripts were used to conduct the simulation/resampling study and to apply oblique random survival forests to the Jackson Heart Study.