The Annals of Statistics

Q-learning with censored data

Yair Goldberg and Michael R. Kosorok

Full-text: Open access


We develop methodology for a multistage decision problem with a flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite-sample bounds on the generalization error of the policy learned by the algorithm, and show that when the optimal Q-function belongs to the approximation space, the expected survival time for policies obtained by the algorithm converges to that of the optimal policy. We simulate a multistage clinical trial with a flexible number of stages and apply the proposed censored-Q-learning algorithm to find individualized treatment regimens. The methodology presented in this paper has implications for the design of personalized medicine trials in cancer and in other life-threatening diseases.
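The abstract does not spell out how the algorithm adjusts for censoring, so the following is only a rough illustration and not the authors' algorithm. It assumes inverse-probability-of-censoring weighting with a Kaplan–Meier estimate of the censoring distribution, linear Q-functions, binary actions, and at most two stages, and fits the Q-functions by backward induction. All function names, the feature map, and the synthetic data are hypothetical.

```python
import numpy as np


def censoring_weights(obs_time, event):
    """Return event_i / G(T_i-), where G is the Kaplan-Meier estimate of the
    censoring survival function. Assumes no tied observation times."""
    order = np.argsort(obs_time)
    censored_sorted = 1.0 - event[order]      # censoring plays the role of the "event"
    n = len(obs_time)
    at_risk = n - np.arange(n)                # subjects still under observation
    g = np.cumprod(1.0 - censored_sorted / at_risk)
    g_left = np.concatenate(([1.0], g[:-1]))  # left limit G(t-)
    weights = np.zeros(n)
    weights[order] = event[order] / g_left    # censored trajectories get weight 0
    return weights


def features(state, action):
    # Assumed linear Q-model with a state-by-action interaction.
    return np.column_stack([np.ones_like(state), state, action, state * action])


def fit_q(state, action, target, weights):
    """Weighted least squares fit of the Q-function coefficients."""
    sw = np.sqrt(weights)
    x = features(state, action)
    coef, *_ = np.linalg.lstsq(x * sw[:, None], target * sw, rcond=None)
    return coef


def q_values(coef, state):
    """Return an (n, 2) array holding Q(s, a) for a = 0 and a = 1."""
    return np.column_stack(
        [features(state, np.full_like(state, a)) @ coef for a in (0.0, 1.0)]
    )


def censored_q_sketch(s1, a1, r1, s2, a2, r2, reached2, weights):
    """Backward induction: fit stage 2 first, then stage 1 on r1 + max_a Q2."""
    m = reached2                               # only some patients reach stage 2
    coef2 = fit_q(s2[m], a2[m], r2[m], weights[m])
    v2 = np.zeros_like(r1)
    v2[m] = q_values(coef2, s2[m]).max(axis=1)
    coef1 = fit_q(s1, a1, r1 + v2, weights)
    return coef1, coef2


# Synthetic two-stage data (purely illustrative, not from the paper).
rng = np.random.default_rng(0)
n = 2000
s1 = rng.uniform(0, 1, n)
a1 = rng.integers(0, 2, n).astype(float)
r1 = 1.0 + 0.5 * a1 + rng.uniform(0, 0.2, n)   # survival time during stage 1
reached2 = rng.random(n) < 0.8                  # flexible number of stages
s2 = rng.uniform(0, 1, n)
a2 = rng.integers(0, 2, n).astype(float)
r2 = np.where(reached2, 1.5 + a2 * (2 * s2 - 1) + rng.uniform(0, 0.2, n), 0.0)

survival = r1 + r2
censor = rng.exponential(8.0, n)                # independent censoring times
obs_time = np.minimum(survival, censor)
event = (survival <= censor).astype(float)      # 1 = failure observed

weights = censoring_weights(obs_time, event)
coef1, coef2 = censored_q_sketch(s1, a1, r1, s2, a2, r2, reached2, weights)

# Greedy policy: pick the action with the larger estimated Q-value.
stage2_actions = q_values(coef2, np.array([0.1, 0.9])).argmax(axis=1)
stage1_actions = q_values(coef1, np.array([0.5])).argmax(axis=1)
```

In this sketch, censored trajectories are dropped from the fit and uncensored ones are up-weighted to compensate, which is a simplification: it discards the partial information that censored trajectories carry about earlier stages.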

Article information

Ann. Statist., Volume 40, Number 1 (2012), 529-560.

First available in Project Euclid: 7 May 2012

Digital Object Identifier: doi:10.1214/12-AOS968

Subjects: Primary: 62G05: Estimation; 62G20: Asymptotic properties; 62N02: Estimation

Keywords: Q-learning; reinforcement learning; survival analysis; generalization error


Citation: Goldberg, Yair; Kosorok, Michael R. Q-learning with censored data. Ann. Statist. 40 (2012), no. 1, 529–560. doi:10.1214/12-AOS968.


