The Annals of Applied Statistics

Tree-based reinforcement learning for estimating optimal dynamic treatment regimes

Yebin Tao, Lu Wang, and Daniel Almirall



Dynamic treatment regimes (DTRs) are sequences of treatment decision rules, in which treatment may be adapted over time in response to the changing course of an individual. Motivated by the substance use disorder (SUD) study, we propose a tree-based reinforcement learning (T-RL) method to directly estimate optimal DTRs in a multi-stage multi-treatment setting. At each stage, T-RL builds an unsupervised decision tree that directly handles the problem of optimization with multiple treatment comparisons, through a purity measure constructed with augmented inverse probability weighted estimators. For the multiple stages, the algorithm is implemented recursively using backward induction. By combining semiparametric regression with flexible tree-based learning, T-RL is robust, efficient and easy to interpret for the identification of optimal DTRs, as shown in the simulation studies. With the proposed method, we identify dynamic SUD treatment regimes for adolescents.
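The abstract's single-stage building block can be illustrated with a short sketch. This is not the authors' implementation (their R code is in Supplementary material B). It assumes one common form of the augmented inverse probability weighted (AIPW) estimator and a purity measure defined as the mean AIPW value when each child node receives its own best treatment. The function names, the simulated data, the known propensity of 1/3, and the deliberately crude constant outcome model are all hypothetical choices made for illustration.

```python
import random

def aipw_values(y, a, propensity, outcome_hat, treatments):
    """Per-subject AIPW estimates of the counterfactual outcome under each treatment.

    propensity[i][t] is an estimate of P(A = t | X_i);
    outcome_hat[i][t] is an estimate of E[Y | A = t, X_i].
    """
    values = []
    for i in range(len(y)):
        row = {}
        for t in treatments:
            indicator = 1.0 if a[i] == t else 0.0
            # Outcome-model prediction plus an inverse-probability-weighted residual.
            row[t] = outcome_hat[i][t] + indicator / propensity[i][t] * (y[i] - outcome_hat[i][t])
        values.append(row)
    return values

def best_treatment(values, idx, treatments):
    """Treatment with the largest summed AIPW value over the subjects in idx."""
    totals = {t: sum(values[i][t] for i in idx) for t in treatments}
    t_star = max(treatments, key=lambda t: totals[t])
    return t_star, totals[t_star]

def best_split(x, values, treatments, min_node=5):
    """Search thresholds on one covariate and return the split with the highest purity."""
    n = len(x)
    best = None
    for c in sorted(set(x))[:-1]:
        left = [i for i in range(n) if x[i] <= c]
        right = [i for i in range(n) if x[i] > c]
        if len(left) < min_node or len(right) < min_node:
            continue
        t_left, s_left = best_treatment(values, left, treatments)
        t_right, s_right = best_treatment(values, right, treatments)
        purity = (s_left + s_right) / n  # estimated mean outcome under this rule
        if best is None or purity > best[0]:
            best = (purity, c, t_left, t_right)
    return best

# Simulated single-stage data (hypothetical): three treatments assigned at random;
# treatment 0 helps when x <= 0 and treatment 2 helps when x > 0.
random.seed(0)
treatments = [0, 1, 2]
n = 600
x = [random.uniform(-1, 1) for _ in range(n)]
a = [random.choice(treatments) for _ in range(n)]
y = []
for xi, ai in zip(x, a):
    benefit = 2.0 if (ai == 0 and xi <= 0) or (ai == 2 and xi > 0) else 0.0
    y.append(1.0 + benefit + random.gauss(0, 0.5))

y_bar = sum(y) / n
propensity = [{t: 1.0 / 3.0 for t in treatments} for _ in range(n)]  # known by design
outcome_hat = [{t: y_bar for t in treatments} for _ in range(n)]     # crude, misspecified model

values = aipw_values(y, a, propensity, outcome_hat, treatments)
purity, c, t_left, t_right = best_split(x, values, treatments)
print(f"split at x <= {c:.3f}: left -> treatment {t_left}, right -> treatment {t_right}")
```

Because the propensities are correct here, the AIPW values remain consistent even though the outcome model is misspecified, which is the kind of robustness the abstract refers to. For multiple stages, a step like this would be applied recursively from the last stage backward, with the outcome at each earlier stage replaced by an estimate of the outcome under the later-stage rules already found.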

Article information

Ann. Appl. Stat., Volume 12, Number 3 (2018), 1914-1938.

Received: October 2016
Revised: August 2017
First available in Project Euclid: 11 September 2018


Keywords: multi-stage decision-making; personalized medicine; classification; backward induction; decision tree


Tao, Yebin; Wang, Lu; Almirall, Daniel. Tree-based reinforcement learning for estimating optimal dynamic treatment regimes. Ann. Appl. Stat. 12 (2018), no. 3, 1914–1938. doi:10.1214/18-AOAS1137.




Supplemental materials

  • Supplementary material A for article “Tree-based reinforcement learning for estimating optimal dynamic treatment regimes”. Additional simulation results for the proposed method and competing methods.
  • Supplementary material B for article “Tree-based reinforcement learning for estimating optimal dynamic treatment regimes”. R code and sample data to implement the proposed method.