Statistical Science

$\mathbf{Q}$- and $\mathbf{A}$-Learning Methods for Estimating Optimal Dynamic Treatment Regimes

Phillip J. Schulte, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian


In clinical practice, physicians make a series of treatment decisions over the course of a patient’s disease based on the patient’s baseline and evolving characteristics. A dynamic treatment regime is a set of sequential decision rules that operationalizes this process. Each rule corresponds to a decision point and dictates the next treatment action based on the accrued information. Using existing data, a key goal is estimating the optimal regime, that is, the regime that, if followed by the patient population, would yield the most favorable outcome on average. Q- and A-learning are two main approaches for this purpose. We provide a detailed account of these methods, study their performance, and illustrate them using data from a depression study.
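
As a rough illustration of the backward-induction idea behind Q-learning, the following Python sketch fits a two-decision regime on simulated data. It is a minimal sketch under stated assumptions, not the authors' implementation: treatments are binary (coded 0/1) and randomized, each Q-function is a linear working model in the most recent covariate only, and every name here (`design`, `fit_ols`, `rule`, `x1`, `a1`, `x2`, `a2`, `y`) as well as the data-generating model is hypothetical.

```python
import numpy as np

def design(h, a):
    """Working-model design matrix with columns 1, h, a, a*h."""
    return np.column_stack([np.ones_like(h), h, a, a * h])

def fit_ols(X, y):
    """Ordinary least squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def rule(beta, h):
    """Estimated decision rule: choose treatment 1 when the fitted
    treatment contrast beta[2] + beta[3]*h is positive."""
    return (beta[2] + beta[3] * h > 0).astype(int)

# Hypothetical simulated data (assumption, not the depression study).
rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)                      # baseline covariate
a1 = rng.integers(0, 2, size=n)              # randomized first treatment
x2 = 0.5 * x1 + a1 * x1 + rng.normal(scale=0.5, size=n)  # intermediate covariate
a2 = rng.integers(0, 2, size=n)              # randomized second treatment
y = x2 + a2 * x2 + rng.normal(scale=0.5, size=n)         # final outcome, larger is better

# Decision 2: regress the outcome on the decision-2 information and treatment.
beta2 = fit_ols(design(x2, a2), y)

# Pseudo-outcome: fitted outcome if the better second treatment were chosen.
q2_if_0 = design(x2, np.zeros_like(x2)) @ beta2
q2_if_1 = design(x2, np.ones_like(x2)) @ beta2
v2 = np.maximum(q2_if_0, q2_if_1)

# Decision 1: regress the pseudo-outcome on baseline information and treatment.
beta1 = fit_ols(design(x1, a1), v2)

print("decision-2 rule at x2 = -1, 1:", rule(beta2, np.array([-1.0, 1.0])))
print("decision-1 rule at x1 = -1, 1:", rule(beta1, np.array([-1.0, 1.0])))
```

The fitted rules are read off the treatment-by-covariate contrast at each decision point. The first-stage working model in this sketch is deliberately simple and is not exactly correct for the simulated data, so its coefficients should be read as an approximation only.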

Article information

Statist. Sci., Volume 29, Number 4 (2014), 640-661.

First available in Project Euclid: 15 January 2015

Keywords: Advantage learning; bias-variance trade-off; model misspecification; personalized medicine; potential outcomes; sequential decision-making


Schulte, Phillip J.; Tsiatis, Anastasios A.; Laber, Eric B.; Davidian, Marie. $\mathbf{Q}$- and $\mathbf{A}$-Learning Methods for Estimating Optimal Dynamic Treatment Regimes. Statist. Sci. 29 (2014), no. 4, 640--661. doi:10.1214/13-STS450.


  • Almirall, D., Ten Have, T. and Murphy, S. A. (2010). Structural nested mean models for assessing time-varying effect moderation. Biometrics 66 131–139.
  • Bather, J. (2000). Decision Theory: An Introduction to Dynamic Programming and Sequential Decisions. Wiley, Chichester.
  • Blatt, D., Murphy, S. A. and Zhu, J. (2004). A-learning for approximate planning. Technical Report 04-63, The Methodology Center, Pennsylvania State Univ., State College, PA.
  • Chakraborty, B. and Moodie, E. E. M. (2012). Estimating optimal dynamic treatment regimes with shared decision rules across stages: An extension of Q-learning. Unpublished manuscript.
  • Chakraborty, B., Murphy, S. and Strecher, V. (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Stat. Methods Med. Res. 19 317–343.
  • Craven, M. W. and Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, 8 24–30. MIT Press, Cambridge, MA.
  • Henderson, R., Ansell, P. and Alshibani, D. (2010). Regret-regression for optimal dynamic treatment regimes. Biometrics 66 1192–1201.
  • Laber, E. B. and Murphy, S. A. (2011). Adaptive confidence intervals for the test error in classification. J. Amer. Statist. Assoc. 106 904–913.
  • Laber, E. B., Qian, M., Lizotte, D. J. and Murphy, S. A. (2010). Statistical inference in dynamic treatment regimes. Preprint. Available at arXiv:1006.5831v1.
  • Lavori, P. W. and Dawson, R. (2000). A design for testing clinical strategies: Biased adaptive within-subject randomization. J. Roy. Statist. Soc. Ser. A 163 29–38.
  • Moodie, E. E. M., Richardson, T. S. and Stephens, D. A. (2007). Demystifying optimal dynamic treatment regimes. Biometrics 63 447–455.
  • Moodie, E. E. M. and Richardson, T. S. (2010). Estimating optimal dynamic regimes: Correcting bias under the null. Scand. J. Stat. 37 126–146.
  • Murphy, S. A. (2003). Optimal dynamic treatment regimes. J. R. Stat. Soc. Ser. B Stat. Methodol. 65 331–366.
  • Murphy, S. A. (2005). An experimental design for the development of adaptive treatment strategies. Stat. Med. 24 1455–1481.
  • Murphy, S. A., Lynch, K. G., Oslin, D., McKay, J. R. and Ten Have, T. (2007a). Developing adaptive treatment strategies in substance abuse research. Drug Alcohol Depend. 88S S24–S30.
  • Murphy, S. A., Oslin, D. W., Rush, A. J. and Zhu, J. (2007b). Methodological challenges in constructing effective treatment sequences for chronic psychiatric disorders. Neuropsychopharmacology 32 257–262.
  • Nahum-Shani, I., Qian, M., Almirall, D., Pelham, W. E., Gnagy, B., Fabiano, G., Waxmonsky, J., Yu, J. and Murphy, S. A. (2010). Q-Learning: A data analysis method for constructing adaptive interventions. Technical report.
  • Orellana, L., Rotnitzky, A. and Robins, J. M. (2010). Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, Part I: Main content. Int. J. Biostat. 6 Art. 8, 49.
  • Richardson, T. S. and Robins, J. M. (2013). Single world intervention graphs (SWIGs): A unification of the counterfactual and graphical approaches to causality. Available at
  • Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period—Application to control of the healthy worker survivor effect. Math. Modelling 7 1393–1512.
  • Robins, J. M. (1994). Correcting for non-compliance in randomized trials using structural nested mean models. Comm. Statist. Theory Methods 23 2379–2412.
  • Robins, J. M. (2004). Optimal structural nested models for optimal sequential decisions. In Proceedings of the Second Seattle Symposium in Biostatistics. Lecture Notes in Statist. 179 189–326. Springer, New York.
  • Robins, J., Orellana, L. and Rotnitzky, A. (2008). Estimation and extrapolation of optimal treatment and testing strategies. Stat. Med. 27 4678–4721.
  • Rosenblum, M. and van der Laan, M. J. (2009). Using regression models to analyze randomized trials: Asymptotically valid hypothesis tests despite incorrectly specified models. Biometrics 65 937–945.
  • Rosthøj, S., Fullwood, C., Henderson, R. and Stewart, S. (2006). Estimation of optimal dynamic anticoagulation regimes from observational data: A regret-based approach. Stat. Med. 25 4197–4215.
  • Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6 34–58.
  • Rush, A. J., Trivedi, M. H., Ibrahim, H. M., Carmody, T. J., Arnow, B., Klein, D. N., Markowitz, J. C., Ninan, P. T., Kornstein, S., Manber, R., Thase, M. E., Kocsis, J. H. and Keller, M. B. (2003). The 16-item quick inventory of depressive symptomatology (QIDS), clinician rating (QIDS-C), and self-report (QIDS-SR): A psychometric evaluation in patients with chronic major depression. Biol. Psychiatry 54 573–583.
  • Rush, A. J., Fava, M., Wisniewski, S. R., Lavori, P. W., Trivedi, M. H., Sackeim, H. A., Thase, M. E., Nierenberg, A. A., Quitkin, F. M., Kashner, T. M., Kupfer, D. J., Rosenbaum, J. F., Alpert, J., Stewart, J. W., McGrath, P. J., Biggs, M. M., Shores-Wilson, K., Lebowitz, B. D., Ritz, L. and Niederehe, G. (2004). Sequenced treatment alternatives to relieve depression (STAR*D): Rationale and design. Control Clin. Trials 25 119–142.
  • Schulte, P. J., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2014). Supplement to “Q- and A-learning Methods for Estimating Optimal Dynamic Treatment Regimes.” DOI:10.1214/13-STS450SUPP.
  • Shortreed, S. M., Laber, E., Lizotte, D. J., Stroup, T. S., Pineau, J. and Murphy, S. A. (2011). Informing sequential clinical decision-making through reinforcement learning: An empirical study. Mach. Learn. 84 109–136.
  • Song, R., Wang, W., Zeng, D. and Kosorok, M. R. (2010). Penalized Q-learning for dynamic treatment regimes. Preprint. Available at arXiv:1108.5338v1.
  • Thall, P. F., Millikan, R. E. and Sung, H. G. (2000). Evaluating multiple treatment courses in clinical trials. Stat. Med. 19 1011–1028.
  • Thall, P. F., Sung, H.-G. and Estey, E. H. (2002). Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. J. Amer. Statist. Assoc. 97 29–39.
  • Thall, P. F., Wooten, L. H., Logothetis, C. J., Millikan, R. E. and Tannir, N. M. (2007). Bayesian and frequentist two-stage treatment strategies based on sequential failure times subject to interval censoring. Stat. Med. 26 4687–4702.
  • van der Laan, M. J. and Petersen, M. L. (2007). Causal effect models for realistic individualized treatment and intention to treat rules. Int. J. Biostat. 3 Art. 3, 54.
  • Watkins, C. J. C. H. (1989). Learning from delayed rewards. Ph.D. thesis. King’s College, Cambridge, UK.
  • Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Mach. Learn. 8 279–292.
  • Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M. and Laber, E. B. (2012a). Estimating optimal treatment regimes from a classification perspective. Stat 1 103–114.
  • Zhang, B., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2012b). A robust method for estimating optimal treatment regimes. Biometrics 68 1010–1018.
  • Zhang, B., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2013). Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika 100 681–694.
  • Zhao, Y., Kosorok, M. R. and Zeng, D. (2009). Reinforcement learning design for cancer clinical trials. Stat. Med. 28 3294–3315.
  • Zhao, Y., Zeng, D., Rush, A. J. and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. J. Amer. Statist. Assoc. 107 1106–1118.
  • Zhao, Y., Zeng, D., Laber, E. B. and Kosorok, M. R. (2013). New statistical learning methods for estimating optimal dynamic treatment regimes. Unpublished manuscript.

Supplemental materials

  • Supplementary material: Supplement to “Q- and A-Learning Methods for Estimating Optimal Dynamic Treatment Regimes.” Due to space constraints, technical details and further results are given in the supplementary document, Schulte et al. (2014).