The Annals of Statistics

On Bayesian index policies for sequential resource allocation

Emilie Kaufmann



This paper is about index policies for minimizing (frequentist) regret in a stochastic multi-armed bandit model, inspired by a Bayesian view on the problem. Our main contribution is to prove that the Bayes-UCB algorithm, which relies on quantiles of posterior distributions, is asymptotically optimal when the reward distributions belong to a one-dimensional exponential family, for a large class of prior distributions. We also show that the Bayesian literature gives new insight into the kinds of exploration rates that can be used in frequentist, UCB-type algorithms. Indeed, approximations of the Bayesian optimal solution or of the Finite-Horizon Gittins indices provide a justification for the kl-UCB$^{+}$ and kl-UCB-H$^{+}$ algorithms, whose asymptotic optimality is also established.
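To illustrate the quantile-based index described in the abstract, here is a minimal sketch of Bayes-UCB for Bernoulli rewards with uniform Beta(1,1) priors. The function name, the Monte Carlo quantile estimate (an exact Beta quantile, e.g. via `scipy.stats.beta.ppf`, would normally be used), and the particular exploration exponent `c` are illustrative assumptions, not the paper's exact specification.

```python
import math
import random


def bayes_ucb_bernoulli(probs, horizon, c=5, n_mc=2000, seed=0):
    """Sketch of Bayes-UCB for Bernoulli bandits (hypothetical helper).

    At round t, each arm's index is the quantile of order
    1 - 1/(t * log(horizon)**c) of its Beta posterior; the arm with the
    largest index is pulled.  Quantiles are estimated here by sorting
    Monte Carlo draws from the posterior, purely to stay stdlib-only.
    """
    rng = random.Random(seed)
    k = len(probs)
    succ = [0] * k    # observed successes per arm
    fail = [0] * k    # observed failures per arm
    pulls = [0] * k   # number of draws of each arm
    for t in range(1, horizon + 1):
        level = 1.0 - 1.0 / (t * math.log(horizon) ** c)
        indices = []
        for a in range(k):
            # Beta(1 + successes, 1 + failures) posterior of arm a;
            # empirical quantile of order `level` from n_mc samples.
            samples = sorted(
                rng.betavariate(1 + succ[a], 1 + fail[a]) for _ in range(n_mc)
            )
            indices.append(samples[min(n_mc - 1, int(level * n_mc))])
        a = max(range(k), key=indices.__getitem__)
        reward = 1 if rng.random() < probs[a] else 0
        succ[a] += reward
        fail[a] += 1 - reward
        pulls[a] += 1
    return pulls
```

With a large gap between the arm means, the sketch should concentrate most pulls on the best arm while the shrinking tail probability 1/(t log(horizon)^c) keeps some exploration early on.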

Article information

Ann. Statist., Volume 46, Number 2 (2018), 842-865.

Received: September 2016
Revised: March 2017
First available in Project Euclid: 3 April 2018


Subjects: Primary 62L05: Sequential design

Keywords: Multi-armed bandit problems; Bayesian methods; upper-confidence bounds; Gittins indices


Kaufmann, Emilie. On Bayesian index policies for sequential resource allocation. Ann. Statist. 46 (2018), no. 2, 842--865. doi:10.1214/17-AOS1569.



  • [1] Agrawal, R. (1995). Sample mean based index policies with $O(\log n)$ regret for the multi-armed bandit problem. Adv. in Appl. Probab. 27 1054–1078.
  • [2] Agrawal, S. and Goyal, N. (2012). Analysis of Thompson sampling for the multi-armed bandit problem. In Proceedings of the 25th Conference on Learning Theory.
  • [3] Agrawal, S. and Goyal, N. (2013). Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th Conference on Artificial Intelligence and Statistics.
  • [4] Audibert, J.-Y. and Bubeck, S. (2010). Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11 2785–2836.
  • [5] Audibert, J.-Y., Munos, R. and Szepesvári, C. (2009). Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoret. Comput. Sci. 410 1876–1902.
  • [6] Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47 235–256.
  • [7] Bellman, R. (1956). A problem in the sequential design of experiments. Sankhyā 16 221–229.
  • [8] Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman & Hall, London.
  • [9] Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford Univ. Press, Oxford.
  • [10] Bradt, R. N., Johnson, S. M. and Karlin, S. (1956). On sequential designs for maximizing the sum of $n$ observations. Ann. Math. Stat. 27 1060–1074.
  • [11] Brochu, E., Cora, V. M. and De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, Univ. British Columbia.
  • [12] Bubeck, S. and Liu, C.-Y. (2013). Prior-free and prior-dependent regret bounds for Thompson sampling. In Advances in Neural Information Processing Systems.
  • [13] Burnetas, A. N. and Katehakis, M. N. (2003). Asymptotic Bayes analysis for the finite-horizon one-armed-bandit problem. Probab. Engrg. Inform. Sci. 17 53–82.
  • [14] Cappé, O., Garivier, A., Maillard, O.-A., Munos, R. and Stoltz, G. (2013). Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41 1516–1541.
  • [15] Chang, F. and Lai, T. L. (1987). Optimal stopping and dynamic allocation. Adv. in Appl. Probab. 19 829–853.
  • [16] Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems.
  • [17] Garivier, A. and Cappé, O. (2011). The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Conference on Learning Theory.
  • [18] Ginebra, J. and Clayton, M. K. (1999). Small-sample performance of Bernoulli two-armed bandit Bayesian strategies. J. Statist. Plann. Inference 79 107–122.
  • [19] Gittins, J. C. (1979). Bandit processes and dynamic allocation indices. J. Roy. Statist. Soc. Ser. B 41 148–177.
  • [20] Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-Armed Bandit Allocation Indices, 2nd ed. Wiley, Chichester.
  • [21] Honda, J. and Takemura, A. (2010). An asymptotically optimal bandit algorithm for bounded support models. In Proceedings of the 23rd Conference on Learning Theory.
  • [22] Honda, J. and Takemura, A. (2014). Optimality of Thompson sampling for Gaussian bandits depends on priors. In Proceedings of the 17th Conference on Artificial Intelligence and Statistics.
  • [23] Kaufmann, E. (2018). Supplement to “On Bayesian index policies for sequential resource allocation.” DOI:10.1214/17-AOS1569SUPP.
  • [24] Kaufmann, E., Cappé, O. and Garivier, A. (2012). On Bayesian upper-confidence bounds for bandit problems. In Proceedings of the 15th Conference on Artificial Intelligence and Statistics.
  • [25] Kaufmann, E., Korda, N. and Munos, R. (2012). Thompson sampling: An asymptotically optimal finite-time analysis. In Algorithmic Learning Theory. Lecture Notes in Computer Science 7568 199–213. Springer, Heidelberg.
  • [26] Korda, N., Kaufmann, E. and Munos, R. (2013). Thompson sampling for 1-dimensional exponential family bandits. In Advances in Neural Information Processing Systems.
  • [27] Lai, T. L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15 1091–1114.
  • [28] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math. 6 4–22.
  • [29] Lattimore, T. (2016). Regret analysis of the finite-horizon Gittins index strategy for multi-armed bandits. In Proceedings of the 29th Conference on Learning Theory, COLT 2016 1214–1245.
  • [30] Liu, C.-Y. and Li, L. (2016). On the prior sensitivity of Thompson sampling. In Algorithmic Learning Theory. Lecture Notes in Computer Science 9925 321–336. Springer, Cham.
  • [31] Niño-Mora, J. (2011). Computing a classic index for finite-horizon bandits. INFORMS J. Comput. 23 254–267.
  • [32] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.
  • [33] Reverdy, P., Srivastava, V. and Leonard, N. E. (2014). Modeling human decision making in generalized Gaussian multiarmed bandits. Proc. IEEE 102 544–571.
  • [34] Russo, D. and Van Roy, B. (2014). Learning to optimize via posterior sampling. Math. Oper. Res. 39 1221–1243.
  • [35] Russo, D. and Van Roy, B. (2014). Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems.
  • [36] Scott, S. L. (2010). A modern Bayesian look at the multi-armed bandit. Appl. Stoch. Models Bus. Ind. 26 639–658.
  • [37] Srinivas, N., Krause, A., Kakade, S. M. and Seeger, M. W. (2010). Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the International Conference on Machine Learning.
  • [38] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 285–294.

Supplemental materials

  • Technical proofs. The supplemental article contains the proofs of some results stated in the paper.