Electronic Journal of Statistics

Randomized allocation with arm elimination in a bandit problem with covariates

Wei Qian and Yuhong Yang

Full-text: Open access


Motivated by applications in personalized web services and clinical research, we consider a multi-armed bandit problem in which the mean reward of each arm is associated with covariates. We propose a multi-stage randomized allocation algorithm with arm elimination that combines flexibility in reward function modeling with a theoretical guarantee of the minimax rate for cumulative regret. When the smoothness parameter of the reward functions is unknown, the algorithm is equipped with a histogram-based smoothness parameter selector that uses Lepski’s method, and it is shown to maintain the minimax regret rate up to a logarithmic factor under a “self-similarity” condition.

Article information

Electron. J. Statist., Volume 10, Number 1 (2016), 242-270.

Received: October 2014
First available in Project Euclid: 17 February 2016

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1455715962

Digital Object Identifier
doi:10.1214/15-EJS1104

Primary: 62G08: Nonparametric regression
Secondary: 62L05: Sequential design

Keywords: Contextual bandit problem; MABC; nonparametric bandit; adaptive estimation; regret bound


Qian, Wei; Yang, Yuhong. Randomized allocation with arm elimination in a bandit problem with covariates. Electron. J. Statist. 10 (2016), no. 1, 242-270. doi:10.1214/15-EJS1104. https://projecteuclid.org/euclid.ejs/1455715962


