Advances in Applied Probability

Optimal learning with non-Gaussian rewards

Zi Ding and Ilya O. Ryzhov


Abstract

We propose a novel theoretical characterization of the optimal 'Gittins index' policy in multi-armed bandit problems with non-Gaussian, infinitely divisible reward distributions. We first construct a continuous-time, conditional Lévy process that probabilistically interpolates the sequence of discrete-time rewards. When the rewards are Gaussian, this approach yields a direct connection to the convenient time-change properties of Brownian motion. Although no such device is available in general in the non-Gaussian case, we use optimal stopping theory to characterize the value of the optimal policy as the solution to a free-boundary partial integro-differential equation (PIDE). We state the free-boundary PIDE in explicit form in the specific settings of exponential and Poisson rewards. We also prove continuity and monotonicity properties of the Gittins index in these two problems, and discuss how the PIDE can be solved numerically to find the optimal index value for a given belief state.
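To illustrate the kind of index computation the abstract describes, here is a minimal sketch for the classical Bernoulli-reward case with a conjugate Beta belief state (not the exponential or Poisson settings treated in the paper). It uses the standard retirement (calibration) characterization of the Gittins index: binary search over a retirement reward, with a truncated-horizon dynamic program standing in for the free-boundary problem. The function name, the horizon truncation, and the constant-mean tail approximation are our own choices, not the paper's method.

```python
def gittins_index_bernoulli(alpha, beta, gamma=0.9, horizon=100, tol=1e-6):
    """Approximate the Gittins index of a Beta(alpha, beta) belief state
    for Bernoulli rewards, via the retirement formulation: find the
    retirement reward at which the decision maker is indifferent between
    retiring and continuing to play the arm."""

    def continuing_is_better(lam):
        retire = lam / (1.0 - gamma)  # value of retiring immediately
        V = {}
        # Backward induction over a truncated horizon; the states
        # reachable after t pulls are (alpha + successes, beta + failures).
        for t in range(horizon, -1, -1):
            newV = {}
            for i in range(t + 1):
                a, b = alpha + i, beta + (t - i)
                p = a / (a + b)  # posterior mean success probability
                if t == horizon:
                    # Crude tail: keep playing forever at the current mean.
                    cont = p / (1.0 - gamma)
                else:
                    cont = (p * (1.0 + gamma * V[(a + 1, b)])
                            + (1.0 - p) * gamma * V[(a, b + 1)])
                newV[(a, b)] = max(retire, cont)
            V = newV
        return V[(alpha, beta)] > retire + tol

    # Bernoulli rewards lie in [0, 1], so the index does as well.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if continuing_is_better(mid):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For a uniform Beta(1, 1) prior and gamma = 0.9, the computed index exceeds the myopic mean of 0.5, reflecting the exploration bonus that the Gittins index attaches to an uncertain arm.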

Article information

Source
Adv. in Appl. Probab., Volume 48, Number 1 (2016), 112-136.

Dates
First available in Project Euclid: 8 March 2016

Permanent link to this document
https://projecteuclid.org/euclid.aap/1457466158

Mathematical Reviews number (MathSciNet)
MR3473570

Zentralblatt MATH identifier
1345.60039

Subjects
Primary: 60G40: Stopping times; optimal stopping problems; gambling theory [See also 62L15, 91A60]
Secondary: 60J75: Jump processes

Keywords
Gittins indices; optimal learning; multi-armed bandit; non-Gaussian rewards; probabilistic interpolation

Citation

Ding, Zi; Ryzhov, Ilya O. Optimal learning with non-Gaussian rewards. Adv. in Appl. Probab. 48 (2016), no. 1, 112–136. https://projecteuclid.org/euclid.aap/1457466158


