## Electronic Journal of Statistics

### Learning vs earning trade-off with missing or censored observations: The two-armed Bayesian nonparametric beta-Stacy bandit problem

#### Abstract

Existing Bayesian nonparametric methodologies for bandit problems focus on exact observations, leaving a gap in those bandit applications where censored observations are crucial. We address this gap by extending a Bayesian nonparametric two-armed bandit problem to right-censored data, where each arm is generated from a beta-Stacy process as defined by Walker and Muliere (1997). We first show some properties of the expected advantage of choosing one arm over the other, namely the monotonicity in the arm response and, limited to the case of continuous state space, the continuity in the right-censored arm response. We partially characterize optimal strategies by proving the existence of stay-with-a-winner and stay-with-a-winner/switch-on-a-loser break-even points, under non-restrictive conditions that include the special cases of the simple homogeneous process and the Dirichlet process. Numerical estimations and simulations for a variety of discrete and continuous state space settings are presented to illustrate the performance and flexibility of our framework.

#### Article information

Source
Electron. J. Statist., Volume 11, Number 2 (2017), 3368-3406.

Dates
First available in Project Euclid: 6 October 2017

Permanent link to this document
https://projecteuclid.org/euclid.ejs/1507255609

Digital Object Identifier
doi:10.1214/17-EJS1342

Mathematical Reviews number (MathSciNet)
MR3709858

Zentralblatt MATH identifier
1377.62033

Subjects
Primary: 62C10: Bayesian problems; characterization of Bayes procedures
Secondary: 62N01: Censored data models

#### Citation

Peluso, Stefano; Mira, Antonietta; Muliere, Pietro. Learning vs earning trade-off with missing or censored observations: The two-armed Bayesian nonparametric beta-Stacy bandit problem. Electron. J. Statist. 11 (2017), no. 2, 3368--3406. doi:10.1214/17-EJS1342. https://projecteuclid.org/euclid.ejs/1507255609

#### References

• Al Labadi, L. and Zarepour, M. (2013). A Bayesian nonparametric goodness of fit test for right censored data based on approximate samples from the beta-Stacy process. The Canadian Journal of Statistics 41 466-487.
• Antoniak, C. (1974). Mixtures of Dirichlet Processes with Applications to Bayesian Nonparametric Problems. Annals of Statistics 2 1152-1174.
• Battiston, M., Favaro, S. and Teh, Y. W. (2016). Multi-armed bandit for species discovery: a Bayesian nonparametric approach. Journal of the American Statistical Association. Forthcoming.
• Bellman, R. (1956). A Problem in the Sequential Design of Experiments. Sankhya 16 221-229.
• Berry, D. A. (1972). A Bernoulli Two-Armed Bandit. The Annals of Mathematical Statistics 43 871-897.
• Berry, D. A. and Fristedt, B. (1979). Bernoulli One-Armed Bandits - Arbitrary Discount Sequences. The Annals of Statistics 7 1086-1105.
• Berry, D. A. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, New York.
• Bradt, R. N., Johnson, S. M. and Karlin, S. (1956). On Sequential Designs for Maximizing the Sum of n Observations. The Annals of Mathematical Statistics 33 847-856.
• Caro, F. and Yoo, O. S. (2010). Indexability of Bandit Problems with Response Delays. Probability in the Engineering and Informational Sciences 24 349-374.
• Cesa-Bianchi, N. and Fischer, P. (1998). Finite-time regret bounds for the multiarmed bandit problem. In Proceedings of the 15th International Conference on Machine Learning 100-108.
• Chakravorty, J. and Mahajan, A. (2014). Multi-armed bandits, Gittins index, and its calculation. In Methods and Applications of Statistics in Clinical Trials: Planning, Analysis, and Inferential Methods, Volume 2 416-435.
• Chattopadhyay, M. K. (1994). Two-Armed Dirichlet Bandits With Discounting. The Annals of Statistics 22 1212-1221.
• Chernoff, H. (1968). Optimal Stochastic Control. Sankhya 30 221-252.
• Clayton, M. K. and Berry, D. A. (1985). Bayesian Nonparametric Bandits. The Annals of Statistics 13 1523-1534.
• De Blasi, P. (2007). Simulation of the Beta-Stacy Process with Application to Analysis of Censored Data. In Encyclopedia of Statistics in Quality and Reliability (F. Ruggeri, R. S. Kenett and F. Faltin, eds.) 1814-1819.
• de Finetti, B. (1937). La prévision: ses lois logiques, ses sources subjectives. Annales de l’Institut Henri Poincaré 7 1-68.
• de Finetti, B. (1938). Sur la condition d’equivalence partielle, VI Colloque Geneve. Acta. Sci. Ind. Paris 739 5-18.
• de Finetti, B. (1959). La probabilità e la statistica nei rapporti con l’induzione, secondo i diversi punti di vista. Atti corso CIME su Induzione e Statistica, Varenna.
• Doksum, K. A. (1974). Tailfree and neutral random probabilities and their posterior distributions. Annals of Probability 2 183-201.
• Eick, S. G. (1985). Two-armed bandits with delayed responses. University of Minnesota Statistics Technical Report, 456.
• Eick, S. G. (1988a). The two-armed bandit with delayed responses. The Annals of Statistics 16 254-264.
• Eick, S. G. (1988b). Gittins procedures for bandits with delayed responses. Journal of the Royal Statistical Society, Series B 50 125-132.
• Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1 209-230.
• Ferguson, T. S. and Phadia, E. G. (1979). Bayesian Nonparametric Estimation Based on Censored Data. The Annals of Statistics 7 163-186.
• Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. Available at https://hal.archives-ouvertes.fr/hal-00281392.
• Gill, R. D. and Johansen, S. (1990). A Survey of Product Integration with a View Toward Application in Survival Analysis. The Annals of Statistics 18 1501-1555.
• Gittins, J. C. (1979). Bandit Processes and Dynamic Allocation Indices (with discussion). Journal of the Royal Statistical Society, Series B 41 148-177.
• Gittins, J., Glazebrook, K. and Weber, R. (2011). Multi-armed Bandit Allocation Indices. John Wiley & Sons, Chichester.
• Guha, S., Munagala, K. and Pál, M. (2013). Multi-armed bandit problems with delayed feedback. Available at https://arxiv.org/abs/1306.3525.
• Hardwick, J., Oehmke, R. and Stout, Q. F. (1998). Adaptive allocation in the presence of missing outcomes. Computing Science and Statistics 30 219-223.
• Hardwick, J., Oehmke, R. and Stout, Q. F. (2001). Optimal adaptive designs for delayed response models: exponential case. In MODA6: Model Oriented Data Analysis 127-134.
• Hardwick, J., Oehmke, R. and Stout, Q. F. (2006). New adaptive designs for delayed response models. Journal of Statistical Planning and Inference 136 1940-1955.
• Langford, J. and Zhang, T. (2008). The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20 817-824.
• Muliere, P., Bulla, P. and Walker, S. (2007). Bayesian Nonparametric Estimation of Bivariate Survival Function. Statistica Sinica 17 427-444.
• Muliere, P., Paganoni, A. M. and Secchi, P. (2006). A randomly reinforced urn. Journal of Statistical Planning and Inference 136 1853-1874.
• Nash, P. (1973). Optimal Allocation of Resources Between Research Projects. Ph.D. thesis, Cambridge Univ., England.
• Phadia, E. G. (2013). Prior Processes and Their Applications. Springer-Verlag, Berlin.
• Robbins, H. (1952). Some Aspects of the Sequential Design of Experiments. Bulletin of the American Mathematical Society 58 527-535.
• Susarla, V. and Van Ryzin, J. (1976). Nonparametric Bayesian estimation of survival curves from incomplete observations. Journal of the American Statistical Association 71 897-902.
• Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts.
• Tokic, M. (2010). Adaptive $\epsilon$-greedy exploration in reinforcement learning based on value differences. In KI 2010: Advances in Artificial Intelligence, Lecture Notes in Computer Science 203-210.
• Walker, S. and Muliere, P. (1997). Beta-Stacy Processes and a Generalization of the Pólya-Urn Scheme. The Annals of Statistics 25 1762-1780.
• Walker, S. and Muliere, P. (2003). A Bivariate Dirichlet Process. Statistics and Probability Letters 64 1-7.
• Wang, X. (2000). A bandit process with delayed responses. Statistics & Probability Letters 48 303-307.
• Wang, X. and Bickis, M. G. (2003). One-armed bandit models with continuous and delayed responses. Mathematical Methods of Operations Research 58 209-219.
• Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge Univ., England.
• Whittle, P. (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability 25 287-298.
• Yu, Y. (2011). Prior Ordering and Monotonicity in Dirichlet Bandits. Working paper.