Open Access
Reinforcement learning from comparisons: Three alternatives are enough, two are not
Benoît Laslier, Jean-François Laslier
Ann. Appl. Probab. 27(5): 2907-2925 (October 2017). DOI: 10.1214/16-AAP1271

Abstract

This paper deals with two generalizations of the Pólya urn model where, instead of sampling one ball from the urn at each time step, we sample two or three balls. The processes are defined on the basis of the problem of finding the best alternative using pairwise comparisons which are not necessarily transitive: they can be thought of as evolutionary processes that tend to reinforce currently efficient alternatives. The two processes exhibit different behaviors: with three balls sampled, we prove almost sure convergence towards the unique optimal solution of the comparisons problem, while, in some cases, the process with two balls sampled almost surely has no limit. This is an example of a natural reinforcement model without exchangeability whose asymptotic behavior can be precisely characterized.
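The abstract does not spell out the reinforcement rule, so the following is only a hypothetical sketch of one natural dynamics consistent with its description: balls are colored by alternative, a few balls are drawn in proportion to current counts, and an alternative that beats every other distinct alternative in the draw (under a fixed, possibly intransitive tournament `beats`) receives an extra ball. The function name, parameters, and the specific rule are illustrative assumptions, not the authors' definition.

```python
import random

def simulate_urn(beats, steps=5000, sample_size=3, seed=0):
    """Simulate a hypothesized multi-ball urn process.

    beats[a][b] is True iff alternative a beats alternative b in the
    (possibly intransitive) tournament. At each step, `sample_size` balls
    are drawn with probabilities proportional to current counts; if one
    sampled alternative beats every other distinct sampled alternative,
    one ball of its colour is added. (Assumed rule, for illustration only.)
    """
    rng = random.Random(seed)
    n = len(beats)
    counts = [1] * n  # start with one ball of each colour
    for _ in range(steps):
        sample = set(rng.choices(range(n), weights=counts, k=sample_size))
        for a in sample:
            if all(a == b or beats[a][b] for b in sample):
                counts[a] += 1  # reinforce the local winner
                break
    return counts

# Cyclic (intransitive) tournament: 0 beats 1, 1 beats 2, 2 beats 0.
beats = [
    [False, True, False],
    [False, False, True],
    [True, False, False],
]
print(simulate_urn(beats))
```

With `sample_size=3` and the cycle above, a draw containing all three colours has no winner and adds nothing, while draws of one or two distinct colours reinforce as in a pairwise comparison; varying `sample_size` between 2 and 3 gives a toy version of the two processes contrasted in the paper.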

Citation


Benoît Laslier, Jean-François Laslier. "Reinforcement learning from comparisons: Three alternatives are enough, two are not." Ann. Appl. Probab. 27(5): 2907-2925, October 2017. https://doi.org/10.1214/16-AAP1271

Information

Received: 1 June 2016; Revised: 1 October 2016; Published: October 2017
First available in Project Euclid: 3 November 2017

zbMATH: 1379.60081
MathSciNet: MR3719949
Digital Object Identifier: 10.1214/16-AAP1271

Subjects:
Primary: 60J20, 91A22, 91E40

Keywords: learning, tournament, urn process

Rights: Copyright © 2017 Institute of Mathematical Statistics
