Open Access
Settling the sample complexity of model-based offline reinforcement learning
Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi, Yuting Wei
Ann. Statist. 52(1): 233-260 (February 2024). DOI: 10.1214/23-AOS2342

Abstract

This paper is concerned with offline reinforcement learning (RL), which learns using precollected data without further exploration. Effective offline RL would be able to accommodate distribution shift and limited data coverage. However, prior results either suffer from suboptimal sample complexities or incur a high burn-in cost to reach sample optimality, thus posing an impediment to efficient offline RL in sample-starved applications.

We demonstrate that the model-based (or “plug-in”) approach achieves minimax-optimal sample complexity without any burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a γ-discounted infinite-horizon (resp., finite-horizon) MDP with S states and effective horizon 1/(1 − γ) (resp., horizon H), and suppose the distribution shift of the data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\mathrm{clipped}}$. We prove that model-based offline RL yields ε-accuracy with a sample complexity of

$$\frac{S\,C^{\star}_{\mathrm{clipped}}}{(1-\gamma)^{3}\varepsilon^{2}} \;\;\text{(infinite-horizon MDPs)}, \qquad \frac{H^{4}S\,C^{\star}_{\mathrm{clipped}}}{\varepsilon^{2}} \;\;\text{(finite-horizon MDPs)},$$

up to logarithmic factors, which is minimax optimal for the entire ε-range. The proposed algorithms are “pessimistic” variants of value iteration with Bernstein-style penalties and do not require sophisticated variance reduction. Our analysis framework builds on delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.
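To make the algorithmic template concrete, the sketch below implements a generic pessimistic (lower-confidence-bound) variant of value iteration for a tabular γ-discounted MDP estimated from offline data, using a Bernstein-style penalty. This is an illustrative reading of the approach rather than the paper's exact procedure: the bonus constant c_bonus, the confidence level delta, the reward range [0, 1], and the fixed iteration count are assumptions chosen for simplicity, not the calibrated choices analyzed in the paper.

```python
# Illustrative sketch only (not the paper's exact algorithm or constants):
# pessimistic value iteration for a tabular gamma-discounted MDP, built from an
# empirical model of offline transitions, with a Bernstein-style penalty that
# scales with the empirical variance of the current value estimate.
import numpy as np


def pessimistic_value_iteration(transitions, S, A, gamma=0.95,
                                c_bonus=1.0, delta=1e-2, num_iters=500):
    """transitions: iterable of (s, a, r, s_next) tuples collected offline; rewards in [0, 1]."""
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r

    n = counts.sum(axis=2)                    # visit counts N(s, a)
    n_safe = np.maximum(n, 1)                 # avoid division by zero
    p_hat = counts / n_safe[:, :, None]       # empirical transition kernel
    r_hat = reward_sum / n_safe               # empirical mean reward
    log_term = np.log(S * A * num_iters / delta)
    v_max = 1.0 / (1.0 - gamma)               # range of the value function

    V = np.zeros(S)
    for _ in range(num_iters):
        EV = p_hat @ V                        # E_{s' ~ p_hat(.|s,a)}[V(s')], shape (S, A)
        var = np.maximum(p_hat @ (V ** 2) - EV ** 2, 0.0)   # empirical variance of V
        # Bernstein-style penalty: sqrt(variance * log / N) plus a lower-order term.
        bonus = c_bonus * (np.sqrt(var * log_term / n_safe)
                           + v_max * log_term / n_safe)
        # Pessimistic Bellman update: subtract the penalty, clip to the valid range,
        # and keep unvisited state-action pairs at the most pessimistic value (zero).
        Q = np.clip(r_hat + gamma * EV - bonus, 0.0, v_max)
        Q[n == 0] = 0.0
        V = Q.max(axis=1)

    return Q.argmax(axis=1), Q                # greedy policy w.r.t. the pessimistic Q


# Toy usage on random offline data from a 3-state, 2-action MDP with rewards in [0, 1].
rng = np.random.default_rng(0)
data = [(rng.integers(3), rng.integers(2), rng.random(), rng.integers(3))
        for _ in range(2000)]
policy, Q = pessimistic_value_iteration(data, S=3, A=2)
print("greedy pessimistic policy:", policy)
```

The design choice mirrored here is the variance-aware penalty: the bonus shrinks with the empirical variance of the value estimate at each state–action pair rather than with the crude 1/(1 − γ) range, which is what the abstract refers to as “Bernstein-style.”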

Acknowledgments

Y. Wei is supported by the Google Research Scholar Award, and NSF Grants CCF-2106778, DMS-2147546/2015447 and CAREER award DMS-2143215.

Y. Chen is supported by the Alfred P. Sloan Research Fellowship, the Google Research Scholar Award, the AFOSR grant FA9550-22-1-0198, the ONR grant N00014-22-1-2354 and NSF grants CCF-2221009, CCF-1907661, DMS-2014279, IIS-2218713 and IIS-2218773.

L. Shi and Y. Chi are supported by the ONR grant N00014-19-1-2404, NSF grants CCF-2106778 and DMS-2134080, and NSF CAREER award ECCS-1818571. L. Shi also gratefully acknowledges support from the Leo Finzi Memorial Fellowship, the Wei Shen and Xuehong Zhang Presidential Fellowship and the Liang Ji-Dian Graduate Fellowship at CMU.

Y. Wei is the corresponding author.

Citation


Gen Li. Laixi Shi. Yuxin Chen. Yuejie Chi. Yuting Wei. "Settling the sample complexity of model-based offline reinforcement learning." Ann. Statist. 52 (1) 233 - 260, February 2024. https://doi.org/10.1214/23-AOS2342

Information

Received: 1 February 2023; Revised: 1 November 2023; Published: February 2024
First available in Project Euclid: 7 March 2024

MathSciNet: MR4718414
Digital Object Identifier: 10.1214/23-AOS2342

Subjects:
Primary: 62C20

Keywords: distribution shift, Markov decision process, minimax optimality, offline reinforcement learning, sample complexity

Rights: Copyright © 2024 Institute of Mathematical Statistics
