Q-learning with Logarithmic Regret

International Conference on Artificial Intelligence and Statistics (AISTATS), 2021


Abstract: This paper presents the first non-asymptotic result showing that a model-free algorithm can achieve logarithmic cumulative regret for episodic tabular reinforcement learning if there exists a strictly positive sub-optimality gap in the optimal Q-function. We prove that the optimistic Q-learning algorithm studied in [Jin et al. 2018] enjoys a $\frac{SA\,\text{poly}(H)}{\Delta_{\min}}\log(SAT)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the information-theoretic lower bound in terms of $S$, $A$, and $T$ up to a $\log(SA)$ factor. We further extend our analysis to the discounted setting and obtain a similar logarithmic cumulative regret bound.
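For context, the minimum sub-optimality gap appearing in the bound follows the standard definition used in gap-dependent analyses:

$$\Delta_h(s,a) = V_h^*(s) - Q_h^*(s,a), \qquad \Delta_{\min} = \min_{h,s,a}\bigl\{\Delta_h(s,a) : \Delta_h(s,a) > 0\bigr\}.$$

Below is a minimal, illustrative sketch of the optimistic Q-learning update to which the regret analysis applies (the UCB-Hoeffding variant from [Jin et al. 2018]). The random tabular MDP, the bonus constant `c`, the failure probability `delta`, and the fixed initial state are assumptions for demonstration only, not the paper's setup.

```python
# Sketch of optimistic Q-learning with UCB-Hoeffding bonuses (Jin et al. 2018).
# The environment and constants below are illustrative assumptions.
import numpy as np

S, A, H, K = 5, 3, 4, 2000            # states, actions, horizon, episodes
T = H * K                             # total number of steps
c, delta = 1.0, 0.1                   # bonus constant and failure prob. (assumed)
iota = np.log(S * A * T / delta)      # log factor in the Hoeffding bonus

rng = np.random.default_rng(0)
# Illustrative random tabular MDP: P[h, s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(H, S, A))
R = rng.uniform(size=(H, S, A))       # mean rewards in [0, 1]

Q = np.full((H + 1, S, A), float(H))  # optimistic initialization
Q[H] = 0.0                            # value after the last step is zero
N = np.zeros((H, S, A), dtype=int)    # visit counts

for k in range(K):
    s = 0                             # fixed initial state (assumed)
    for h in range(H):
        a = int(np.argmax(Q[h, s]))               # act greedily w.r.t. optimistic Q
        r = R[h, s, a]
        s_next = rng.choice(S, p=P[h, s, a])
        N[h, s, a] += 1
        t = N[h, s, a]
        alpha = (H + 1) / (H + t)                 # learning rate from Jin et al. 2018
        bonus = c * np.sqrt(H ** 3 * iota / t)    # UCB-Hoeffding exploration bonus
        V_next = min(H, Q[h + 1, s_next].max())   # truncated optimistic value
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + V_next + bonus)
        s = s_next
```

The optimistic initialization and the added bonus keep $Q$ an upper bound on $Q^*$ with high probability, which is what drives exploration; the gap-dependent analysis then shows that sub-optimal actions are taken only $O(\log T)$ times once their gap $\Delta_h(s,a)$ is resolved.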