A Multi-Armed Bandit Problem with the Optimal Arm Depending on a Hidden Markov Model

Authors

Talha Cihad Gulcu
University of Maryland

Abstract

We consider a novel multi-armed bandit setup in which the reward distribution of every arm depends on a single discrete Markov process. This setup induces correlation across arms, as well as temporal correlation across the time instants at which arms are pulled. For this problem we show that the cumulative regret must grow linearly with the number of time instants at which the outcome of the previous arm pull cannot be determined uniquely. We propose an algorithm relying on the empirical transition matrix and analyze its performance. The algorithm is shown to minimize the regret contributed by time instants at which the outcome of the previous arm pull can be identified uniquely, which implies that the algorithm is order-wise optimal. We experimentally show that our algorithm can outperform both the correlated-UCB algorithm introduced by Gupta et al. in 2018 and the classical UCB algorithm.
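To make the setup concrete, the following is a minimal Python sketch of a toy instance, not the paper's actual algorithm: a hidden two-state Markov chain drives Bernoulli rewards of two arms, and the learner maintains an empirical transition matrix built only from outcomes that uniquely identify the hidden state. All numbers and the greedy decision rule are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy instance: transition matrix of the hidden chain
    # (unknown to the learner).
    P_true = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
    # Bernoulli mean reward of arm a in hidden state s, chosen so that a
    # reward of 1 from arm a uniquely identifies state a, while a reward
    # of 0 is ambiguous -- mirroring the unique/non-unique distinction in
    # the abstract.
    means = np.array([[0.6, 0.0],   # arm 0 pays only in state 0
                      [0.0, 0.6]])  # arm 1 pays only in state 1

    T = 20000
    counts = np.ones((2, 2))        # transition counts, add-one smoothing
    state = 0                       # true hidden state
    last_identified = None          # previous step's state, if uniquely identified
    regret = 0.0

    for t in range(T):
        # Learner's empirical transition matrix, row-normalized.
        P_hat = counts / counts.sum(axis=1, keepdims=True)
        if last_identified is None:
            arm = int(rng.integers(2))           # ambiguous history: explore
        else:
            belief = P_hat[last_identified]      # predicted state distribution
            arm = int(np.argmax(means @ belief)) # greedy w.r.t. the prediction
        reward = int(rng.random() < means[arm, state])
        regret += means[:, state].max() - means[arm, state]

        if reward == 1:
            # Outcome uniquely identifies the current state (it must equal
            # the paying arm); consecutive identifications give a clean
            # one-step transition sample for the empirical matrix.
            if last_identified is not None:
                counts[last_identified, arm] += 1
            last_identified = arm
        else:
            last_identified = None               # ambiguous outcome

        state = int(rng.random() < P_true[state, 1])  # hidden chain moves

    print(f"cumulative regret after {T} steps: {regret:.1f}")
    print("empirical transition matrix:\n", P_hat)

Because a reward of 0 resets the learner to an uninformed state, exploration recurs naturally in this sketch; the paper's algorithm and its regret analysis are more refined than this greedy rule.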

Paper Manuscript