A Study of Policies for the Multi-Armed Bandit Problem with Hidden States

工藤 亘平; 竹川 高志

doi:10.11517/pjsai.JSAI2019.0_3Rin205

Abstract

The bandit problem is the task of maximizing the immediate reward by selecting one of several options and receiving its reward, under the restriction that there is only a single state. Reinforcement learning is the task of maximizing the reward earned in the future by taking actions chosen from the available options when multiple states exist. The two differ in that reinforcement learning assumes the state information is known and takes multiple states into account. In this simulation study, we consider a model in which the current state and the state transition information are unknown: the environment stays in one state for a certain period of time and then transitions to another state. For this model, we compare a general bandit policy and a reinforcement learning policy in terms of cumulative reward. The reinforcement learning policy achieved a higher cumulative reward than the bandit policy.
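The model described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the reward distributions, the dwell time, or either policy, so everything below is an assumption made for illustration: a two-arm Bernoulli bandit, a hypothetical `HiddenStateBandit` class whose hidden state switches cyclically every `dwell` steps, and epsilon-greedy with sample-average updates versus a constant step size. The constant-step variant is only a stand-in for a policy that can track change; it is not the authors' reinforcement learning policy.

```python
import random


class HiddenStateBandit:
    """Bernoulli bandit whose reward probabilities depend on a hidden state.

    Assumption for illustration: the state is held for `dwell` pulls and
    then moves cyclically to the next state. The agent observes neither
    the state nor the moment of the switch.
    """

    def __init__(self, probs_by_state, dwell, rng):
        self.probs_by_state = probs_by_state  # probs_by_state[state][arm]
        self.dwell = dwell
        self.rng = rng
        self.state = 0
        self.t = 0

    def pull(self, arm):
        p = self.probs_by_state[self.state][arm]
        reward = 1 if self.rng.random() < p else 0
        self.t += 1
        # Switch the hidden state once the dwell period has elapsed.
        if self.t % self.dwell == 0:
            self.state = (self.state + 1) % len(self.probs_by_state)
        return reward


def run_epsilon_greedy(env, n_arms, steps, epsilon, step_size, rng):
    """Run epsilon-greedy and return the cumulative reward.

    step_size=None uses sample-average updates (stationary assumption);
    a float uses a constant step size, which weights recent rewards more.
    """
    q = [0.0] * n_arms
    counts = [0] * n_arms
    total = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: q[a])  # exploit
        r = env.pull(arm)
        counts[arm] += 1
        alpha = 1.0 / counts[arm] if step_size is None else step_size
        q[arm] += alpha * (r - q[arm])
        total += r
    return total


def compare(seed=0, steps=2000):
    """Cumulative reward of both update rules on the same kind of environment."""
    probs = [[0.8, 0.2], [0.2, 0.8]]  # the best arm flips with the hidden state
    results = {}
    for name, step_size in [("sample_average", None), ("constant_step", 0.1)]:
        rng = random.Random(seed)
        env = HiddenStateBandit(probs, dwell=200, rng=rng)
        results[name] = run_epsilon_greedy(env, 2, steps, 0.1, step_size, rng)
    return results


if __name__ == "__main__":
    print(compare())
```

Calling `compare()` returns the cumulative reward for each update rule; the numbers depend on the seed and on the assumed parameters, so they say nothing about the results reported in the paper.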
