Transactions of the Japanese Society for Artificial Intelligence (人工知能学会論文誌)
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
Original Paper
Non-Stationary Multi-Armed Bandit Problem with External Information on Temporal Changes
難波 博之

2021, Volume 36, Issue 3, pp. D-K84_1-11

Abstract

The multi-armed bandit problem is a fundamental mathematical problem in sequential optimization and reinforcement learning, with a variety of applications such as online recommendation systems and clinical trial design. It describes a situation in which a player sequentially selects from a given set of candidate choices so as to maximize the cumulative reward. In this paper, we consider non-stationary multi-armed bandit problems, in which the reward distribution of each arm varies with time. We point out that in some real applications, information on the changes of the reward distributions can be utilized. In particular, we consider information that may restrict the rounds at which the reward distribution changes. For such scenarios, we propose a novel strategy called the PM policy. The proposed policy is based on the existing CUSUM-UCB and M-UCB policies, which do not consider external information. Whereas these existing policies monitor all arms to detect changes in the reward distributions, our policy monitors only the important arms and rounds. As a result, the proportion of unnecessary monitoring is reduced and the search becomes more efficient. We describe a regret bound for the proposed policy and show the effectiveness of the proposed method through numerical experiments.
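
The idea of monitoring only selected arms and rounds can be illustrated with a small sketch. The code below is not the PM policy itself, whose details the abstract does not give. It is a simplified UCB1 learner with an M-UCB-style window test, and it runs the test only on (arm, round) pairs in a caller-supplied set. All names (`ucb_index`, `window_change_detected`, `run_restricted_monitoring`, `monitored`) and the parameter values `w` and `b` are assumptions made for illustration, and the forced uniform exploration used by M-UCB is omitted for brevity.

```python
import math


def ucb_index(mean, count, t):
    """UCB1 index: empirical mean plus an exploration bonus."""
    return mean + math.sqrt(2.0 * math.log(t) / count)


def window_change_detected(samples, w, b):
    """Window test in the style of M-UCB (simplified): compare the sums
    of the two halves of the last w samples against a threshold b."""
    if len(samples) < w:
        return False
    recent = samples[-w:]
    half = w // 2
    return abs(sum(recent[half:]) - sum(recent[:half])) > b


def run_restricted_monitoring(reward_fn, n_arms, horizon, monitored, w=20, b=6.0):
    """Run UCB1, testing for a change only on (arm, round) pairs in
    `monitored` (hypothetical encoding of the external information).
    On detection, all statistics are reset and learning restarts."""
    history = [[] for _ in range(n_arms)]  # rewards since the last restart
    choices, restarts = [], []
    last_restart = 0
    for t in range(1, horizon + 1):
        local_t = t - last_restart  # rounds elapsed since the last restart
        untried = [a for a in range(n_arms) if not history[a]]
        if untried:
            arm = untried[0]  # play each arm once after a restart
        else:
            arm = max(
                range(n_arms),
                key=lambda a: ucb_index(
                    sum(history[a]) / len(history[a]), len(history[a]), local_t
                ),
            )
        history[arm].append(reward_fn(arm, t))
        choices.append(arm)
        # Monitoring is skipped unless external information flags this pair.
        if (arm, t) in monitored and window_change_detected(history[arm], w, b):
            history = [[] for _ in range(n_arms)]
            last_restart = t
            restarts.append(t)
    return choices, restarts


def example_reward(arm, t):
    """Deterministic toy environment: arm 0 degrades at round 100."""
    if arm == 0:
        return 1.0 if t < 100 else 0.0
    return 0.5


# External information (assumed): arm 0 may change during rounds 100-139.
monitored_pairs = {(0, t) for t in range(100, 140)}
choices, restarts = run_restricted_monitoring(example_reward, 2, 300, monitored_pairs)
```

In this toy setting the change test runs on at most 40 of the 300 rounds, which is the kind of saving in monitoring that the abstract refers to. How the important arms and rounds are actually chosen, and how the thresholds are set, are specified in the paper and not reproduced here.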

© 2021 The Japanese Society for Artificial Intelligence