人工知能 (Artificial Intelligence)
Online ISSN: 2435-8614
Print ISSN: 2188-2266
Formerly 人工知能学会誌 / Journal of the Japanese Society for Artificial Intelligence (1986-2013, Print ISSN: 0912-8085)
MarcoPolo: A Reinforcement Learning System Considering the Trade-off between Reward Acquisition and Environment Identification
宮崎 和光 (Kazuteru Miyazaki), 山村 雅幸 (Masayuki Yamamura), 小林 重信 (Shigenobu Kobayashi)

1997, Volume 12, Issue 1, pp. 78-89

Abstract

Reinforcement learning is a kind of machine learning. It aims to adapt an agent to a given environment using rewards as a clue. Profit Sharing (PS) can acquire rewards efficiently in the initial learning phase, but it cannot always learn an optimal policy, i.e., one that maximizes the reward per action. Q-learning is guaranteed to obtain an optimal policy, but it needs numerous trials to learn one. On Markov decision processes (MDPs), if a correct environment model has been identified, an optimal policy can be derived by applying the Policy Iteration Algorithm (PIA). The k-Certainty Exploration Method has been proposed as an efficient method for identifying MDPs. We consider that an ideal reinforcement learning system should acquire some rewards even in the initial learning phase and acquire more rewards as the identification of the environment proceeds. In this paper, we propose a unified learning system, MarcoPolo, which considers both acquiring rewards by PS or PIA and identifying the environment by the k-Certainty Exploration Method. MarcoPolo can realize any trade-off between exploitation and exploration throughout the whole learning process. By applying MarcoPolo to an example, its basic performance is shown. Moreover, by applying it to Sutton's maze problem and a modified version of it, its feasibility in more realistic domains is shown.
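As background for the PIA component mentioned in the abstract, the following is a minimal, self-contained sketch of the Policy Iteration Algorithm on a small tabular MDP. The toy three-state chain environment, its reward values, and the discount factor gamma are invented here for illustration; this is not the paper's MarcoPolo implementation.

```python
import numpy as np

# Toy MDP (an assumption for illustration): a 3-state chain where
# action 0 moves "left", action 1 moves "right", and staying at the
# right end via action 1 yields reward 1.
n_states, n_actions, gamma = 3, 2, 0.9

# P[s, a, s'] = transition probability; R[s, a] = expected reward.
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0               # action 0: move left
    P[s, 1, min(s + 1, n_states - 1)] = 1.0    # action 1: move right
R[n_states - 1, 1] = 1.0                       # reward at the right end

def policy_iteration(P, R, gamma):
    """Derive an optimal policy for a fully identified MDP."""
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), policy]
        R_pi = R[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        Q = R + gamma * (P @ V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

policy, V = policy_iteration(P, R, gamma)
print("optimal policy:", policy, "state values:", V)
```

Policy iteration alternates exact evaluation with greedy improvement and stops once the policy is stable, which is why a correctly identified environment model (e.g., one obtained by the k-Certainty Exploration Method) makes the optimal policy directly computable.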

© 1997 The Japanese Society for Artificial Intelligence