Actorに適正度の履歴を用いたActor-Criticアルゴリズム : 不完全なValue-Functionのもとでの強化学習

木村 元; 小林 重信

doi:10.11517/jjsai.15.2_267

抄録

We present an analysis of actor-critic algorithms, in which the actor updates its policy using eligibility traces of the policy parameters. Most of the theoretical results for eligibility traces have been for only critic's value iteration algorithms. This paper investigates what the actor's eligibility trace does. The results show that the algorithm is an extension of Williams' REINFORCE algorithms for infinite horizon reinforcement tasks, and then the critic provides an appropriate reinforcement baseline for the actor. Thanks to the actor's eligibility trace, the actor improves its policy by using a gradient of actual return, not by using a gradient of the estimated return in the critic. It enables the agent to learn a fairly good policy under the condition that the approximated value function in the critic is hopelessly inaccurate for conventional actor-critic algorithms. Also, if an accurate value function is estimated by the critic, the actor's learning is dramatically accelerated in our test cases. The behavior of the algorithm is demonstrated through simulations of a linear quadratic control problem and a pole balancing problem.

著者関連情報

お気に入り & アラート

閲覧履歴

発行機関からのお知らせ

PDF閲覧時に認証を求められる記事がございます（発行後2年間）が，人工知能学会の個人会員は無料で閲覧可能です．認証のための購読者番号やパスワードは会員マイページ（ユース会員の場合はジュニア・ユース会員サイト）にログインし「お知らせ」にてご確認下さい（会員情報管理システムとオンラインで連携していないため，パスワードは同システムとは異なります．また，認証情報の更新は偶数月の月末に実施しております．新規入会された方は利用できるまでしばらくお待ちください）．個人会員以外は記事複製申込フォームから購入いただけます．また，アマゾンにて冊子版あるいはKindle版を購入いただくことも可能です．

責任著者(Corresponding author)

J-STAGEへの登録はこちら（無料）