Reward-Oriented Environment Inference in Reinforcement Learning

高橋 春輝; 深井 朋樹; 酒井 裕; 竹川 高志

doi:10.11517/pjsai.JSAI2022.0_4E1GS203

Abstract

Advances in deep neural networks have made it possible to exceed human performance on simulated reinforcement learning problems. For real-world problems, however, issues such as explainability and online learning remain. Because real-world environments include observables that are independent of reward, the number of apparent observable patterns becomes so large that the operating principles of an AI system are difficult to explain. In addition, achieving high performance requires a large amount of training data, which makes online learning difficult. In this study, we therefore attempt online policy learning in an environment that generates a huge number of observable patterns by combining a reward-dependent environment with a reward-independent one. The proposed learning method consists of two parts: action selection that controls exploration and exploitation by sampling, and reward-oriented environment inference that reduces the many observable patterns to a concise set of states. In our results, the reward-oriented environment inference model recovered the reward-dependent environment from a large number of observable patterns. Furthermore, combining the proposed model with the action selection improved the speed at which the optimal policy was learned.
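
The abstract does not say which sampling scheme is used to balance exploration and exploitation. Purely as an illustration of the general idea, the sketch below uses Thompson sampling on a hypothetical three-armed Bernoulli bandit. The function name `thompson_sampling`, the Beta(1, 1) priors, and the reward probabilities are all assumptions made for this example and are not taken from the paper.

```python
import random


def thompson_sampling(true_probs, steps, seed=0):
    """Run Thompson sampling on a Bernoulli bandit (illustrative only).

    Each arm keeps a Beta posterior over its unknown reward probability,
    starting from a uniform Beta(1, 1) prior. Returns pull counts per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(steps):
        # Draw one plausible reward rate per arm from its current posterior.
        # Uncertain arms sometimes draw high values, which drives exploration.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled rates (exploitation).
        arm = max(range(n_arms), key=lambda a: samples[a])

        # Simulated environment: reward 1 with the arm's true probability.
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


# Hypothetical reward probabilities; the last arm is the best one.
pulls = thompson_sampling([0.2, 0.5, 0.8], steps=2000)
print(pulls)
```

In this kind of scheme no separate exploration parameter is needed: as the posteriors sharpen, the sampled rates concentrate around their means and the choice of arm settles on the one that appears best.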
