Host: The Japanese Society for Artificial Intelligence
Name: The 36th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 36
Location: [in Japanese]
Date: June 14, 2022 - June 17, 2022
Advances in deep neural networks have enabled performance exceeding that of humans on reinforcement learning problems in simulation. For real-world problems, however, issues such as explainability and online learning remain. Because real-world environments include observables that are independent of the reward, the number of apparent observable patterns becomes so large that the operating principles of the AI are difficult to explain. In addition, achieving high performance requires a large amount of training data, which makes online learning difficult. In this study, we therefore attempt online policy learning in an environment that generates a huge number of observable patterns by combining a reward-dependent environment with a reward-independent one. The proposed learning method consists of two parts: action selection that controls exploration and exploitation by sampling, and reward-oriented environment inference that reduces the observable patterns to a concise set of states. In our results, the reward-oriented environment inference model recovered the reward-dependent environment from a large number of observable patterns. Furthermore, combining the proposed model with the action selection improved the speed at which the optimal policy was learned.
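
The setting described in the abstract can be illustrated with a toy sketch. It is not the authors' method: the abstract does not specify the sampling scheme or how the reward-relevant state is inferred. The sketch assumes Thompson sampling over Beta posteriors for the sampling-based action selection, and it replaces the learned reward-oriented inference with a hand-specified index mask (`relevant_idx`). All names (`make_observation`, `reduce_state`, `ThompsonAgent`, `run`) and the toy reward rule are hypothetical.

```python
import random
from collections import defaultdict


def make_observation(rng, n_relevant, n_irrelevant):
    """Concatenate reward-dependent bits with reward-independent noise bits."""
    relevant = tuple(rng.randint(0, 1) for _ in range(n_relevant))
    irrelevant = tuple(rng.randint(0, 1) for _ in range(n_irrelevant))
    return relevant + irrelevant


def reduce_state(obs, relevant_idx):
    """Stand-in for reward-oriented inference: keep only reward-relevant bits.

    Assumption: the mask is given by hand here; the paper infers it from data.
    """
    return tuple(obs[i] for i in relevant_idx)


class ThompsonAgent:
    """Per-state Beta-Bernoulli Thompson sampling (assumed sampling scheme)."""

    def __init__(self, n_actions, rng):
        self.n_actions = n_actions
        self.rng = rng
        # Beta(1, 1) prior for every (state, action) pair.
        self.alpha = defaultdict(lambda: [1.0] * n_actions)
        self.beta = defaultdict(lambda: [1.0] * n_actions)

    def act(self, state):
        # Draw one sample per action and act greedily on the samples.
        samples = [
            self.rng.betavariate(self.alpha[state][a], self.beta[state][a])
            for a in range(self.n_actions)
        ]
        return max(range(self.n_actions), key=samples.__getitem__)

    def update(self, state, action, reward):
        if reward:
            self.alpha[state][action] += 1.0
        else:
            self.beta[state][action] += 1.0


def run(n_steps=2000, n_relevant=1, n_irrelevant=8, seed=0, use_reduction=True):
    """Return (total reward, number of distinct states the agent saw)."""
    rng = random.Random(seed)
    agent = ThompsonAgent(n_actions=2, rng=rng)
    relevant_idx = range(n_relevant)
    seen = set()
    total = 0
    for _ in range(n_steps):
        obs = make_observation(rng, n_relevant, n_irrelevant)
        state = reduce_state(obs, relevant_idx) if use_reduction else obs
        seen.add(state)
        action = agent.act(state)
        # Toy reward rule: 1 if the action equals the first reward-relevant bit.
        reward = int(action == obs[0])
        agent.update(state, action, reward)
        total += reward
    return total, len(seen)


if __name__ == "__main__":
    print("reduced state:", run(use_reduction=True))
    print("raw observables:", run(use_reduction=False))
```

With one reward-relevant bit and eight reward-independent bits, the raw observable space has up to 2^9 = 512 patterns, while the reduced state has at most 2. In this toy setting the agent that acts on the reduced state has far fewer statistics to estimate, which is the intuition behind the faster policy learning the abstract reports.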