Host : Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT)
Name : 37th Fuzzy System Symposium
Number : 37
Location : [in Japanese]
Date : September 13, 2021 - September 15, 2021
Abstract : In action selection policies for deep reinforcement learning, exploration and exploitation can be balanced efficiently by taking into account how often each state-action pair has been selected. However, when the similarity between states is learned in parallel, it is difficult to count accurately how many times each state has been visited in the past. In this paper, we propose a new method that estimates the value of each state while balancing exploration and exploitation, by constructing a network that estimates only whether or not a state has been visited before, independently of any reward. Since the visitation frequency of a state should simply increase as learning progresses, we design the estimating function accordingly. The policy then takes into account the mean and variance of a beta distribution constructed from reward values and these experience (visitation) estimates. The effectiveness of the proposed method is confirmed by numerical experiments.
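The abstract does not give the paper's exact formulation, so the following is only a rough sketch of one way a policy could combine per-action reward estimates with visitation (pseudo-count) estimates through a beta distribution, scoring each action by the distribution's mean plus a variance-based exploration bonus. All names (select_action, q_values, visit_estimates, kappa) are hypothetical, and the scoring rule is one plausible reading of "takes into account the mean and variance."

```python
import numpy as np

def beta_mean_var(alpha, beta):
    """Mean and variance of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
    return mean, var

def select_action(q_values, visit_estimates, kappa=1.0):
    """Score each action with a beta distribution built from a reward
    estimate and a visitation estimate, then pick the highest score.

    q_values        : per-action reward estimates, assumed rescaled to [0, 1]
    visit_estimates : per-action pseudo-counts (hypothetical output of the
                      visitation-estimating network)
    kappa           : weight of the variance (exploration) term
    """
    scores = []
    for q, n in zip(q_values, visit_estimates):
        # Treat q * n as pseudo-successes and (1 - q) * n as pseudo-failures,
        # with a Beta(1, 1) uniform prior.
        alpha = 1.0 + q * n
        beta = 1.0 + (1.0 - q) * n
        mean, var = beta_mean_var(alpha, beta)
        scores.append(mean + kappa * np.sqrt(var))
    return int(np.argmax(scores))

# Example: action 1 has a slightly lower value estimate but far fewer
# visits, so the variance bonus can favor exploring it.
print(select_action(q_values=[0.6, 0.55], visit_estimates=[50.0, 2.0]))
```

Under this reading, the variance term dominates for rarely visited states and encourages exploration, while as the pseudo-counts grow the score converges to the reward estimate and the policy shifts toward exploitation.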