Host: Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT)
Name: 37th Fuzzy System Symposium
Number: 37
Location: [in Japanese]
Date: September 13, 2021 - September 15, 2021
Policy gradient methods such as the REINFORCE algorithm, which express the gradient of the expected reward without using value functions, need not assume that the agents' policies or the environmental models of rewards and state-transition probabilities have the Markov property when applied to multi-agent systems. One such policy gradient method uses an objective function, which is minimized to determine an action, as the energy function of the Boltzmann selection that represents the policy; it has been shown that this objective function can be constructed flexibly. On the other hand, reinforcement learning in multi-agent systems suffers from the state-space explosion problem. As an effective countermeasure, a method that approximates the value function with a Boltzmann machine has been proposed. In this paper, we first propose a policy gradient method that approximates the objective function in the policy expressed by Boltzmann selection with the energy of a Boltzmann machine. Second, we propose a more efficient method that approximates the objective function with the energy of a modular restricted Boltzmann machine. In experiments on a pursuit problem, both proposed methods learned appropriate policies with a small number of parameters, and the second method significantly reduced the computational cost of learning compared with the first.
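To make the approach concrete, the following is a minimal Python sketch (not the authors' implementation) of the core idea: a Boltzmann-selection policy whose energy is the free energy of a restricted Boltzmann machine over one-hot state-action encodings, updated by the REINFORCE gradient. The state/action/hidden sizes, the temperature, and the single-step update are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N_STATE, N_ACTION, N_HIDDEN, TEMP = 8, 4, 16, 1.0

    # RBM parameters: visible units encode (state, action) as one-hot vectors.
    W = rng.normal(0.0, 0.1, (N_HIDDEN, N_STATE + N_ACTION))  # hidden-visible weights
    b = np.zeros(N_STATE + N_ACTION)                          # visible biases
    c = np.zeros(N_HIDDEN)                                    # hidden biases

    def visible(s, a):
        """One-hot encoding of a state-action pair as the RBM's visible layer."""
        v = np.zeros(N_STATE + N_ACTION)
        v[s] = 1.0
        v[N_STATE + a] = 1.0
        return v

    def free_energy(v):
        """RBM free energy F(v) = -b.v - sum_j log(1 + exp(c_j + W_j.v))."""
        return -b @ v - np.sum(np.logaddexp(0.0, c + W @ v))

    def policy(s):
        """Boltzmann selection: pi(a|s) proportional to exp(-F(s,a)/T)."""
        energies = np.array([free_energy(visible(s, a)) for a in range(N_ACTION)])
        logits = -energies / TEMP
        logits -= logits.max()            # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    def grad_log_pi(s, a):
        """Gradient of log pi(a|s) w.r.t. (W, b, c) through the RBM energy."""
        p = policy(s)
        gW = np.zeros_like(W); gb = np.zeros_like(b); gc = np.zeros_like(c)
        for a2 in range(N_ACTION):
            v = visible(s, a2)
            h = 1.0 / (1.0 + np.exp(-(c + W @ v)))   # sigmoid: E[hidden | visible]
            # d(-F)/dW = h v^T, d(-F)/db = v, d(-F)/dc = h
            weight = ((1.0 if a2 == a else 0.0) - p[a2]) / TEMP
            gW += weight * np.outer(h, v)
            gb += weight * v
            gc += weight * h
        return gW, gb, gc

    # One REINFORCE update from a sampled action and a stand-in episode return G.
    s = 3
    a = rng.choice(N_ACTION, p=policy(s))
    G, alpha = 1.0, 0.05
    gW, gb, gc = grad_log_pi(s, a)
    W += alpha * G * gW; b += alpha * G * gb; c += alpha * G * gc
    print("pi(.|s) after one update:", np.round(policy(s), 3))

The RBM's free energy keeps the parameter count linear in the numbers of visible and hidden units, which is the sense in which such an energy-based approximation can represent a policy over a large joint state-action space with few parameters; the modular structure proposed in the paper is a further refinement not shown here.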