Journal of Japan Society for Fuzzy Theory and Intelligent Informatics
Online ISSN : 1881-7203
Print ISSN : 1347-7986
ISSN-L : 1347-7986
Original Papers
Multi-Agent Reinforcement Learning by a Policy Gradient Method with Energy-Based Policies of a Boltzmann Machine
Seiji ISHIHARA, Harukazu IGARASHI

2022 Volume 34 Issue 3 Pages 624-634

Abstract

Policy gradient methods such as the REINFORCE algorithm, which express the gradient of the expected reward without using value functions, need not assume that the agents' policies or the environmental models of rewards and state-transition probabilities have the Markov property when applied to multi-agent systems. One such policy gradient method uses an objective function, which is to be minimized in determining an action, as the energy function of the Boltzmann selection that defines the policy. It has been shown that this objective function can be constructed flexibly from weighted terms representing, for example, the values of state-action pairs and heuristics. On the other hand, reinforcement learning in a multi-agent system suffers from the state-explosion problem, in which the number of states grows rapidly with the complexity of the environment and the numbers of agents and actions. As an effective countermeasure, a method that approximates the value function with a Boltzmann machine has been proposed. Such an approximation would also be useful in the policy gradient method with the objective function, if it could be applied there. In this paper, we first propose a policy gradient method that approximates the objective function in the Boltzmann-selection policy with the energy of a Boltzmann machine. Second, we propose a more efficient method that approximates the objective function with the energy of a modular-structured restricted Boltzmann machine, and derive the learning rule of the corresponding policy gradient method. In experiments on a pursuit problem, a typical example of a multi-agent system, both proposed methods learned appropriate policies with a small number of parameters. Furthermore, the second proposed method significantly reduced the computational cost of learning compared to the first.
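To make the setup described above concrete, the following is a minimal sketch, not the authors' implementation: a Boltzmann-selection policy whose energy plays the role of the objective function, updated with a REINFORCE-style policy gradient, plus the free energy of a small restricted Boltzmann machine as an alternative energy parameterization. All names, problem sizes, the temperature, and the learning-rate choices are hypothetical assumptions for illustration.

```
# A minimal sketch (assumptions throughout; not the paper's exact method).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 16, 4, 1.0   # assumed problem size and temperature

# Sketch 1: energy E(s, a) held in a table of learnable parameters.
# The paper instead approximates this energy with a Boltzmann machine.
E = rng.normal(scale=0.1, size=(n_states, n_actions))

def policy(s):
    """Boltzmann selection: pi(a|s) proportional to exp(-E(s, a) / T)."""
    logits = -E[s] / T
    p = np.exp(logits - logits.max())   # numerically stabilized softmax
    return p / p.sum()

def reinforce_update(episode, ret, lr=0.1):
    """REINFORCE: ascend the return-weighted log-likelihood of the
    chosen actions, using
    d log pi(a|s) / d E(s, b) = (pi(b|s) - 1[b == a]) / T."""
    for s, a in episode:
        p = policy(s)
        one_hot = np.eye(n_actions)[a]
        E[s] += lr * ret * (p - one_hot) / T   # lowers E(s, a) if ret > 0

# Sketch 2: replace the table with the free energy of a small restricted
# Boltzmann machine over a one-hot (state, action) visible vector, so the
# parameter count no longer grows with the full state-action table.
n_vis, n_hid = n_states + n_actions, 8
W = rng.normal(scale=0.1, size=(n_vis, n_hid))
b = np.zeros(n_vis)
c = np.zeros(n_hid)

def rbm_free_energy(s, a):
    """F(v) = -b.v - sum_j log(1 + exp(c_j + v.W_j)), binary hidden units."""
    v = np.zeros(n_vis)
    v[s] = 1.0
    v[n_states + a] = 1.0
    return -b @ v - np.logaddexp(0.0, c + v @ W).sum()

# Usage example: sample one action and apply an update for a unit return.
s = 0
a = rng.choice(n_actions, p=policy(s))
reinforce_update([(s, a)], ret=1.0)
f = rbm_free_energy(s, a)   # candidate replacement for E[s, a]
```

The point of the RBM-style parameterization is that the number of parameters scales with the numbers of visible and hidden units rather than with the joint state-action table, which is what makes such an approximation a plausible remedy for the state-explosion problem the abstract describes.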

© 2022 Japan Society for Fuzzy Theory and Intelligent Informatics