2009, Vol. 129, No. 7, pp. 1339-1347
Profit Sharing is an exploitation-oriented reinforcement learning method that aims to adapt a system to a given environment. In Profit Sharing, an agent learns a policy from the reward it receives from the environment upon reaching a goal state. A key design issue is the reinforcement function, which distributes the received reward among the action rules in the policy. If the reinforcement function satisfies the ineffective rule suppression theorem, it distributes more reward to effective rules than to ineffective ones, even in the worst case where an ineffective rule is selected infinitely often. The value of such a reinforcement function, however, decreases exponentially with distance from the goal state. As a result, the agent fails to learn an appropriate policy when episodes from an initial state to the goal state are relatively long. In this paper, we propose a new dynamic reinforcement function that takes into account the expected value of the reward distributed to each rule. With our reinforcement function, the expected reward distributed to effective rules remains larger than that distributed to ineffective ones, and the decrease in the function's value is suppressed even as episodes grow long, so the agent can learn an appropriate policy. We apply our reinforcement function to Sutton's maze problem and demonstrate its effectiveness.
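To make the credit-assignment scheme and the exponential-decay problem concrete, the following is a minimal sketch of a Profit Sharing update with a geometrically decreasing reinforcement function. The decay ratio, state names, and reward value are illustrative assumptions, not the paper's proposed dynamic function; a geometric ratio small enough relative to the number of selectable actions is one commonly used way to satisfy the ineffective rule suppression condition.

```python
from collections import defaultdict

def profit_sharing_update(weights, episode, reward, decay):
    """Distribute `reward` over an episode with geometric decay.

    `episode` is the list of (state, action) rules fired from the
    initial state to the goal. The rule fired just before the goal
    receives reward * decay, the one before it reward * decay**2,
    and so on: credit shrinks geometrically with distance from the
    goal, which is exactly the decay the paper identifies as a
    problem for long episodes.
    """
    credit = reward
    for state, action in reversed(episode):
        credit *= decay
        weights[(state, action)] += credit

# Hypothetical 3-step episode with decay = 1/4 (e.g. four actions).
weights = defaultdict(float)
episode = [("s0", "right"), ("s1", "right"), ("s2", "up")]
profit_sharing_update(weights, episode, reward=100.0, decay=0.25)
# credits: ("s2","up") -> 25.0, ("s1","right") -> 6.25,
#          ("s0","right") -> 1.5625
```

Note how after only three steps the earliest rule already receives under 2% of the goal reward; for long episodes the credit reaching early rules vanishes, which is the motivation for the dynamic reinforcement function the paper proposes.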