Transactions of the Japanese Society for Artificial Intelligence
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
Paper
An Efficient Exploration Method for MDP Environments Using the k-Certainty Exploration Method and Dynamic Programming
Takeshi Tateyama, Seiichi Kawata

2001, Volume 16, Issue 1, pp. 11-19

Abstract

A common problem in reinforcement learning systems such as Q-learning is reducing the number of trials needed to converge to an optimal policy. The k-certainty exploration method was proposed as one solution to this problem: Miyazaki reported that it can determine an optimal policy faster than Q-learning in Markov decision processes (MDPs). Although the method is already efficient, we propose an improvement that makes it more so. In the k-certainty exploration method, when the current state contains no k-uncertain rule (a rule that has not yet been selected k times), the agent sometimes walks randomly until it reaches a state where a k-uncertain rule can be selected. We regard this random walk as wasteful. To reduce it, we propose combining the k-certainty exploration method with dynamic programming (DP). Miyazaki's system applies DP only after every rule has been executed at least k times, whereas our method uses DP together with the k-certainty exploration method during learning. Our method takes one of two kinds of action. When the agent can select a k-uncertain rule, it chooses one of these rules at random, exactly as in the k-certainty exploration method. When no k-uncertain rule is available, the agent behaves differently: it uses DP to compute an optimal policy for moving from the current state to a state in which k-uncertain rules remain. The model used for DP is constructed from known states only, as follows. First, the agent builds a map consisting only of known states; in this map, states that still contain k-uncertain rules are treated as goals and are assigned arbitrary state values. Note that the map is not given from outside; it is built purely from the agent's experience. Next, the state values of the states containing only k-certain rules are computed by DP (we use policy iteration). Finally, the agent keeps selecting greedy actions until it arrives at a state in which a k-uncertain rule can be selected. With this improvement, we expect the agent to determine an optimal policy faster than with the k-certainty exploration method alone, and we have verified by computer simulation that our exploration method does determine an optimal policy faster than the k-certainty exploration method.
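As a reading aid, the following Python sketch shows one way the two-case behaviour described in the abstract could be organized. It is not the authors' implementation: the class name KCertaintyDPExplorer, the deterministic one-step model, the goal value of 1.0, the discount factor, and the iteration limits are illustrative assumptions. Only the control flow follows the abstract: choose a k-uncertain rule at random when one exists; otherwise run policy iteration over the experience-built map, with states that still hold k-uncertain rules as goals, and act greedily toward such a state.

```python
import random
from collections import defaultdict


class KCertaintyDPExplorer:
    """Illustrative sketch (not the authors' code) of k-certainty exploration
    combined with DP planning over the map of known states."""

    def __init__(self, actions, k=2, gamma=0.95):
        self.actions = actions
        self.k = k                            # certainty threshold per rule
        self.gamma = gamma                    # discount factor (assumption)
        self.count = defaultdict(int)         # (state, action) -> selection count
        self.model = defaultdict(dict)        # state -> {action: next_state}, learned map

    def k_uncertain_actions(self, state):
        # A rule (state-action pair) is k-uncertain while selected fewer than k times.
        return [a for a in self.actions if self.count[(state, a)] < self.k]

    def record(self, state, action, next_state):
        # Grow the experience-only map; nothing is given from outside.
        self.count[(state, action)] += 1
        self.model[state][action] = next_state

    def select_action(self, state):
        uncertain = self.k_uncertain_actions(state)
        if uncertain:
            # Case 1: same behaviour as the k-certainty exploration method.
            return random.choice(uncertain)
        # Case 2: plan with DP to move toward a state with k-uncertain rules.
        values = self._policy_iteration()
        known = self.model[state]             # every rule here is k-certain, hence recorded
        return max(known, key=lambda a: values.get(known[a], 0.0))

    def _policy_iteration(self):
        # States that still contain k-uncertain rules are goals with an arbitrary
        # fixed value; values of the remaining known states are computed by
        # policy iteration on the learned (deterministic) model.
        goal_value = 1.0                      # arbitrary positive goal value (assumption)
        states = set(self.model)
        for nexts in self.model.values():
            states.update(nexts.values())
        goals = {s for s in states if self.k_uncertain_actions(s)}
        nongoals = [s for s in states if s not in goals]
        policy = {s: next(iter(self.model[s])) for s in nongoals}
        values = {s: (goal_value if s in goals else 0.0) for s in states}
        for _ in range(100):                  # safety cap on policy-iteration rounds
            for _ in range(50):               # iterative policy-evaluation sweeps
                for s, a in policy.items():
                    values[s] = self.gamma * values.get(self.model[s][a], 0.0)
            stable = True                     # policy-improvement step
            for s in nongoals:
                best = max(self.model[s],
                           key=lambda a: values.get(self.model[s][a], 0.0))
                if best != policy[s]:
                    policy[s], stable = best, False
            if stable:
                break
        return values
```

In an environment loop, the agent would alternate select_action, executing the chosen action, and record, so that the map grows purely from experience and the DP planning step never requires the true environment model.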

© 2001 JSAI (The Japanese Society for Artificial Intelligence)