経験に固執しないProfit Sharing法

植村 渉; 上野 敦志; 辰巳 昭治

doi:10.1527/tjsai.21.81

抄録

Profit Sharing is one of the reinforcement learning methods. An agent, as a learner, selects an action with a state-action value and receives rewards when it reaches a goal state. Then it distributes receiving rewards to state-action values. This paper discusses how to set the initial value of a state-action value.
A distribution function ƒ(x) is called as the reinforcement function. On Profit Sharing, an agent learns a policy by distributing rewards with the reinforcement function. On Markov Decision Processes (MDPs), the reinforcement function ƒ(x) = 1/L^x is useful, and on Partially Observable Markov Decision Processes (POMDPs), ƒ(x) = 1/L^w is useful, where L is the sufficient number of rules at each state, and W is the length of an episode.
If episodes are always long, the value of the reinforcement function is little. So the differences of rule values become little, and the agent learns little by using the roulette selection as an action selection. This problem is called as Learning Speed Problem.
If the value of the reinforcement function for an action is very higher than its state-action value, an agent will not select other action. There is a problem when its action is not a optimal action. This problem is called as Past Experiences Problem.
This paper shows that both Learning Speed Problem and Past Experiences Problem are caused by the bad setting between the initial values of a state-action values and the function values of a reinforcement function. We propose how to set the initial values of a state-action values at each state. The experiment shows that an agent can learn correctly even if the length of episode is large. And shows the effectiveness on both MDPs and POMDPs. Our proposed method focuses on the initialization of state-action values and does not limit reinforcement functions. So it can apply to any reinforcement function.

著者関連情報

お気に入り & アラート

閲覧履歴

責任著者(Corresponding author)

J-STAGEへの登録はこちら（無料）