We propose a learning rule for a layered neural network that is based on backpropagation but is unsupervised. Instead of a supervisor, the network receives a reward or a penalty from the environment. A key characteristic of the network is the stochastic nature of its output units, which enables learning through a tentative target signal derived from the reward: the network tentatively assumes the present output to be correct if a positive reward is given and incorrect if a negative one is given, and backpropagates the corresponding error correction as in ordinary backpropagation. The algorithm is especially effective for learning a series of actions, under the additional framework that the reward is used not only to evaluate the present output but is also assigned to the temporal series of outputs that led to it. The basic algorithm is therefore extended in two directions for time-series learning. Type 1: temporally accumulated error values are saved until a future reward arrives, and once a reward is given, the whole chain of outputs is evaluated. Type 2: neurons temporally integrate their inputs. Both extensions are applied to time-series calculations and to several complex motion-acquisition tasks of an autonomous agent, a simulation model of the Khepera robot. The agent successfully learns obstacle avoidance and food capture in several different environments using the proposed methods. Finally, the entropy of the input patterns during learning is proposed as an index of the complexity of the learning environment, aiming to characterize the generalization ability to unlearned environments.
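The core update rule described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the network size, learning rate, and class and method names are assumptions, and the squared-error backprop form is standard rather than taken from the source. The sketch shows a one-hidden-layer network whose output units fire stochastically; a positive reward treats the sampled output itself as the tentative target, a negative reward treats its complement as the target, and the resulting error is backpropagated as usual.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RewardBackpropNet:
    """Illustrative sketch of reward-driven backprop with stochastic outputs."""

    def __init__(self, n_in, n_hid, n_out, lr=0.1):
        # Small random weights; sizes and initialization are assumptions.
        self.W1 = rng.normal(0.0, 0.5, (n_hid, n_in))
        self.W2 = rng.normal(0.0, 0.5, (n_out, n_hid))
        self.lr = lr

    def forward(self, x):
        self.x = x
        self.h = sigmoid(self.W1 @ x)
        self.p = sigmoid(self.W2 @ self.h)  # firing probabilities of output units
        # Stochastic output: each unit fires (1) with probability p.
        self.y = (rng.random(self.p.shape) < self.p).astype(float)
        return self.y

    def learn(self, reward):
        # Tentative target: the sampled output if the reward is positive,
        # its complement if the reward is negative.
        target = self.y if reward > 0 else 1.0 - self.y
        # Ordinary backprop of the squared error w.r.t. the tentative target.
        err_out = (target - self.p) * self.p * (1.0 - self.p)
        err_hid = (self.W2.T @ err_out) * self.h * (1.0 - self.h)
        self.W2 += self.lr * np.outer(err_out, self.h)
        self.W1 += self.lr * np.outer(err_hid, self.x)
```

For the Type 1 extension one would additionally buffer the per-step error terms and apply them as a batch when a delayed reward arrives; that bookkeeping is omitted here for brevity.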