Abstract
This paper discusses the learning behaviour of variable-structure stochastic automata in stationary random environments.
A new reinforcement scheme of the reward-penalty type (TNP) is proposed that ensures ε-optimality in all stationary random environments. The ε-optimality of the TNP scheme is proved using a semi-martingale inequality together with involved manipulations. Two previously devised reinforcement schemes (Lr-1 and T1) are also examined from the same point of view.
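The TNP scheme is superior to the Lr-1 and T1 schemes in the following two respects:
(1) The Lr-1 scheme is of the reward-inaction type, whereas the TNP scheme is of the reward-penalty type.
(2) The T1 scheme ensures ε-optimality only under certain conditions, whereas the TNP scheme ensures it in all stationary random environments.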
Computer simulation results also indicate that, of the three schemes, the TNP scheme achieves the most effective learning behaviour.
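
As an illustration of the kind of automaton discussed above, the following is a minimal sketch of a variable-structure stochastic automaton with a standard linear reward-inaction update, run in a stationary random environment. It is not the TNP scheme, whose update rule is not given in this abstract, and it is only assumed to resemble the scheme referred to here as Lr-1. The function names, the step size `a`, and the penalty probabilities are hypothetical choices made for the example.

```python
import random


def reward_inaction_update(p, chosen, rewarded, a=0.05):
    """Return the action probabilities after one trial.

    Standard linear reward-inaction rule (an assumption, not taken from
    the paper): on reward, move probability toward the chosen action;
    on penalty, leave the probabilities unchanged.
    """
    if not rewarded:
        return list(p)  # "inaction": no change on penalty
    q = [(1.0 - a) * pj for pj in p]  # shrink every action by factor (1 - a)
    q[chosen] = p[chosen] + a * (1.0 - p[chosen])  # chosen action gains the rest
    return q


def simulate(penalty_probs, steps=5000, a=0.05, seed=0):
    """Run the automaton in a stationary random environment.

    penalty_probs[i] is the fixed (hypothetical) probability that
    action i receives a penalty from the environment.
    """
    rng = random.Random(seed)
    r = len(penalty_probs)
    p = [1.0 / r] * r  # start from the uniform distribution over actions
    for _ in range(steps):
        chosen = rng.choices(range(r), weights=p)[0]
        rewarded = rng.random() >= penalty_probs[chosen]
        p = reward_inaction_update(p, chosen, rewarded, a)
    return p
```

For example, `simulate([0.2, 0.6])` returns the final action probabilities for a two-action environment in which the first action is penalised less often. With a small step size the probability mass typically concentrates on that action, which is the behaviour that ε-optimality formalises.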