Abstract
This paper is concerned with a learning control problem for finite Markov chains with unknown transition probabilities. The learning control consists of two different tasks: estimating the unknown probabilities and controlling the system. The control objective here is to maximize the total expected reward discounted in time, which, as is well known, is accomplished by consecutive application of an optimal policy. But, of course, the optimal policy cannot be found unless the transition probabilities are known in advance. Accordingly, the ultimate goal of the learning is to ensure convergence to the optimal policy, in other words, asymptotic optimality.
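In standard notation (the symbols below are illustrative and not necessarily those of the paper), the objective is

\[
\max_{\pi}\; E_{\pi}\!\left[\sum_{t=0}^{\infty} \beta^{t}\, r(x_t, a_t)\right], \qquad 0 < \beta < 1,
\]

and asymptotic optimality means that the learned policy eventually attains the optimal value $V^{*}$, which satisfies the optimality equation

\[
V^{*}(i) \;=\; \max_{a}\Big[\, r(i,a) + \beta \sum_{j} p_{ij}(a)\, V^{*}(j) \,\Big],
\]

where $p_{ij}(a)$ are the (unknown) transition probabilities.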
Generally speaking, however, it is very difficult to achieve asymptotic optimality, because there is an apparent conflict between estimation and control. It is pointed out that some of the learning control schemes which have been presented for this problem do not ensure convergence to the optimal policy. The others resolve that difficulty in their own ways and achieve asymptotic optimality, but they are not very practical, because the computational procedures they require for determining the control policy are quite complicated.
This paper presents a learning control scheme that, in addition to being asymptotically optimal, reduces policy determination to the maximization of a performance criterion, so that the practical procedures involved are much simpler. The criterion is designed to account for the conflict between estimation and control. For this scheme it is shown that the control policy converges to the optimal policy in the sense of frequency of application.
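To make the idea of such a criterion concrete, the following is a minimal, hypothetical sketch of one generic way to fold estimation and control into a single maximization step: estimate the transition probabilities from observed counts, compute the estimated discounted value, and add a count-based exploration bonus. All names, the bonus term, and the overall form are illustrative assumptions and not the criterion proposed in this paper.

```python
# Hypothetical sketch: a generic performance criterion that trades off
# estimation (exploration bonus) against control (estimated value).
# This is NOT the paper's criterion; it only illustrates the general idea.
import numpy as np

def choose_action(counts, rewards, state, beta=0.9, bonus=1.0, sweeps=200):
    """counts[a, i, j]: observed transitions i -> j under action a.
    rewards[i, a]: immediate reward for action a in state i."""
    n_actions, n_states, _ = counts.shape
    # Certainty-equivalence estimate of the transition probabilities,
    # falling back to a uniform distribution for unvisited pairs.
    totals = counts.sum(axis=2, keepdims=True)
    p_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    # Value iteration on the estimated model.
    v = np.zeros(n_states)
    for _ in range(sweeps):
        q = rewards.T + beta * (p_hat @ v)   # q[a, i]
        v = q.max(axis=0)
    # Performance criterion: estimated value plus an exploration bonus
    # that shrinks as the (state, action) pair is tried more often.
    visits = counts[:, state, :].sum(axis=1)
    criterion = q[:, state] + bonus / np.sqrt(1.0 + visits)
    return int(np.argmax(criterion))
```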