Abstract
We regard the sequence of returns as the output of a parametric compound source. Using the fact that the coding rate of the source measures the amount of information about the return, we describe l-learning algorithms, based on the idea of predictive coding, that estimate the expected information gain concerning future information. Using this information gain, we propose the ratio w of return loss to information gain as a new criterion for probabilistic action selection strategies. In experiments, the w-based strategy performed well compared with the conventional Q-based strategy.