Host: Japan Society for Fuzzy Theory and Intelligent Informatics (SOFT)
The UCB algorithm was proposed as an action selection method for the multi-armed bandit problem. In this method, an agent chooses an action by comparing the upper bounds of the confidence intervals of the estimated action values, and it thereby performs better than alternatives such as ε-greedy. In this paper, we propose a method for applying the UCB algorithm to Q-learning and experimentally evaluate its performance on a shortest path problem in a continuous state space.
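To illustrate the general idea of combining UCB action selection with Q-learning, the following is a minimal sketch and not the method evaluated in the paper. It assumes a tabular setting in which a continuous state has already been mapped to a discrete key, and it uses the standard UCB1 bonus sqrt(2 ln n / n_a). The function names `ucb_action` and `q_update`, the exploration weight `c`, and the parameters `alpha` and `gamma` are hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def ucb_action(q, counts, state, n_actions, c=1.0):
    """Return the action with the largest upper confidence bound in `state`.

    Assumption: UCB1-style bonus; the paper's exact bound may differ.
    """
    # Try every action once so that each count is nonzero.
    for a in range(n_actions):
        if counts[(state, a)] == 0:
            return a
    total = sum(counts[(state, a)] for a in range(n_actions))

    def bound(a):
        bonus = math.sqrt(2.0 * math.log(total) / counts[(state, a)])
        return q[(state, a)] + c * bonus

    return max(range(n_actions), key=bound)


def q_update(q, counts, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.9):
    """Standard one-step Q-learning update, plus a visit-count increment."""
    counts[(state, action)] += 1
    best_next = max(q[(next_state, a)] for a in range(n_actions))
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error


# Q-values and visit counts default to zero for unseen (state, action) pairs.
q = defaultdict(float)
counts = defaultdict(int)
```

In a training loop under these assumptions, the agent would call `ucb_action` to pick an action, observe the reward and next state from the environment, and then call `q_update`. Rarely tried actions keep a large bonus, so exploration is directed by the visit counts rather than by a fixed random rate as in ε-greedy.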