Abstract
In this paper, we integrate Discounted UCB1-tuned, which uses a weighted value and a weighted variance, into Q-learning agents and experimentally evaluate its performance. Discounted UCB1-tuned is an action selection method that balances exploration and exploitation and outperforms other methods, including ε-greedy. We examine the effect of the default values and the learning rate in a multi-armed bandit problem. Our algorithm selects an action whose value has not yet been updated or, failing that, the action with the highest UCB value among the updatable state-action pairs. We then present results on a shortest path problem in a continuous state space, followed by a discussion.
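To make the selection rule concrete, the following is a minimal bandit-level sketch, not the paper's exact formulation. It assumes the standard UCB1-tuned index computed from exponentially discounted counts, means, and variances, with rewards in [0, 1]. The class name `DiscountedUCB1Tuned`, the discount parameter `gamma`, and the update scheme are illustrative assumptions, and the integration with Q-learning is omitted.

```python
import math


class DiscountedUCB1Tuned:
    """Illustrative sketch: UCB1-tuned index on discounted statistics.

    Assumptions (not taken from the paper): rewards lie in [0, 1] and
    all statistics are decayed by `gamma` on every update.
    """

    def __init__(self, n_actions, gamma=0.99):
        self.gamma = gamma
        self.count = [0.0] * n_actions   # discounted selection counts
        self.sum_r = [0.0] * n_actions   # discounted sums of rewards
        self.sum_r2 = [0.0] * n_actions  # discounted sums of squared rewards

    def select(self):
        # An action whose value has never been updated is chosen first.
        for a, c in enumerate(self.count):
            if c == 0.0:
                return a
        # Otherwise choose the action with the highest UCB index.
        log_total = math.log(max(sum(self.count), 1.0))
        best_a, best_index = 0, -math.inf
        for a, c in enumerate(self.count):
            mean = self.sum_r[a] / c                         # weighted value
            var = max(self.sum_r2[a] / c - mean * mean, 0.0)  # weighted variance
            v = var + math.sqrt(2.0 * log_total / c)
            index = mean + math.sqrt(log_total / c * min(0.25, v))
            if index > best_index:
                best_a, best_index = a, index
        return best_a

    def update(self, action, reward):
        # Decay every statistic, then add the new observation.
        for a in range(len(self.count)):
            self.count[a] *= self.gamma
            self.sum_r[a] *= self.gamma
            self.sum_r2[a] *= self.gamma
        self.count[action] += 1.0
        self.sum_r[action] += reward
        self.sum_r2[action] += reward * reward
```

A typical loop calls `select()`, observes a reward for the returned action, and passes both to `update()`. Because older observations are down-weighted, the exploration bonus of an action that has not been chosen recently grows again, which is the property that suits non-stationary value estimates such as those of a Q-learning agent.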