Q-learning methods evaluate and update action values using information from the rewards obtained. Because a Q value cannot be updated until learning succeeds and a reward is received, there is no signal to guide learning until that point, and learning therefore requires a great deal of time. Moreover, in cases where the semi-shortest route from start to goal is a route with no spread in the maze, where the probability of learning failure is high, that semi-shortest route cannot be learned.
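To make the difficulty concrete, the following minimal sketch shows the standard tabular Q-learning update that the text refers to (the function name and grid-state encoding are illustrative, not from the paper). Until a reward is ever observed, every term in the update is zero, so the Q values provide no guidance:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Standard tabular Q-learning update: the value of (s, a) moves toward
    the observed reward plus the discounted best value of the next state.
    While all rewards seen so far are zero, every Q value stays at its
    initial 0, which is the lack of a learning signal described above."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Tiny illustration: a single rewarded transition in a grid maze.
Q = defaultdict(float)
actions = ["up", "down", "left", "right"]
q_update(Q, (0, 0), "right", 1.0, (0, 1), actions)
```

With alpha = 0.1 and all initial values at 0, this single rewarded step raises Q[((0, 0), "right")] to 0.1 while every other entry remains 0.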
To learn optimal actions and discover the semi-shortest path, it is essential to experience a large number of unknown states in the early stages of learning. To this end, we propose unknown-adventure Q-learning, in which the agent maintains an action history and adventurously seeks out unknown states that have not yet been recorded in that history. Whenever unknown states are present, the agent searches them boldly, without fear of failure. Because unknown-adventure Q-learning experiences a large number of states early in the learning process, subsequent actions can be selected so as to avoid previous failures.
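The selection rule described above can be sketched as follows. This is a hypothetical reading of the method, not the paper's implementation: we assume the history is a set of (state, action) pairs, that any action absent from the history counts as "unknown", and that the agent falls back to epsilon-greedy selection once every action at the current state is known:

```python
import random
from collections import defaultdict

def select_action(Q, history, s, actions, epsilon=0.1, rng=random):
    """Sketch of unknown-adventure action selection: prefer actions never
    tried in state s (not yet in the history), exploring them boldly; once
    all actions at s are known, fall back to ordinary epsilon-greedy
    selection over the learned Q values."""
    unknown = [a for a in actions if (s, a) not in history]
    if unknown:
        a = rng.choice(unknown)   # adventure into an unrecorded state-action
    elif rng.random() < epsilon:
        a = rng.choice(actions)   # occasional random exploration
    else:
        a = max(actions, key=lambda a2: Q[(s, a2)])  # exploit learned values
    history.add((s, a))           # record the chosen action in the history
    return a
```

Under these assumptions, the first visits to a state are guaranteed to cover all of its untried actions before any exploitation occurs, which is how a large number of states is experienced early in learning.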
This enables a massive acceleration of learning: the number of episodes required to learn a path from start to goal is reduced roughly 100-fold compared with the original Q-learning method. Moreover, our method can discover the semi-shortest path through a maze even when that path does not spread through the maze, a case in which learning failures are common and in which the semi-shortest path cannot be discovered by methods that use V-filters or action-region valuations to accelerate learning by emphasizing prior knowledge.