Proceedings of the Annual Conference of JSAI
Online ISSN: 2758-7347
38th (2024)
Session ID: 3Xin2-88

Target-oriented Exploration in Neural Contextual Bandits
*Shogo ITO, Tatsuji TAKAHASHI, Yu KONO

Abstract

Selection algorithms for ad delivery and recommendation are an indispensable part of Web services. Contextual bandit algorithms are particularly useful for reflecting human preferences in existing recommendation tasks, offering advantages such as real-time responsiveness and robustness to cold starts. Combining them with reinforcement learning, as in the RLHF tuning of ChatGPT, can also enable further adaptation to human preferences. In industrial applications, however, the emphasis tends to be on quickly reaching a specific performance standard rather than on extensive exploratory adaptation to the environment. We therefore focus on target-oriented achievement, a tendency observed in human decision-making. Regional Linear Risk-sensitive Satisficing (RegLinRS) is a meta-policy that incorporates this tendency, and Tsuboya et al. have shown that it performs well in environments with linear rewards. It can also be expected to perform well in environments with non-linear rewards, provided the reward model is expressive enough. We developed Neural Regional Risk-sensitive Satisficing (NeuralRegRS), an extension of RegLinRS to complex function approximation, and evaluated it in environments built from both artificial and real-world datasets.
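The abstract does not spell out how the satisficing rule behind RegLinRS and NeuralRegRS works. As background only, the sketch below shows the basic, non-contextual risk-sensitive satisficing (RS) value from the satisficing-bandit literature, RS_i = (n_i / N)(E_i - ℵ), where n_i is the pull count of arm i, N the total number of pulls, E_i the empirical mean reward, and ℵ the aspiration (target) level. It is an illustrative assumption, not the authors' RegLinRS or NeuralRegRS; the regional and neural components are not reproduced, and the function names `rs_values` and `rs_select` are hypothetical.

```python
from typing import Sequence


def rs_values(counts: Sequence[int], means: Sequence[float], aleph: float) -> list[float]:
    """Basic RS value per arm: (n_i / N) * (E_i - aleph).

    Illustrative sketch of plain K-armed RS, not RegLinRS/NeuralRegRS.
    counts: pull count n_i of each arm
    means:  empirical mean reward E_i of each arm
    aleph:  aspiration level (target reward), assumed to be given
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("pull every arm at least once before using RS")
    return [(n / total) * (m - aleph) for n, m in zip(counts, means)]


def rs_select(counts: Sequence[int], means: Sequence[float], aleph: float) -> int:
    """Return the index of the arm with the largest RS value (first one on ties)."""
    values = rs_values(counts, means, aleph)
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Arm 0 already meets the target (0.6 > 0.5), so RS keeps exploiting it
    # even though arm 1 currently looks better: target-oriented behavior.
    print(rs_select([10, 2], [0.6, 0.9], aleph=0.5))
    # No arm meets the target, so RS favors the less-tried arm: exploration.
    print(rs_select([10, 2], [0.4, 0.3], aleph=0.5))
```

The two calls illustrate the target-oriented tendency the abstract describes. When some arm's estimate exceeds ℵ, a larger n_i makes its RS value larger, so the policy settles on a satisfying arm. When every estimate falls short of ℵ, a smaller n_i makes the negative RS value closer to zero, so the policy explores.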

© 2024 The Japanese Society for Artificial Intelligence