Target-Oriented Exploration in Neural Bandits

伊東 将吾; 高橋 達二; 甲野 佑

doi:10.11517/pjsai.JSAI2024.0_3Xin288

Abstract

Selection algorithms for advertising delivery and recommendation are indispensable to Web services. Contextual bandit algorithms are particularly useful for reflecting human preferences in recommendation tasks, offering advantages such as real-time responsiveness and robustness to cold starts. Combining them with reinforcement learning, as in the RLHF tuning of ChatGPT, can allow further adaptation to human preferences. In industrial applications, however, the emphasis is on reaching specific standards quickly rather than on extensive exploratory adaptation to the environment. We therefore focused on target-oriented achievement, a tendency in human decision-making. Regional Linear Risk-sensitive Satisficing (RegLinRS) is a meta-policy that incorporates this tendency. Tsuboya et al. have shown that it performs well in environments with linear rewards, and it can be expected to perform well in environments with non-linear rewards as well. We developed Neural Regional Risk-sensitive Satisficing (NeuralRegRS), an extension of RegLinRS for complex function approximation, and tested its performance in environments built from both artificial and real-world datasets.
