Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
37th (2023)
Session ID : 2Q4-OS-27b-04
Expansion to Gaussian Distributional Rewards in Natural Reinforcement Learning
*Shoma OGAWA, Shuichi ARIMURA, Tatsuji TAKAHASHI, Yu KOHNO
Abstract

Reinforcement learning, a machine learning approach in which an agent learns behavior through interaction with its environment to maximize reward, has recently been actively studied and has made great progress. In particular, bandit algorithms are widely used, for example, in recommender systems including ad serving. However, reward maximization in such fields can be difficult due to the complexity and non-stationarity of human behavior. In such cases, securing a certain level of reward, rather than simply aiming at maximization, can be more important. Algorithms that take this approach also accord with known properties of human preferences, and they show excellent performance when that level is chosen properly. Risk-sensitive Satisficing (RS) is a natural reinforcement learning algorithm that incorporates such cognitive tendencies into exploration and aims to achieve a desired level of performance set as an objective. Although RS performs well with Bernoulli-distributed rewards, such as those used to indicate whether a user clicked on an advertisement or a product, in practical applications the bandit problem often involves continuous-valued rewards such as viewing time. In this study, we examine the performance of RS when applied to the bandit problem with real-valued rewards drawn from a normal distribution, and we provide some considerations.
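To make the setting concrete, the following is a minimal sketch of a satisficing policy of the RS type on a Gaussian-reward bandit, under stated assumptions: the decision value for each arm is taken as the pull count times the deviation of the empirical mean from an aspiration level (a common formulation of RS in the literature), the arm means, reward variance, and aspiration level are illustrative, and the function name `rs_bandit` is hypothetical. It is not the authors' implementation.

```python
import random

def rs_bandit(means, aleph, steps, seed=0):
    """Sketch of a Risk-sensitive Satisficing (RS) style policy on a
    Gaussian bandit. RS value assumed here: RS_i = n_i * (E_i - aleph),
    where n_i is the pull count of arm i, E_i its empirical mean reward,
    and aleph the aspiration level. The agent pulls argmax_i RS_i."""
    rng = random.Random(seed)
    k = len(means)
    counts = [1] * k        # one virtual pull per arm to avoid division by zero
    est = [aleph] * k       # estimates initialized at the aspiration level
    total = 0.0
    for _ in range(steps):
        # Reliability-weighted deviation from the aspiration level
        rs = [counts[i] * (est[i] - aleph) for i in range(k)]
        a = max(range(k), key=lambda i: rs[i])
        r = rng.gauss(means[a], 1.0)          # continuous (normal) reward
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]    # incremental mean update
        total += r
    return total / steps

# Illustrative run: one arm exceeds the aspiration level aleph = 0.8,
# so the policy should settle on it and secure roughly that reward level.
avg = rs_bandit(means=[0.0, 0.5, 1.0], aleph=0.8, steps=5000)
```

With an aspiration level between the arm means, an arm whose mean exceeds aleph accumulates a positive RS value that grows with its pull count, so exploration stops once a satisfactory arm is found; this is the "securing a certain level of reward" behavior described above, rather than pure maximization.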

© 2023 The Japanese Society for Artificial Intelligence