Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
35th (2021)
Session ID : 1G2-GS-2a-03

Risk-sensitive Satisficing policy with approximate estimation of reliability
*Akane MINAMI, Yuki YOSHII, Yu KONO, Tatsuji TAKAHASHI
Abstract

The development of deep reinforcement learning has enabled learning in continuous state-action spaces, with remarkable results such as computers surpassing humans in digital and analog games. However, the problem that it requires a huge number of trials and errors remains unsolved. To reduce the number of exploratory action selections, we focus on an adaptive method called satisficing, which stands in stark contrast to optimization: satisficing quickly searches for an action that satisfies a certain target level. The Risk-sensitive Satisficing (RS) model extends satisficing with "risk attitudes" based on the selection ratio of each action, which represents the uncertainty of the action's value. RS has been shown to learn the optimal action with a small amount of exploration and finitely bounded regret in multi-armed bandit problems when given an optimal target level. Linear RS (LinRS) is a linear function approximation of RS, but how to approximate the selection ratio of each action has not been sufficiently discussed. In this study, we propose StableLinRS, a new way to approximate the selection ratio in LinRS. We also show the usefulness of StableLinRS on contextual bandit problems in comparison with existing methods.
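
As a minimal sketch of how RS selects actions, consider the following Python example on a Bernoulli multi-armed bandit. The concrete form used here, RS(a) = rho_a * (Q(a) - aleph) with rho_a the empirical selection ratio and aleph the target level, is our reading of the model described above rather than a verbatim reproduction of the paper's algorithm; the arm means and the choice of aleph between the best and second-best arm (standing in for the "optimal target level" mentioned above) are illustrative assumptions.

import numpy as np

# Sketch of the Risk-sensitive Satisficing (RS) rule on a Bernoulli bandit.
# RS(a) = rho_a * (Q(a) - aleph), with rho_a the empirical selection ratio
# of action a, is an assumption based on the abstract, not the paper's
# exact formulation.

rng = np.random.default_rng(0)

true_means = np.array([0.2, 0.4, 0.6, 0.8])  # hypothetical arm means
aleph = 0.7           # target level between the best and second-best arm
K = len(true_means)

counts = np.ones(K)   # selection counts (start at 1 to avoid division by zero)
values = np.zeros(K)  # incremental value estimates Q(a)

for t in range(5000):
    ratio = counts / counts.sum()      # selection ratio rho_a of each arm
    rs = ratio * (values - aleph)      # RS value: reliability times satisficing gap
    a = int(np.argmax(rs))             # greedy in RS; no separate exploration term
    reward = float(rng.random() < true_means[a])   # Bernoulli reward draw
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]  # incremental mean update

print("arm pull counts:", counts.astype(int))

While no arm satisfies aleph, every RS value is negative, so the rule favors arms with a low selection ratio, which drives exploration; once the estimate of the 0.8 arm exceeds aleph, its RS value turns positive and grows with its selection ratio, so selection concentrates on it without any explicit exploration bonus.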

© 2021 The Japanese Society for Artificial Intelligence