A Proposal of a Contextual Bandit Algorithm Using Decision Trees and Upper Confidence Bounds

大岩 将; 阿部 太一; 木村 恵悟; 鈴木 佐俊; 後藤 正幸

doi:10.11517/pjsai.JSAI2024.0_4D3GS204

Abstract

A contextual bandit algorithm, used in online recommendation systems, makes sequential recommendations based on contextual information such as user attributes and past purchase history, and estimates rewards that indicate whether or not a product will be purchased. In doing so, it balances exploration, which recommends a variety of products, against exploitation, which recommends products with high expected rewards, in order to maximize the cumulative reward. To estimate rewards accurately from context, it is essential to assume a model suited to the situation, and the relationship between context and reward is often nonlinear. TreeBootstrap, which uses decision trees to estimate rewards, has been proposed as a method suited to such situations. However, because TreeBootstrap balances exploration and exploitation by bootstrapping the training data, it may fail to make full use of the information gained through exploration, especially during exploitation, and the cumulative reward may therefore be insufficient. In this study, we propose TreeUCB, which balances exploration and exploitation by using an upper confidence bound on the expected reward instead of bootstrap sampling of the training data. We demonstrate the effectiveness of the proposed method through experiments on synthetic and real data.
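
The abstract does not specify how TreeUCB computes its bound, so the following is only a minimal sketch of the general idea of combining a tree-style partition of the context space with an upper confidence bound. It is not the authors' method. The function `leaf_of` is a hypothetical, hand-written stand-in for a fitted decision tree, the context features (age, past purchases) and purchase probabilities are invented for illustration, and the bonus term is the standard UCB1 form, chosen here as an assumption.

```python
import math
import random
from collections import defaultdict


def leaf_of(context):
    """Hypothetical stand-in for a fitted decision tree: maps a context to a leaf id."""
    age, past_purchases = context
    if age < 30:
        return 0 if past_purchases < 3 else 1
    return 2


class LeafUCBArm:
    """Reward statistics for one arm (product), kept separately per leaf."""

    def __init__(self):
        self.count = defaultdict(int)    # number of observations per leaf
        self.total = defaultdict(float)  # summed reward per leaf

    def ucb(self, leaf, t, alpha=1.0):
        n = self.count[leaf]
        if n == 0:
            return math.inf  # unobserved leaf: force exploration
        mean = self.total[leaf] / n
        # Assumed UCB1-style bonus; the paper's actual bound may differ.
        return mean + alpha * math.sqrt(2.0 * math.log(t) / n)

    def update(self, leaf, reward):
        self.count[leaf] += 1
        self.total[leaf] += reward


def select_arm(arms, context, t):
    """Pick the arm with the highest upper confidence bound in the context's leaf."""
    leaf = leaf_of(context)
    scores = [arm.ucb(leaf, t) for arm in arms]
    return max(range(len(arms)), key=scores.__getitem__)


def simulate(rounds=2000, seed=0):
    rng = random.Random(seed)
    # Invented purchase probabilities, indexed as true_p[arm][leaf].
    true_p = [[0.8, 0.2, 0.1], [0.1, 0.7, 0.2]]
    arms = [LeafUCBArm() for _ in true_p]
    total = 0
    for t in range(1, rounds + 1):
        context = (rng.randint(18, 60), rng.randint(0, 6))
        arm = select_arm(arms, context, t)
        leaf = leaf_of(context)
        reward = 1 if rng.random() < true_p[arm][leaf] else 0
        arms[arm].update(leaf, reward)
        total += reward
    return total, arms
```

Calling `simulate()` returns the cumulative reward and the per-arm statistics. Unlike bootstrap resampling, every observed reward enters the leaf mean directly, and exploration comes only from the bonus term, which shrinks as a leaf accumulates observations.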
