Posterior Tracking Algorithm for Multi-Objective Thresholding Bandit Problems

鈴木 理矩; 中村 篤祥

doi:10.11517/pjsai.JSAI2024.0_4Xin268

Abstract

Multi-objective thresholding bandits aim to identify all good arms by repeatedly selecting one arm from a given set of K arms at each time step and observing its multi-dimensional reward. Here, an arm is said to be good if its expected reward in each dimension is no less than the threshold specified for that dimension. In the fixed-confidence setting, we derive the optimal allocation of arm draws, which achieves the asymptotic lower bound for this problem, and present an expression for the generalized likelihood ratio statistic used in the stopping condition. We apply these results, together with a posterior-sampling-based algorithm named P-Tracking, to this problem, and verify the effectiveness of P-Tracking on artificial data. We compare it experimentally against C-Tracking and D-Tracking, which correct their expected-reward estimates through forced exploration rather than using posterior sampling to explore while searching for the correct answer, and against a naive multi-dimensional extension of HDoC, which is effective for thresholding bandits with one-dimensional rewards. The results show that P-Tracking identifies all good arms with fewer samples on average.
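
As an illustration of the good-arm definition only, the following minimal Python sketch returns the arms whose expected reward meets the threshold in every dimension. The function name `good_arms` and the numeric values are hypothetical and are not taken from the paper; in the actual bandit problem the expected rewards are unknown and must be estimated from samples.

```python
def good_arms(means, thresholds):
    """Return indices of arms that are good in every reward dimension.

    means: list of K lists, each holding d expected rewards (assumed known here).
    thresholds: list of d per-dimension thresholds.
    """
    return [
        k
        for k, mu in enumerate(means)
        # An arm is good if no dimension falls below its threshold.
        if all(m >= t for m, t in zip(mu, thresholds))
    ]


# Hypothetical example with K = 3 arms and d = 2 reward dimensions.
example_means = [[0.7, 0.6], [0.8, 0.3], [0.5, 0.5]]
example_thresholds = [0.5, 0.5]
print(good_arms(example_means, example_thresholds))  # prints [0, 2]
```

Arm 1 is excluded because its second dimension (0.3) falls below the threshold, even though its first dimension is the highest of the three.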
