Host: The Japanese Society for Artificial Intelligence
Name: The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 38
Location: [in Japanese]
Date: May 28, 2024 - May 31, 2024
Large language models (LLMs), including ChatGPT, are commonly fine-tuned using Reinforcement Learning from Human Feedback (RLHF). However, learning reward models from limited human feedback is difficult because human preferences cannot be predicted perfectly, which leads to the problem of reward model overoptimization and poses a significant challenge in applying RLHF. In this study, we propose an approach that addresses this issue by learning multiple diverse reward models and evaluating rewards pessimistically. Specifically, we estimate the confidence of a computed reward from the variability among the outputs of the different reward models and use that estimate to evaluate the reward pessimistically. The effectiveness of this approach is validated experimentally.
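As an illustration of the idea described above, the following is a minimal PyTorch-style sketch of pessimistic reward evaluation with an ensemble of reward models. The class name RewardEnsemble, the coefficient beta, and the mean-minus-standard-deviation form of the penalty are assumptions made for this sketch; the abstract only states that reward confidence is estimated from the variability of the reward models' outputs and that rewards are evaluated pessimistically.

import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    # Ensemble of reward models with pessimistic (uncertainty-penalized) scoring.
    # Illustrative sketch only: the mean-minus-std form and the beta coefficient
    # are assumptions, not details taken from the paper.
    def __init__(self, reward_models, beta=1.0):
        super().__init__()
        self.reward_models = nn.ModuleList(reward_models)
        self.beta = beta  # weight on the disagreement (variability) penalty

    def forward(self, features):
        # Score the same batch with every reward model: shape (K, batch, 1).
        rewards = torch.stack([m(features) for m in self.reward_models], dim=0)
        mean = rewards.mean(dim=0)  # ensemble mean reward
        std = rewards.std(dim=0)    # disagreement across the ensemble
        # Pessimistic reward: downweight samples the reward models disagree on.
        return mean - self.beta * std

# Toy usage with three small reward heads over 16-dimensional features.
models = [nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)) for _ in range(3)]
ensemble = RewardEnsemble(models, beta=0.5)
features = torch.randn(4, 16)
pessimistic_reward = ensemble(features)  # shape (4, 1)

In this sketch, diversity among reward models would come from independent initialization or training on different subsets of the preference data, and the penalized reward would replace the single-model reward inside the RL fine-tuning loop.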