Host: The Japanese Society for Artificial Intelligence
Name: The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 38
Location: [in Japanese]
Date: May 28, 2024 - May 31, 2024
Large language models (LLMs), including ChatGPT, are commonly fine-tuned using Reinforcement Learning from Human Feedback (RLHF). However, learning reward models from limited human feedback is difficult because human preferences cannot be predicted perfectly, which leads to the problem of reward model overoptimization and poses a significant challenge in applying RLHF. In this study, we propose an approach that addresses this issue by learning multiple diverse reward models and evaluating rewards pessimistically. Specifically, we estimate the confidence of a computed reward from the variability among the outputs of the different reward models and use that estimate to evaluate the reward pessimistically. The effectiveness of this approach is validated experimentally.
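As an illustration of the idea described above, the following is a minimal PyTorch-style sketch of pessimistic reward evaluation with an ensemble of reward models. The class name RewardEnsemble, the coefficient beta, and the mean-minus-standard-deviation form of the penalty are assumptions made for this sketch; the abstract only states that reward confidence is estimated from the variability of the reward models' outputs and that rewards are evaluated pessimistically.

import torch
import torch.nn as nn

class RewardEnsemble(nn.Module):
    # Ensemble of reward models with pessimistic (uncertainty-penalized) scoring.
    # Illustrative sketch only: the mean-minus-std form and the beta coefficient
    # are assumptions, not details taken from the paper.
    def __init__(self, reward_models, beta=1.0):
        super().__init__()
        self.reward_models = nn.ModuleList(reward_models)
        self.beta = beta  # weight on the disagreement (variability) penalty

    def forward(self, features):
        # Score the same batch with every reward model: shape (K, batch, 1).
        rewards = torch.stack([m(features) for m in self.reward_models], dim=0)
        mean = rewards.mean(dim=0)  # ensemble mean reward
        std = rewards.std(dim=0)    # disagreement across the ensemble
        # Pessimistic reward: downweight samples the reward models disagree on.
        return mean - self.beta * std

# Toy usage with three small reward heads over 16-dimensional features.
models = [nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1)) for _ in range(3)]
ensemble = RewardEnsemble(models, beta=0.5)
features = torch.randn(4, 16)
pessimistic_reward = ensemble(features)  # shape (4, 1)

In this sketch, diversity among reward models would come from independent initialization or training on different subsets of the preference data, and the penalized reward would replace the single-model reward inside the RL fine-tuning loop.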