Proceedings of the 38th Annual Conference of JSAI (2024)
Online ISSN: 2758-7347
Session ID : 4Xin2-13

Pessimistic RLHF
*Tetsuro MORIMURA, Mitsuki SAKAMOTO
Abstract

Large language models (LLMs), including ChatGPT, are commonly fine-tuned with Reinforcement Learning from Human Feedback (RLHF). In RLHF, however, learning a reward model from limited human feedback is difficult: the model cannot predict human preferences perfectly, which leads to reward model overoptimization. This poses a significant challenge in applying RLHF. In this study, we propose an approach that addresses this issue by learning multiple, diverse reward models and evaluating rewards pessimistically. Specifically, we estimate the confidence of a reward from the variability of the outputs across the reward models and use that confidence to evaluate the reward pessimistically. The effectiveness of this approach is validated experimentally.
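
As a rough illustration of the idea described in the abstract, the following Python sketch aggregates scores from an ensemble of reward models into a single pessimistic reward. The aggregation rule (ensemble mean minus a multiple of the ensemble standard deviation), the coefficient beta, and the function name pessimistic_reward are assumptions introduced for this example; the paper's exact formulation may differ.

    # Minimal sketch: pessimistic reward aggregation over an ensemble of
    # reward models. The mean-minus-std rule below is an illustrative
    # assumption, not necessarily the formulation used in the paper.
    import numpy as np

    def pessimistic_reward(rewards: np.ndarray, beta: float = 1.0) -> np.ndarray:
        """Combine per-model reward estimates into a pessimistic score.

        rewards: array of shape (num_models, batch_size), one row per reward model.
        beta: how strongly disagreement among the models is penalized.
        """
        mean = rewards.mean(axis=0)   # average reward estimate
        std = rewards.std(axis=0)     # variability, used as (lack of) confidence
        return mean - beta * std      # pessimistic estimate

    # Example: three reward models scoring two candidate responses.
    ensemble_scores = np.array([
        [0.90, 0.20],   # reward model 1
        [0.80, 0.90],   # reward model 2
        [0.85, 0.10],   # reward model 3
    ])
    print(pessimistic_reward(ensemble_scores, beta=1.0))
    # The second response has models disagreeing strongly, so its pessimistic
    # reward is much lower than its mean, discouraging overoptimization.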

© 2024 The Japanese Society for Artificial Intelligence