Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 1B3-GS-2-02

Assessing Distribution Shift in Reinforcement Learning from Human Feedback
*Mitsuki SAKAMOTO, Tetsuro MORIMURA, Yuu JINNAI, Kenshi ABE, Kaito ARIU
Abstract

Reinforcement learning from human feedback (RLHF) is often used for fine-tuning large language models (LLMs). The RLHF pipeline consists of four processes: (1) supervised fine-tuning (SFT) of the LLM, (2) ranking of texts generated by the SFT model based on human preference, (3) training of a reward model on the preference data, and (4) reinforcement learning of the SFT model using the reward model. Because gathering human preference data is costly, public datasets are often used to train the reward model. Since the model that generated such data differs from the SFT model, there is a distribution shift between the data used to train the reward model and the data it must evaluate. In this study, to analyze the effect of distribution shift, we create external preference datasets generated with LLMs different from the SFT model. We perform RLHF with these datasets to artificially introduce distribution shift into the RLHF process, so that we can elucidate the situations in which distribution shift poses a problem. Our experimental results show a decrease in the quality of the RLHF model when external preference datasets are used, suggesting the impact of distribution shift.
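
As a concrete illustration of step (3), the sketch below shows how a reward model is typically trained on preference pairs with a Bradley-Terry-style pairwise loss. This is a minimal, hypothetical example, not the authors' implementation; the names `reward_model`, `preference_loss`, and the batch keys are assumptions for illustration.

```python
# Minimal sketch of reward-model training on preference data (RLHF step (3)).
# Assumes a `reward_model` that maps tokenized text to a scalar score per example.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the preferred (chosen) response
    to score higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def train_step(reward_model, optimizer, batch):
    # batch holds tokenized (prompt + chosen) and (prompt + rejected) texts;
    # the key names here are hypothetical.
    r_chosen = reward_model(batch["chosen_input_ids"])      # shape: (B,)
    r_rejected = reward_model(batch["rejected_input_ids"])  # shape: (B,)
    loss = preference_loss(r_chosen, r_rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A distribution shift arises when the (prompt, chosen, rejected) pairs fed to such a training loop come from a generation model other than the SFT model whose outputs the reward model later scores during reinforcement learning.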

© 2024 The Japanese Society for Artificial Intelligence