Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 1B3-GS-2-02

Assessing Distribution Shift in Reinforcement Learning from Human Feedback
*Mitsuki SAKAMOTO, Tetsuro MORIMURA, Yuu JINNAI, Kenshi ABE, Kaito ARIU
Abstract

Reinforcement learning from human feedback (RLHF) is often used for fine-tuning large language models (LLMs). The RLHF pipeline consists of four processes: (1) supervised fine-tuning (SFT) of the LLM, (2) ranking of texts generated by the SFT model based on human preference, (3) training of a reward model on the preference data, and (4) reinforcement learning of the SFT model using the reward model. Because gathering human preference data is costly, public datasets are often used to train the reward model. Since the model that generated such data differs from the SFT model, there is a distribution shift between the data used to train the reward model and the data it must evaluate. In this study, to analyze the effect of distribution shift, we create external preference datasets generated with LLMs different from the SFT model. We perform RLHF with these datasets to artificially introduce distribution shift into the RLHF process, so that we can elucidate the situations in which distribution shift poses a problem. Our experimental results show a decrease in the quality of the RLHF model when external preference datasets are used, suggesting the impact of distribution shift.
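
As a concrete illustration of step (3), the sketch below shows how a reward model is typically trained on preference pairs with a Bradley-Terry-style pairwise loss. This is a minimal, hypothetical example, not the authors' implementation; the names `reward_model`, `preference_loss`, and the batch keys are assumptions for illustration.

```python
# Minimal sketch of reward-model training on preference data (RLHF step (3)).
# Assumes a `reward_model` that maps tokenized text to a scalar score per example.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the preferred (chosen) response
    to score higher than the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def train_step(reward_model, optimizer, batch):
    # batch holds tokenized (prompt + chosen) and (prompt + rejected) texts;
    # the key names here are hypothetical.
    r_chosen = reward_model(batch["chosen_input_ids"])      # shape: (B,)
    r_rejected = reward_model(batch["rejected_input_ids"])  # shape: (B,)
    loss = preference_loss(r_chosen, r_rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A distribution shift arises when the (prompt, chosen, rejected) pairs fed to such a training loop come from a generation model other than the SFT model whose outputs the reward model later scores during reinforcement learning.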

© 2024 The Japanese Society for Artificial Intelligence