Host: The Japanese Society for Artificial Intelligence
Name: The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 38
Location: [in Japanese]
Date: May 28, 2024 - May 31, 2024
Amid the rapid development of Large Language Models (LLMs), training corpora continue to grow in pursuit of higher-performing models. However, not every text in such large-scale corpora is of high quality, and the low-quality texts inevitably swept up during collection can hinder improvements in model performance. This study proposes a robust learning method that mitigates the impact of this noise when pre-training language models on corpora containing the low-quality texts found in real-world data sources. Specifically, we focus on the broad class of Bregman divergences, employing the β-divergence and the γ-divergence, which belong to this class and are effective in robust statistics. In experiments on fine-tuning and additional pre-training of BERT, we demonstrate that the proposed method trains robustly on noisy texts and labels compared with the conventional training objective based on the KL divergence.
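For reference, a commonly used parameterization of the two divergences from the robust statistics literature is sketched below; this is a standard textbook form, not necessarily the exact formulation used in the paper. Both reduce to the KL divergence in the limit β → 0 and γ → 0, which is why they can serve as drop-in replacements for the usual KL-based (cross-entropy) objective.

D_\beta(p \,\|\, q) = \frac{1}{\beta} \int p(x)\bigl(p(x)^{\beta} - q(x)^{\beta}\bigr)\,dx - \frac{1}{\beta + 1} \int \bigl(p(x)^{\beta + 1} - q(x)^{\beta + 1}\bigr)\,dx

D_\gamma(p \,\|\, q) = \frac{1}{\gamma(\gamma + 1)} \log \int p(x)^{\gamma + 1}\,dx - \frac{1}{\gamma} \log \int p(x)\,q(x)^{\gamma}\,dx + \frac{1}{\gamma + 1} \log \int q(x)^{\gamma + 1}\,dx

The standard robustness intuition is that the per-example terms of the resulting empirical loss are weighted by a power of the model probability q(x), so examples to which the model assigns very low probability, such as outlying noisy texts or mislabeled instances, contribute less to the gradient than they would under the KL divergence.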