Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 4N3-GS-6-04

Robust Pre-Training on Low-Quality Texts via Bregman Divergence
*Takumi NAITO, Yoichi ISHIBASHI, Hidetoshi SHIMODAIRA
Abstract

Amid the rapid development of Large Language Models (LLMs), there is an ongoing trend toward enlarging training corpora to obtain high-performance models. However, not all texts in such large-scale corpora are of high quality, and the low-quality texts inevitably included in these extensively collected corpora can hinder improvements in model performance. This study proposes a robust learning method that mitigates the impact of such noise when pre-training language models on corpora containing low-quality texts, as found in real-world data sources. Specifically, we focus on the broad class of Bregman divergences and employ the β-divergence and γ-divergence, members of this class known to be effective in robust statistics, as training objectives. In our experiments, we conducted fine-tuning and additional pre-training of BERT, demonstrating that the proposed method trains robustly on noisy training texts and labels compared with the conventional training approach based on the KL divergence.
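The abstract does not give the exact loss formulas, but as an illustration, the following is a minimal PyTorch-style sketch of a classification loss based on the β-divergence (density power divergence), which reduces to the standard cross-entropy (KL-based) objective as β → 0. The particular normalization, the value β = 0.1, and the function name beta_cross_entropy are assumptions for illustration only, not the authors' implementation.

import torch
import torch.nn.functional as F

def beta_cross_entropy(logits, targets, beta=0.1, ignore_index=-100):
    # Robust classification loss derived from the beta-divergence
    # (density power divergence). As beta -> 0 it recovers the usual
    # cross-entropy (KL-based) loss up to an additive constant.
    #   logits:  (N, C) unnormalized class scores
    #   targets: (N,)   class indices; ignore_index entries are skipped
    mask = targets != ignore_index
    logits, targets = logits[mask], targets[mask]

    probs = F.softmax(logits, dim=-1)                           # model distribution q(k|x)
    p_y = probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # q(y|x) of the observed label

    # -(q(y)^beta - 1)/beta  +  (1/(1+beta)) * sum_k q(k)^(1+beta)
    # The first term tends to -log q(y) as beta -> 0; the second tends to 1.
    loss = -(p_y.pow(beta) - 1.0) / beta \
           + probs.pow(1.0 + beta).sum(dim=-1) / (1.0 + beta)
    return loss.mean()

In masked-language-model additional pre-training or fine-tuning, such a loss would replace the token-level cross-entropy over predicted positions; a γ-divergence variant can be substituted analogously. The abstract does not specify the hyperparameter settings or the exact formulation used in the paper.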

© 2024 The Japanese Society for Artificial Intelligence