Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
Large language models (LLMs) have shown remarkable capabilities in code generation. To improve performance on such target tasks, it is essential to train LLMs on domain-specific corpora containing specialized terms and domain knowledge. However, such corpora are in significantly short supply, and the effort and time required to build a new corpus are considerable. In this study, we introduce domain sampling, an efficient approach to building a domain-specific corpus by extracting texts from a large general corpus. We propose building a vocabulary model enriched with domain-specific terms using SentencePiece and classifying texts as related or unrelated to the domain based on their tokenization results. In our experiments, we found that when an LLM was pre-trained from scratch on a corpus collected with the proposed method, its ability to generate code from Japanese was improved.
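The following is a minimal sketch of the domain-sampling idea described in the abstract, assuming SentencePiece's Python API. The domain terms, file names, and the specific classification rule (flagging a text as in-domain when the share of domain-vocabulary tokens in its tokenization exceeds a threshold) are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch: build a SentencePiece vocabulary enriched with domain terms,
# then filter a general corpus by each text's tokenization result.
import sentencepiece as spm

# Hypothetical domain-specific terms injected into the vocabulary.
DOMAIN_TERMS = ["binary search", "linked list", "recursion"]

# Train a vocabulary model on a small seed of domain text
# (domain_seed.txt is a hypothetical file of in-domain sentences).
spm.SentencePieceTrainer.train(
    input="domain_seed.txt",
    model_prefix="domain_sp",
    vocab_size=8000,
    user_defined_symbols=DOMAIN_TERMS,
)

sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
DOMAIN_IDS = {sp.piece_to_id(t) for t in DOMAIN_TERMS}

def is_domain_text(text: str, threshold: float = 0.05) -> bool:
    """Classify `text` as domain-related if the fraction of
    domain-term tokens in its tokenization exceeds `threshold`
    (an assumed rule for illustration)."""
    ids = sp.encode(text)
    if not ids:
        return False
    hits = sum(1 for i in ids if i in DOMAIN_IDS)
    return hits / len(ids) >= threshold

# Extract a domain-specific corpus from a large general corpus.
with open("general_corpus.txt") as src, open("domain_corpus.txt", "w") as dst:
    for line in src:
        if is_domain_text(line):
            dst.write(line)
```

One appeal of this design is that the expensive part (vocabulary training) happens once on a small seed, after which classifying each candidate text costs only a single tokenization pass.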