Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 4Xin2-86

The Effects of Pre-Training LLMs with Domain Corpus Sampling
*Yui OBARA, Nao SOUMA, Teruno KAJIURA, Kimio KURAMITSU
Abstract

Large language models (LLMs) have shown remarkable capabilities in code generation. To improve performance on such target tasks, it is essential to train LLMs on a domain-specific corpus containing specialized terms and domain knowledge. However, such corpora are scarce, and building a new corpus requires considerable effort and time. In this study, we introduce domain sampling, an efficient approach for building a domain-specific corpus by extracting texts from a large general corpus. We propose building a vocabulary model enriched with domain-specific terms using SentencePiece and classifying texts as related or unrelated to the domain based on their tokenization results. In our experiments, we found that when an LLM was pre-trained from scratch on the corpus collected by our proposed method, its ability to generate code from Japanese improved.
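As a rough illustration of the domain sampling idea described above, the following Python sketch trains a SentencePiece vocabulary on a small seed of domain text and keeps general-corpus lines whose tokenization is well covered by that vocabulary. The coverage heuristic, the threshold, the file names, and the helper is_domain_related are assumptions made for illustration; the abstract only states that texts are classified as domain-related or not based on their tokenization results.

```python
# Hypothetical sketch of domain sampling (not the authors' exact procedure).
import sentencepiece as spm

# 1. Build a vocabulary model enriched with domain-specific terms from a small
#    seed corpus of in-domain text (e.g., code-related Japanese documents).
spm.SentencePieceTrainer.train(
    input="domain_seed.txt",      # hypothetical seed file of domain text
    model_prefix="domain_sp",
    vocab_size=16000,
)
sp = spm.SentencePieceProcessor(model_file="domain_sp.model")

def is_domain_related(text: str, threshold: float = 0.5) -> bool:
    """Treat a text as domain-related when the domain vocabulary covers it well,
    i.e. few pieces fall back to single characters (assumed criterion)."""
    pieces = sp.encode(text, out_type=str)
    if not pieces:
        return False
    # Pieces that reduce to a single character suggest the domain vocabulary
    # did not recognize that span; a low fraction means good domain coverage.
    fallback = sum(1 for p in pieces if len(p.lstrip("▁")) <= 1)
    return fallback / len(pieces) < threshold

# 2. Filter a large general corpus down to a domain-specific corpus.
with open("general_corpus.txt", encoding="utf-8") as src, \
     open("domain_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if is_domain_related(line.strip()):
            dst.write(line)
```

In this sketch the filtered output would then serve as the pre-training corpus; the actual classification rule and vocabulary size used in the paper may differ.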

© 2024 The Japanese Society for Artificial Intelligence