Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
Large language models (LLMs) have shown remarkable capabilities in code generation. To improve performance on such target tasks, it is essential to train LLMs on domain-specific corpora containing specialized terms and domain knowledge. However, such corpora are in significantly short supply, and the effort and time required to build a new corpus are considerable. In this study, we introduce domain sampling, an efficient approach to building a domain-specific corpus by extracting texts from a large general corpus. We propose building a vocabulary model enriched with domain-specific terms using SentencePiece and classifying texts as related or unrelated to the domain based on their tokenization results. In our experiments, we found that when an LLM was pre-trained from scratch on a corpus collected with the proposed method, its ability to generate code from Japanese was improved.
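The following is a minimal sketch of the domain-sampling idea described in the abstract, assuming SentencePiece's Python API. The domain terms, file names, and the specific classification rule (flagging a text as in-domain when the share of domain-vocabulary tokens in its tokenization exceeds a threshold) are illustrative assumptions, not the paper's exact criterion.

```python
# Sketch: build a SentencePiece vocabulary enriched with domain terms,
# then filter a general corpus by each text's tokenization result.
import sentencepiece as spm

# Hypothetical domain-specific terms injected into the vocabulary.
DOMAIN_TERMS = ["binary search", "linked list", "recursion"]

# Train a vocabulary model on a small seed of domain text
# (domain_seed.txt is a hypothetical file of in-domain sentences).
spm.SentencePieceTrainer.train(
    input="domain_seed.txt",
    model_prefix="domain_sp",
    vocab_size=8000,
    user_defined_symbols=DOMAIN_TERMS,
)

sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
DOMAIN_IDS = {sp.piece_to_id(t) for t in DOMAIN_TERMS}

def is_domain_text(text: str, threshold: float = 0.05) -> bool:
    """Classify `text` as domain-related if the fraction of
    domain-term tokens in its tokenization exceeds `threshold`
    (an assumed rule for illustration)."""
    ids = sp.encode(text)
    if not ids:
        return False
    hits = sum(1 for i in ids if i in DOMAIN_IDS)
    return hits / len(ids) >= threshold

# Extract a domain-specific corpus from a large general corpus.
with open("general_corpus.txt") as src, open("domain_corpus.txt", "w") as dst:
    for line in src:
        if is_domain_text(line):
            dst.write(line)
```

One appeal of this design is that the expensive part (vocabulary training) happens once on a small seed, after which classifying each candidate text costs only a single tokenization pass.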