Host : The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : [in Japanese]
Date : June 06, 2023 - June 09, 2023
Given a downstream task, additional pretraining of a language model on a corpus from the task's domain is known to be effective for adapting the model to the task. Existing studies assume that a domain corpus, or downstream-task training data large enough for additional pretraining, is already available; however, this is not always the case in practice. This paper proposes a method for extracting a subcorpus suitable for additional pretraining from a huge general corpus on the basis of the available training data for the downstream task. We also present experimental results showing that additional pretraining on a subcorpus extracted with our method improves performance on the downstream task.
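To make the idea of task-driven subcorpus extraction concrete, the sketch below ranks documents in a large corpus by TF-IDF cosine similarity to the downstream-task training texts and keeps the most similar ones. This is only a minimal illustration under assumed choices (TF-IDF features, centroid similarity, a top-k cutoff), not the paper's actual extraction method; all function and variable names are hypothetical.

```python
# Illustrative sketch of task-driven subcorpus extraction (NOT the paper's method):
# rank corpus documents by similarity to the downstream-task training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extract_subcorpus(task_texts, corpus_docs, top_k=1000):
    """Select the top_k corpus documents most similar to the task training texts."""
    vectorizer = TfidfVectorizer(max_features=50000)
    # Fit on both collections so they share a single vocabulary.
    vectorizer.fit(task_texts + corpus_docs)
    task_vecs = vectorizer.transform(task_texts)
    corpus_vecs = vectorizer.transform(corpus_docs)
    # Represent the downstream task by the centroid of its training texts.
    task_centroid = np.asarray(task_vecs.mean(axis=0))
    scores = cosine_similarity(corpus_vecs, task_centroid).ravel()
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [corpus_docs[i] for i in top_idx]

# Usage (illustrative):
# subcorpus = extract_subcorpus(task_train_texts, web_corpus_docs, top_k=100_000)
# The extracted subcorpus would then be used for additional pretraining
# (e.g., masked language modeling) before fine-tuning on the downstream task.
```

In practice, the similarity function, feature representation, and selection threshold are all design choices that the paper's method may define differently; the sketch only conveys the overall pipeline of selecting task-relevant documents and then continuing pretraining on them.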