Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
37th (2023)
Session ID : 3Xin4-03
Subcorpus Extraction from a Huge Corpus for Task Adaptation of a Language Model
*Shota MOTOURA, Kosuke AKIMOTO, Junta MAKIO, Kunihiko SADAMASA
Abstract

Given a downstream task, additional pretraining of a language model on a domain corpus is known to be effective for adapting the model to the task. Existing studies assume that a domain corpus, or downstream-task training data sufficient for additional pretraining, is available; however, this is not always the case in practice. This paper proposes a method for extracting a subcorpus suitable for additional pretraining from a huge corpus on the basis of the available training data for the downstream task. We also present experimental results showing that a subcorpus extracted with our method improves performance on the downstream task.
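The abstract does not specify how documents are selected from the huge corpus. As an illustration of the general idea only, the following is a minimal sketch of one plausible realization: ranking corpus documents by TF-IDF cosine similarity to the downstream-task training texts and keeping the top-ranked ones as the subcorpus. The function name `extract_subcorpus` and all parameters are hypothetical, not the paper's method.

```python
# Illustrative sketch only: the paper's actual extraction method is not
# given in this abstract. This assumes a simple similarity-based filter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def extract_subcorpus(task_texts, corpus_texts, top_k=1000):
    """Rank corpus documents by similarity to downstream-task training
    texts and return the top_k most similar ones as the subcorpus."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    # Fit the vocabulary on both sources so the vector spaces align.
    vectorizer.fit(task_texts + corpus_texts)
    task_vecs = vectorizer.transform(task_texts)
    corpus_vecs = vectorizer.transform(corpus_texts)
    # Score each corpus document by its maximum similarity to any task text.
    scores = cosine_similarity(corpus_vecs, task_vecs).max(axis=1)
    ranked = scores.argsort()[::-1][:top_k]
    return [corpus_texts[i] for i in ranked]


# Hypothetical usage: `task_texts` holds downstream-task training examples,
# `corpus_texts` is the huge general corpus to filter; the returned
# subcorpus would then be used for additional pretraining.
# subcorpus = extract_subcorpus(task_texts, corpus_texts, top_k=10_000)
```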

© 2023 The Japanese Society for Artificial Intelligence