Host : The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : [in Japanese]
Date : June 06, 2023 - June 09, 2023
Given a downstream task, additional pretraining of a language model on a corpus from the task's domain is known to be effective for adapting the model to the task. Existing studies assume that a domain corpus, or downstream-task training data large enough for additional pretraining, is already available; however, this is not always the case in practice. This paper proposes a method for extracting a subcorpus suitable for additional pretraining from a huge general corpus on the basis of the available training data for the downstream task. We also present experimental results showing that additional pretraining on a subcorpus extracted with our method improves performance on the downstream task.
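To make the idea of task-driven subcorpus extraction concrete, the sketch below ranks documents in a large corpus by TF-IDF cosine similarity to the downstream-task training texts and keeps the most similar ones. This is only a minimal illustration under assumed choices (TF-IDF features, centroid similarity, a top-k cutoff), not the paper's actual extraction method; all function and variable names are hypothetical.

```python
# Illustrative sketch of task-driven subcorpus extraction (NOT the paper's method):
# rank corpus documents by similarity to the downstream-task training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def extract_subcorpus(task_texts, corpus_docs, top_k=1000):
    """Select the top_k corpus documents most similar to the task training texts."""
    vectorizer = TfidfVectorizer(max_features=50000)
    # Fit on both collections so they share a single vocabulary.
    vectorizer.fit(task_texts + corpus_docs)
    task_vecs = vectorizer.transform(task_texts)
    corpus_vecs = vectorizer.transform(corpus_docs)
    # Represent the downstream task by the centroid of its training texts.
    task_centroid = np.asarray(task_vecs.mean(axis=0))
    scores = cosine_similarity(corpus_vecs, task_centroid).ravel()
    top_idx = np.argsort(scores)[::-1][:top_k]
    return [corpus_docs[i] for i in top_idx]

# Usage (illustrative):
# subcorpus = extract_subcorpus(task_train_texts, web_corpus_docs, top_k=100_000)
# The extracted subcorpus would then be used for additional pretraining
# (e.g., masked language modeling) before fine-tuning on the downstream task.
```

In practice, the similarity function, feature representation, and selection threshold are all design choices that the paper's method may define differently; the sketch only conveys the overall pipeline of selecting task-relevant documents and then continuing pretraining on them.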