2024, Volume 31, Issue 2, Pages 707–732
In this study, we constructed a large corpus of minutes from national and local assemblies published on the web. Using this corpus of meeting records, we developed pre-trained language models adapted to the Japanese political domain, including several model variants. Our models outperformed conventional models on tasks in the political domain while achieving comparable performance on tasks outside it. We also showed that increasing the number of training steps during domain adaptation through additional pre-training significantly improves performance. Furthermore, reusing the corpus from the initial pre-training during this additional pre-training enhances performance in the adapted domain while maintaining performance in non-adapted domains.
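As a minimal sketch of the additional pre-training setup described in the abstract, the following assumes continued masked-language-model training of a Japanese BERT on a mixture of the political-domain corpus and the original pre-training corpus. The base checkpoint, file paths, mixing ratio, and hyperparameters are illustrative assumptions, not the paper's actual configuration; interleaving the two corpora is one simple way to realize the observation that reusing the initial pre-training corpus preserves performance in non-adapted domains.

```python
# Sketch: domain-adaptive (additional) pre-training that mixes the
# political-domain corpus with the original pre-training corpus.
# Model name, file paths, mixing ratio, and hyperparameters are
# assumptions for illustration, not the paper's reported setup.
from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical base checkpoint and plain-text corpora (one segment per line).
base_model = "cl-tohoku/bert-base-japanese-v2"  # assumed Japanese BERT
political = load_dataset("text", data_files="assembly_minutes.txt", split="train")
general = load_dataset("text", data_files="general_corpus.txt", split="train")

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Interleave the two corpora so training draws from both domains,
# mitigating forgetting of the non-adapted domains.
mixed = interleave_datasets(
    [political.map(tokenize, batched=True, remove_columns=["text"]),
     general.map(tokenize, batched=True, remove_columns=["text"])],
    probabilities=[0.5, 0.5],  # assumed mixing ratio
    seed=42,
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="political-bert",
    max_steps=100_000,  # the paper finds more steps help; this value is illustrative
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=mixed, data_collator=collator).train()
```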