Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Building a Large Corpus and Pre-trained Language Models from National and Local Assembly Minutes
Keiyu Nagafuchi, Yasutomo Kimura, Kazuma Kadowaki, Kenji Araki

2024 Volume 31 Issue 2 Pages 707-732

Abstract

In this study, we collected minutes of national and local assemblies published on the web and constructed a large corpus from them. Using this corpus of meeting records, we then developed pre-trained language models adapted to the Japanese political domain, along with several model variants. Our models outperformed conventional models on tasks in the political domain while performing comparably on tasks outside it. We also showed that increasing the number of training steps during domain adaptation with additional pre-training significantly improves performance. Furthermore, leveraging the corpus from the initial pre-training enhances performance in the adapted domain while maintaining performance in non-adapted domains.
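To make the domain-adaptation step concrete, the following is a minimal sketch, not the authors' released code, of additional pre-training with Hugging Face Transformers: masked-language-model training of a Japanese BERT checkpoint is continued on assembly-minutes text. The base checkpoint name, the data file "minutes.txt", and all hyperparameters are illustrative assumptions rather than the paper's actual settings.

    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )
    from datasets import load_dataset

    # Assumed Japanese base checkpoint; the paper's models may start elsewhere.
    base = "cl-tohoku/bert-base-japanese"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    # Hypothetical corpus file: one assembly-minutes utterance per line.
    dataset = load_dataset("text", data_files={"train": "minutes.txt"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

    # Standard BERT-style masking: 15% of tokens are masked for prediction.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    args = TrainingArguments(
        output_dir="bert-japanese-minutes",
        max_steps=100_000,  # the abstract reports gains from more adaptation steps
        per_device_train_batch_size=32,
        learning_rate=5e-5,
        save_steps=10_000,
    )

    Trainer(
        model=model,
        args=args,
        data_collator=collator,
        train_dataset=tokenized,
    ).train()

The same loop could mix in text from the initial pre-training corpus alongside the minutes, which is one way to realize the abstract's finding that retaining the original corpus preserves performance in non-adapted domains.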

© 2024 The Association for Natural Language Processing