Joho Chishiki Gakkaishi
Online ISSN : 1881-7661
Print ISSN : 0917-1436
ISSN-L : 0917-1436
An Empirical Comparison and Ensemble Learning Methods of BERT Models on Authorship Attribution
Taisei KANDA, Mingzhe JIN
JOURNAL FREE ACCESS

2024 Volume 34 Issue 3 Pages 244-255

Abstract

Bidirectional Encoder Representations from Transformers (BERT) is a general-purpose language model designed to be pre-trained on a large amount of data, fine-tuned, and then adapted to tasks in individual fields. Japanese BERT models have been released that were pre-trained on relatively easy-to-obtain corpora such as Wikipedia, Aozora Bunko, and Japanese business news articles. In this study, we compared the performance of multiple BERT models built from different pre-training data on an authorship attribution task, and analyzed the impact of the pre-training data on individual tasks. We also studied methods to improve the accuracy of authorship attribution models through ensemble learning with multiple BERT models. As a result, we found that a BERT model pre-trained on the Aozora Bunko corpus performed well at identifying the authors of Aozora Bunko texts. This clearly shows that pre-training data affects the performance of a model on individual tasks. We also found that an ensemble learning architecture comprising multiple BERT models performed better than a single model.
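One common way to ensemble several fine-tuned classifiers, as the abstract describes, is soft voting: each model produces logits over the candidate authors, the logits are converted to probabilities, and the per-model probabilities are averaged before taking the argmax. The paper does not specify its ensemble architecture, so the following is only a minimal illustrative sketch of soft voting; the logit values and the three-model, three-author setup are hypothetical, not taken from the study.

```python
import numpy as np

def softmax(logits):
    """Convert a 1-D array of logits to class probabilities."""
    e = np.exp(logits - np.max(logits))  # shift for numerical stability
    return e / e.sum()

def soft_vote(model_logits):
    """Average per-model class probabilities and return the winning class.

    model_logits: list of 1-D arrays, one per BERT model, each holding
    that model's logits over the candidate authors for one document.
    """
    probs = np.mean([softmax(l) for l in model_logits], axis=0)
    return int(np.argmax(probs)), probs

# Hypothetical outputs of three fine-tuned BERT models for one document,
# as logits over three candidate authors.
logits_per_model = [
    np.array([2.0, 0.5, 0.1]),  # model A favors author 0
    np.array([0.3, 1.8, 0.2]),  # model B favors author 1
    np.array([2.5, 0.4, 0.0]),  # model C favors author 0
]

pred, probs = soft_vote(logits_per_model)  # two of three models agree on author 0
```

Because soft voting averages full probability distributions rather than counting hard votes, a model that is only mildly confident contributes proportionally less to the final decision.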

© 2024 Japan Society of Information and Knowledge