2024 Volume 34 Issue 3 Pages 244-255
Bidirectional Encoder Representations from Transformers (BERT) is a general-purpose language model designed to be pre-trained on large amounts of text, fine-tuned, and then adapted to downstream tasks in individual domains. Japanese BERT models have been released that were trained on data from Wikipedia, Aozora Bunko, and Japanese business news articles, which are relatively easy to obtain. In this study, we compared the performance of multiple BERT models built from different pre-training data on an author attribution task and analyzed the impact of the pre-training data on individual tasks. We also studied methods to improve the accuracy of author attribution models through ensemble learning with multiple BERT models. As a result, we found that a BERT model pre-trained on the Aozora Bunko corpus performed well at estimating the authors of Aozora Bunko texts. This clearly shows that the pre-training data affects model performance on individual tasks. We also found that the performance of an ensemble learning architecture comprising multiple BERT models was better than that of a single model.
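To make the workflow concrete, the following is a minimal sketch, not the authors' exact setup, of how a pre-trained Japanese BERT checkpoint can be given a classification head for author attribution and how several such models can be combined by soft voting (averaging their predicted probabilities). The checkpoint names, number of candidate authors, and sample text are illustrative placeholders; the actual models and ensembling method used in the study are not specified in this abstract.

```python
# Hedged sketch: author attribution with fine-tunable BERT classifiers and a
# soft-voting ensemble. Model names and NUM_AUTHORS are assumptions, not the
# paper's actual configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_AUTHORS = 10  # hypothetical number of candidate authors


def load_classifier(model_name: str):
    """Load a pre-trained BERT checkpoint with a fresh classification head.

    The head is randomly initialized, so the model must still be fine-tuned
    on the author-attribution training set before it is useful.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=NUM_AUTHORS
    )
    return tokenizer, model


def predict_probs(tokenizer, model, text: str) -> torch.Tensor:
    """Return a probability distribution over candidate authors for one text."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0)


def ensemble_predict(members, text: str) -> int:
    """Soft-voting ensemble: average member probabilities, then take argmax."""
    probs = torch.stack([predict_probs(tok, mdl, text) for tok, mdl in members])
    return int(probs.mean(dim=0).argmax())


if __name__ == "__main__":
    # Placeholder checkpoints; in practice these would be BERT models
    # pre-trained on different corpora (e.g. Wikipedia, Aozora Bunko, news)
    # and each fine-tuned on the same author-attribution training data.
    member_names = ["cl-tohoku/bert-base-japanese", "another/japanese-bert"]
    members = [load_classifier(name) for name in member_names]
    sample = "ここに著者を推定したい文章が入る。"  # text whose author is unknown
    print("predicted author id:", ensemble_predict(members, sample))
```

Averaging class probabilities is only one simple way to combine the models; majority voting or a learned combiner over the members' outputs would fit the same interface.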