Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
37th (2023)
Session ID : 2D6-GS-3-04

Verifying the Influence of Different Tokenizers in Japanese BERT
*Shuntaro ITO, Daisuke KAWAHARA
Abstract

High accuracy has been achieved on various Japanese language processing tasks by fine-tuning pre-trained Japanese BERT models. Input text for Japanese BERT must be tokenized into words and subwords, but there are various word dictionaries and subword segmentation methods. In this study, we build Japanese BERT models with different tokenizers and examine their effects on the masked language model, a pre-training task, as well as on downstream tasks. We find that differences in tokenizers lead to accuracy differences in both the masked language model and downstream tasks, and that masked language model performance and downstream task performance are not necessarily dependent on each other.
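
As a concrete illustration of the word-plus-subword tokenization pipeline described above, the sketch below tokenizes a Japanese sentence with the Hugging Face transformers BertJapaneseTokenizer, which combines MeCab word segmentation with WordPiece subword segmentation. The library, the checkpoint name (cl-tohoku/bert-base-japanese), and the example sentence are assumptions for illustration and are not taken from this paper.

# Illustrative sketch (not from the paper): word + subword tokenization for Japanese BERT.
# Assumes `transformers`, `fugashi`, and a MeCab dictionary (e.g. ipadic) are installed.
from transformers import AutoTokenizer

# A publicly available Japanese BERT tokenizer that uses MeCab (word dictionary)
# followed by WordPiece (subword segmentation).
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")

text = "自然言語処理の研究を行う。"  # example sentence (illustrative)
tokens = tokenizer.tokenize(text)             # word segmentation, then subword segmentation
ids = tokenizer.convert_tokens_to_ids(tokens) # map tokens to vocabulary IDs

print(tokens)
print(ids)

# Swapping the word dictionary or the subword method produces a different token
# sequence for the same text; this is the variable examined in the study.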

© 2023 The Japanese Society for Artificial Intelligence