Host: The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : Kumamoto, Japan
Date : June 06, 2023 - June 09, 2023
High accuracy has been achieved on various Japanese language processing tasks by fine-tuning pre-trained Japanese BERT models. Input text for Japanese BERT must be tokenized into words and then into subwords, but a variety of word dictionaries and subword tokenization methods are available. In this study, we build Japanese BERT models with different tokenizers and examine how the choice of tokenizer affects the masked language model, which is a pre-training task, as well as downstream tasks. We find that differences in tokenizers lead to differences in accuracy on both the masked language model and downstream tasks, and that masked language model performance and downstream task performance do not necessarily track each other.
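As a concrete illustration of the two-stage (word, then subword) tokenization described above, the following minimal sketch compares publicly released Japanese BERT tokenizers that differ in their subword method, using the Hugging Face transformers library. The cl-tohoku model names are examples of existing Japanese BERT releases, not necessarily the models trained in this study.

```python
# A minimal sketch of Japanese BERT tokenization, assuming the Hugging Face
# transformers library plus the fugashi and ipadic packages (needed for the
# MeCab-based word segmentation these checkpoints use). The model names are
# examples of public Japanese BERT releases, not the models built in this study.
from transformers import AutoTokenizer

text = "日本語の事前学習済みBERTをファインチューニングする。"

for name in [
    "cl-tohoku/bert-base-japanese",       # MeCab (IPAdic dictionary) + WordPiece subwords
    "cl-tohoku/bert-base-japanese-char",  # MeCab (IPAdic dictionary) + character-level subwords
]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # tokenize() first segments the text into words with MeCab, then splits
    # each word into subwords, so the output differs between the two models.
    print(name, tokenizer.tokenize(text))
```

Running this shows how the same sentence is broken into different token sequences depending on the word dictionary and subword method, which is exactly the variation whose effect on pre-training and downstream accuracy this study examines.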