Host: The Japanese Society for Artificial Intelligence
Name : The 37th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 37
Location : Kumamoto, Japan
Date : June 06, 2023 - June 09, 2023
High accuracy has been achieved on various Japanese language processing tasks by fine-tuning pre-trained Japanese BERT models. Input text for Japanese BERT must be tokenized into words and then into subwords, but a variety of word dictionaries and subword tokenization methods are available. In this study, we build Japanese BERT models with different tokenizers and examine how the choice of tokenizer affects the masked language model, which is a pre-training task, as well as downstream tasks. We find that differences in tokenizers lead to differences in accuracy on both the masked language model and downstream tasks, and that masked language model performance and downstream task performance do not necessarily track each other.
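As a concrete illustration of the two-stage (word, then subword) tokenization described above, the following minimal sketch compares publicly released Japanese BERT tokenizers that differ in their subword method, using the Hugging Face transformers library. The cl-tohoku model names are examples of existing Japanese BERT releases, not necessarily the models trained in this study.

```python
# A minimal sketch of Japanese BERT tokenization, assuming the Hugging Face
# transformers library plus the fugashi and ipadic packages (needed for the
# MeCab-based word segmentation these checkpoints use). The model names are
# examples of public Japanese BERT releases, not the models built in this study.
from transformers import AutoTokenizer

text = "日本語の事前学習済みBERTをファインチューニングする。"

for name in [
    "cl-tohoku/bert-base-japanese",       # MeCab (IPAdic dictionary) + WordPiece subwords
    "cl-tohoku/bert-base-japanese-char",  # MeCab (IPAdic dictionary) + character-level subwords
]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    # tokenize() first segments the text into words with MeCab, then splits
    # each word into subwords, so the output differs between the two models.
    print(name, tokenizer.tokenize(text))
```

Running this shows how the same sentence is broken into different token sequences depending on the word dictionary and subword method, which is exactly the variation whose effect on pre-training and downstream accuracy this study examines.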