Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
This study investigated how a tokenizer's segmentation method and vocabulary size affect the BERT language model. Subword tokenizers differ in how they treat word boundaries: some, such as WordPiece applied after morphological analysis, do not produce subwords that cross the boundaries set by a morphological analyzer, whereas others, such as SentencePiece, segment text without regard to semantic boundaries. In domains rich in specialized terminology and compound words, such as medicine, preserving semantic word boundaries may be advantageous. We therefore trained tokenizers that segment text at the word level and tokenizers that segment it into subwords, each with several vocabulary sizes, and used them to pre-train BERT models. The models were then fine-tuned and evaluated on three tasks, JGLUE, Wikipedia named entity recognition, and medical entity extraction, to compare their performance. We also compared models specialized for the medical domain, where compound terms and specialized vocabulary are frequent, to assess the impact of the tokenizer. The results showed that, for medical entity extraction, pre-trained models whose vocabulary was enlarged with a medical domain-specific dictionary outperformed the baseline subword models.
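To illustrate the kind of tokenizer comparison described above, the following is a minimal sketch, not the authors' code, of training WordPiece tokenizers at different vocabulary sizes with the Hugging Face tokenizers library. The corpus path and example sentence are hypothetical, and whitespace pre-tokenization is only a stand-in for the morphological-analyzer boundaries used in the paper; a SentencePiece model trained on the same corpus without pre-tokenization would give the boundary-agnostic counterpart.

```python
# Minimal sketch (assumed setup, not the authors' code): train WordPiece
# tokenizers with different vocabulary sizes and compare their segmentations.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def train_wordpiece(corpus_path: str, vocab_size: int) -> Tokenizer:
    """Train a WordPiece tokenizer whose subwords never cross the
    pre-tokenized (word-level) boundaries."""
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    # Whitespace pre-tokenization stands in for morphological analysis here.
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train([corpus_path], trainer)
    return tokenizer


if __name__ == "__main__":
    # Compare how the same medical phrase is segmented under two vocabulary sizes.
    for size in (8_000, 32_000):
        tok = train_wordpiece("corpus.txt", size)  # "corpus.txt" is a placeholder
        print(size, tok.encode("カテーテルアブレーションを施行した。").tokens)
```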