Host : The Japanese Society for Artificial Intelligence
Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence
Number : 38
Location : [in Japanese]
Date : May 28, 2024 - May 31, 2024
This study investigated how a tokenizer's segmentation method and vocabulary size affect the BERT language model. Subword tokenizers differ in how they treat word boundaries: some, such as WordPiece applied after morphological analysis, do not produce subwords that cross the boundaries set by a morphological analyzer, whereas others, such as SentencePiece, segment text without regard to semantic boundaries. In domains rich in specialized terminology and compound words, such as medicine, preserving semantic word boundaries may be advantageous. We therefore trained tokenizers that segment text at the word level and tokenizers that segment it into subwords, each with several vocabulary sizes, and used them to pre-train BERT models. The models were then fine-tuned and evaluated on three tasks, JGLUE, Wikipedia named entity recognition, and medical entity extraction, to compare their performance. We also compared models specialized for the medical domain, where compound terms and specialized vocabulary are frequent, to assess the impact of the tokenizer. The results showed that, for medical entity extraction, pre-trained models whose vocabulary was enlarged with a medical domain-specific dictionary outperformed the baseline subword models.
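To illustrate the kind of tokenizer comparison described above, the following is a minimal sketch, not the authors' code, of training WordPiece tokenizers at different vocabulary sizes with the Hugging Face tokenizers library. The corpus path and example sentence are hypothetical, and whitespace pre-tokenization is only a stand-in for the morphological-analyzer boundaries used in the paper; a SentencePiece model trained on the same corpus without pre-tokenization would give the boundary-agnostic counterpart.

```python
# Minimal sketch (assumed setup, not the authors' code): train WordPiece
# tokenizers with different vocabulary sizes and compare their segmentations.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers


def train_wordpiece(corpus_path: str, vocab_size: int) -> Tokenizer:
    """Train a WordPiece tokenizer whose subwords never cross the
    pre-tokenized (word-level) boundaries."""
    tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
    # Whitespace pre-tokenization stands in for morphological analysis here.
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.WordPieceTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.train([corpus_path], trainer)
    return tokenizer


if __name__ == "__main__":
    # Compare how the same medical phrase is segmented under two vocabulary sizes.
    for size in (8_000, 32_000):
        tok = train_wordpiece("corpus.txt", size)  # "corpus.txt" is a placeholder
        print(size, tok.encode("カテーテルアブレーションを施行した。").tokens)
```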