Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
38th (2024)
Session ID : 3Xin2-11

Pretraining Language Models with a Variety of Lexica and Tokenizers and Application to the Medical Domain
*Ami SAKANE, Shumpei MURAMATSU, Hiromasa HORIGUCHI, Yoshinobu KANO

Abstract

This study investigated how a tokenizer's segmentation method and vocabulary size affect the language model BERT. Subword tokenizers differ in how they treat word boundaries: some, such as WordPiece applied after morphological analysis, do not cross the morpheme boundaries set by the analyzer, while others, such as SentencePiece, segment text without regard to semantic word boundaries. In domains with specialized terminology and many compound words, such as medicine, preserving semantic word boundaries might be advantageous. We therefore trained word-level and subword-level tokenizers with varying vocabulary sizes and used each of them to pre-train a BERT model. The models were then fine-tuned and evaluated on three tasks (JGLUE, Wikipedia named entity extraction, and medical entity extraction) to compare their performance. In addition, we compared models specialized for the medical domain, which frequently involves compound terms and specialized vocabulary, to assess the impact of the tokenizer. The results showed that, in medical entity extraction, pre-trained models whose vocabulary was enlarged with a medical domain-specific dictionary outperformed the baseline models that used subwords.
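As a rough illustration of the tokenizer comparison described in the abstract, the sketch below trains subword tokenizers with different vocabulary sizes using the Hugging Face tokenizers library and the sentencepiece library. This is not the authors' training code; the corpus path and the vocabulary sizes are illustrative assumptions, and the study's word-level and dictionary-augmented tokenizers would require additional steps not shown here.

```python
# Minimal sketch: train WordPiece and SentencePiece tokenizers with
# several vocabulary sizes for comparison. Assumes a plain-text corpus,
# one sentence per line; path and sizes are hypothetical.
from tokenizers import BertWordPieceTokenizer
import sentencepiece as spm

CORPUS = "pretraining_corpus.txt"   # hypothetical corpus file
VOCAB_SIZES = (32000, 64000)        # assumed sizes for comparison

# WordPiece tokenizer: learns subword units from the corpus;
# in a Japanese setup this would typically follow morphological analysis.
for vocab_size in VOCAB_SIZES:
    wp = BertWordPieceTokenizer(lowercase=False)
    wp.train(files=[CORPUS], vocab_size=vocab_size)
    wp.save_model(".", f"wordpiece_{vocab_size}")

# SentencePiece (unigram) tokenizer: segments raw text directly,
# without respecting word boundaries.
for vocab_size in VOCAB_SIZES:
    spm.SentencePieceTrainer.train(
        input=CORPUS,
        model_prefix=f"sentencepiece_{vocab_size}",
        vocab_size=vocab_size,
        model_type="unigram",
    )
```

Each resulting vocabulary could then be plugged into a separate BERT pre-training run, which is how the comparison in the abstract would be carried out.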

© 2024 The Japanese Society for Artificial Intelligence