Optimizing Multi-Level Tokenization to Improve Downstream Task Accuracy

小田倉 史麿; 若林 啓

doi:10.11517/pjsai.JSAI2022.0_3C4GS603

Abstract

Tokenization is known to affect the accuracy of downstream tasks. Hiraoka et al. proposed optok4at, a method that optimizes tokenization to improve downstream task accuracy. However, because optok4at uses only one type of tokenizer and builds its vocabulary by unsupervised learning, the tokenizer may miss infrequent but important phrases, which can reduce accuracy. In this paper, we propose an optimization method that uses multiple tokenizers to improve downstream task accuracy. The proposed method concatenates the outputs of two tokenizers with different vocabularies and feeds the result to the downstream model. By combining an unsupervised tokenizer with a dictionary-based tokenizer whose vocabulary contains frequent phrases, we attempt to improve the accuracy of downstream tasks. In experiments on several text classification tasks, we confirmed that the proposed method does not improve accuracy, even though it tokenizes phrases.
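The concatenation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: both tokenizers below are simple greedy longest-match stand-ins (the first one does not reproduce optok4at or any unsupervised training), and the function names, the `[SEP]` separator, the toy vocabularies, and the whitespace-delimited English input are all assumptions made for illustration.

```python
from typing import List, Set

SEP = "[SEP]"  # hypothetical separator; the abstract does not say how outputs are joined


def subword_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Stand-in for an unsupervised tokenizer: greedy longest match inside
    each word, falling back to single characters for unknown spans."""
    tokens: List[str] = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first; a single character always matches.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


def phrase_tokenize(text: str, phrases: Set[str], max_len: int = 3) -> List[str]:
    """Stand-in for a dictionary-based tokenizer: greedy longest match over
    word n-grams, so dictionary phrases become single tokens."""
    words = text.split()
    tokens: List[str] = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted as a token.
            if n == 1 or candidate in phrases:
                tokens.append(candidate)
                i += n
                break
    return tokens


def multi_level_tokens(text: str, vocab: Set[str], phrases: Set[str]) -> List[str]:
    """Concatenate the two token sequences into one input for a downstream model."""
    return subword_tokenize(text, vocab) + [SEP] + phrase_tokenize(text, phrases)


if __name__ == "__main__":
    toy_vocab = {"new", "york", "is", "bi"}  # toy subword vocabulary
    toy_phrases = {"new york"}               # toy phrase dictionary
    print(multi_level_tokens("new york is big", toy_vocab, toy_phrases))
```

In this toy setup the phrase "new york" is split by the subword stand-in but kept whole by the dictionary stand-in, so the downstream model would see both segmentations of the same text.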
