Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
36th (2022)
Session ID : 3C4-GS-6-03

Optimization of Multi-level Tokenization for Improving Accuracy of Downstream Tasks
*Fumimaro ODAKURA, Kei WAKABAYASHI
CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

Tokenization is known to affect the accuracy of downstream tasks. Hiraoka et al. proposed optok4at, a method that optimizes tokenization to improve downstream-task accuracy. However, because optok4at uses only one type of tokenizer, whose vocabulary is built by unsupervised learning, the tokenizer may miss infrequent but important phrases, which can reduce accuracy. In this paper, we propose an optimization method that uses multiple tokenizers to improve the accuracy of downstream tasks. The proposed method concatenates the outputs of two tokenizers with different vocabularies and feeds the result to the downstream model. By combining an unsupervised tokenizer with a dictionary-based tokenizer whose vocabulary contains frequent phrases, we attempt to improve downstream-task accuracy. In experiments on several text classification tasks, we found that the proposed method did not improve accuracy, even though it did tokenize phrases.
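
The concatenation step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the greedy longest-match segmenter merely stands in for both the unsupervised and the dictionary-based tokenizer, the joint optimization performed by optok4at is omitted entirely, and the names `greedy_tokenize`, `multi_level_tokenize`, and the `[SEP]` separator are hypothetical choices made only for illustration.

```python
from typing import List, Set

SEP = "[SEP]"  # hypothetical separator; the abstract does not specify one


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Longest-match-first segmentation.

    Characters not covered by any vocabulary entry become
    single-character tokens.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try the longest candidate first, down to length 2.
        for j in range(len(text), i + 1, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        tokens.append(match)
        i += len(match)
    return tokens


def multi_level_tokenize(
    text: str, subword_vocab: Set[str], phrase_vocab: Set[str]
) -> List[str]:
    """Concatenate the outputs of two tokenizers with different vocabularies."""
    # Stand-in for the unsupervised (subword-level) tokenizer.
    subword_tokens = greedy_tokenize(text, subword_vocab)
    # Stand-in for the dictionary-based tokenizer: phrases take priority,
    # with the subword vocabulary covering the rest (an assumption).
    phrase_tokens = greedy_tokenize(text, phrase_vocab | subword_vocab)
    return subword_tokens + [SEP] + phrase_tokens


if __name__ == "__main__":
    subwords = {"new", "york", "city"}
    phrases = {"newyork"}
    print(multi_level_tokenize("newyorkcity", subwords, phrases))
```

In this toy setting the downstream model would see both the fine-grained segmentation and the phrase-level one in a single input sequence; how the two sequences are actually joined and embedded in the paper is not stated in the abstract.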

© 2022 The Japanese Society for Artificial Intelligence