Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper
Optimizing Word Segmentation for Downstream Tasks by Weighting Text Vector
Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

2021 Volume 28 Issue 2 Pages 479-507

Abstract

In traditional NLP, a sentence is tokenized as a preprocessing step, so the tokenization is determined independently of the downstream task. To address this issue, we propose a novel method that explores an appropriate tokenization for the downstream task. Our proposed method, Optimizing Tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task's loss. OpTok can be used for any downstream task that represents a sentence as a vector, such as text classification. Experimental results demonstrate that OpTok improves performance on sentiment analysis, genre prediction, rating prediction, and textual entailment. The results also show that the proposed method is applicable to Chinese, Japanese, and English. In addition, we incorporate OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect on its performance.
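The core idea of weighting a text vector by tokenization probabilities can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that candidate tokenizations of a sentence, their sentence vectors, and their log-probabilities (e.g., from a unigram language model) are already available, and the function name is hypothetical.

```python
import math

def weighted_sentence_vector(cand_vecs, cand_logps):
    """Combine sentence vectors from candidate tokenizations into one
    vector, weighted by the softmax-normalized probability of each
    tokenization. Training the downstream loss through these weights
    would push probability mass toward task-appropriate tokenizations
    (a sketch of the OpTok idea, not the authors' code)."""
    # Softmax over tokenization log-probabilities (max-shifted for stability).
    m = max(cand_logps)
    ws = [math.exp(lp - m) for lp in cand_logps]
    z = sum(ws)
    ws = [w / z for w in ws]
    # Probability-weighted mixture of the candidate sentence vectors.
    dim = len(cand_vecs[0])
    return [sum(w * v[i] for w, v in zip(ws, cand_vecs)) for i in range(dim)]

# Two candidate tokenizations with equal log-probability contribute equally.
vec = weighted_sentence_vector([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In the full method, the tokenization probabilities are parameters of a neural language model, so gradients from the downstream classifier update both the classifier and the tokenizer jointly.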

© 2021 The Association for Natural Language Processing