Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper
Optimizing Word Segmentation for Downstream Tasks by Weighting Text Vector
Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

2021 Volume 28 Issue 2 Pages 479-507

Abstract

In traditional NLP, a sentence is tokenized as a preprocessing step, so the tokenization is determined independently of the downstream task. To address this issue, we propose a novel method that explores an appropriate tokenization for the downstream task. Our proposed method, Optimizing Tokenization (OpTok), is trained to assign a high probability to such an appropriate tokenization based on the downstream task's loss. OpTok can be used for any downstream task that represents a sentence as a vector, such as text classification. Experimental results demonstrate that OpTok improves performance on sentiment analysis, genre prediction, rating prediction, and textual entailment. The results also show that the proposed method is applicable to Chinese, Japanese, and English. In addition, we incorporate OpTok into BERT, a state-of-the-art contextualized embedding model, and report a positive effect on its performance.
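The core idea of weighting a text vector by tokenization probabilities can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that candidate tokenizations of a sentence, their sentence vectors, and their log-probabilities (e.g., from a unigram language model) are already available, and the function name is hypothetical.

```python
import math

def weighted_sentence_vector(cand_vecs, cand_logps):
    """Combine sentence vectors from candidate tokenizations into one
    vector, weighted by the softmax-normalized probability of each
    tokenization. Training the downstream loss through these weights
    would push probability mass toward task-appropriate tokenizations
    (a sketch of the OpTok idea, not the authors' code)."""
    # Softmax over tokenization log-probabilities (max-shifted for stability).
    m = max(cand_logps)
    ws = [math.exp(lp - m) for lp in cand_logps]
    z = sum(ws)
    ws = [w / z for w in ws]
    # Probability-weighted mixture of the candidate sentence vectors.
    dim = len(cand_vecs[0])
    return [sum(w * v[i] for w, v in zip(ws, cand_vecs)) for i in range(dim)]

# Two candidate tokenizations with equal log-probability contribute equally.
vec = weighted_sentence_vector([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In the full method, the tokenization probabilities are parameters of a neural language model, so gradients from the downstream classifier update both the classifier and the tokenizer jointly.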

© 2021 The Association for Natural Language Processing