Journal of Information Processing
Online ISSN : 1882-6652
ISSN-L : 1882-6652
Effects and Mitigation of Out-of-vocabulary in Universal Language Models
Sangwhan Moon, Naoaki Okazaki

2021, Vol. 29, pp. 490-503

Abstract

One of the most important recent trends in natural language processing (NLP) is transfer learning: using representations from a neural language model to perform other tasks. While transfer learning is a promising and robust method, downstream task performance depends on the robustness of the backbone model's vocabulary, which in turn reflects both the positive and negative characteristics of the corpus used to train it. With subword tokenization, out-of-vocabulary (OOV) is generally assumed to be a solved problem. Still, in languages with a large alphabet such as Chinese, Japanese, and Korean (CJK), this assumption does not hold. In our work, we demonstrate the adverse effects of OOV in the context of transfer learning in CJK languages, then propose a novel approach to maximize the utility of a pre-trained model suffering from OOV. Additionally, we further investigate the correlation of OOV with task performance and explore whether and how mitigation can salvage a model with high OOV.
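The abstract's central claim (subword tokenization does not eliminate OOV for large-alphabet scripts) can be illustrated with a toy sketch, not the paper's actual method: a greedy longest-match subword tokenizer whose vocabulary, like one built from a Latin-script-heavy corpus, can always fall back to single Latin characters, but maps any CJK character absent from the vocabulary to an unknown token. All names and the vocabulary here are hypothetical.

```python
# Toy illustration of subword OOV in large-alphabet languages.
# The vocabulary covers all lowercase Latin letters plus a few merges,
# but only two specific CJK characters; any other CJK character has no
# decomposition available and must become <unk>.

VOCAB = set("abcdefghijklmnopqrstuvwxyz") | {
    "<unk>", "trans", "fer",  # example learned merges
    "日", "本",               # only two CJK characters are covered
}

def tokenize(text, vocab):
    """Greedy longest-match subword tokenization with <unk> fallback."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:  # not even a single-character match: true OOV
            tokens.append("<unk>")
            i += 1
    return tokens

print(tokenize("transfer", VOCAB))  # ['trans', 'fer'] -- fully covered
print(tokenize("日本語", VOCAB))     # ['日', '本', '<unk>'] -- '語' is OOV
```

A Latin-script word can always degrade gracefully to character-level tokens, so its information is preserved; the rare CJK character is lost entirely, which is the adverse effect the paper examines.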

© 2021 by the Information Processing Society of Japan