Journal of Information Processing
Online ISSN : 1882-6652
ISSN-L : 1882-6652
Effects and Mitigation of Out-of-vocabulary in Universal Language Models
Sangwhan Moon, Naoaki Okazaki

2021 Volume 29 Pages 490-503


One of the most important recent natural language processing (NLP) trends is transfer learning - using representations from language models implemented through a neural network to perform other tasks. While transfer learning is a promising and robust method, downstream task performance in transfer learning depends on the robustness of the backbone model's vocabulary, which in turn represents both the positive and negative characteristics of the corpus used to train it. With subword tokenization, out-of-vocabulary (OOV) is generally assumed to be a solved problem. Still, in languages with a large alphabet such as Chinese, Japanese, and Korean (CJK), this assumption does not hold. In our work, we demonstrate the adverse effects of OOV in the context of transfer learning in CJK languages, then propose a novel approach to maximize the utility of a pre-trained model suffering from OOV. Additionally, we further investigate the correlation of OOV to task performance and explore if and how mitigation can salvage a model with high OOV.
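To illustrate the OOV issue the abstract describes (not the paper's actual method), here is a minimal sketch of greedy longest-match subword tokenization with an `[UNK]` fallback. The vocabularies and example strings are hypothetical: in a small-alphabet language such as English, a subword vocabulary easily covers unseen words via pieces, whereas in a large-alphabet CJK script, any character absent from the training corpus becomes `[UNK]`.

```python
def tokenize(text, vocab, unk="[UNK]"):
    """Greedy longest-match subword tokenization with an [UNK] fallback.

    Illustrative toy implementation; real tokenizers (e.g., WordPiece/BPE)
    differ in detail, but the OOV failure mode is the same.
    """
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No subword (not even a single character) covers this position:
            # the symbol is out-of-vocabulary.
            tokens.append(unk)
            i += 1
    return tokens


# Hypothetical English subword vocabulary: unseen words decompose into pieces.
print(tokenize("unbelievable", {"un", "believ", "able"}))
# Hypothetical CJK vocabulary: a character missing from the training corpus
# cannot be decomposed further, so it surfaces as [UNK].
print(tokenize("日本語学", {"日本", "語"}))
```

Because CJK scripts have thousands of distinct characters, guaranteeing single-character coverage in the vocabulary is much harder than for a 26-letter alphabet, which is the gap the paper's mitigation targets.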

© 2021 by the Information Processing Society of Japan