Auditing time-series performance degradation has become a challenge as researchers and practitioners increasingly rely on pre-trained models. Pre-trained language models typically incur huge training and inference costs; therefore, efficient auditing and retraining schemes are important. This study proposes a framework for auditing the time-series performance degradation of pre-trained language models and word embeddings by calculating the semantic shift of words in the training corpus, thereby supporting decision-making about retraining. First, we constructed RoBERTa and word2vec models with training corpora from different periods, using Japanese and English news articles from 2011 to 2021, and observed their time-series performance degradation. Semantic Shift Stability, a metric calculated from the diachronic semantic shift of words in the training corpus, was smaller when the performance of the pre-trained models degraded significantly over time, confirming that the metric is useful for monitoring applications. The proposed framework also helps infer the cause of degradation by examining words whose meanings changed significantly; the experiments suggested effects of the 2016 U.S. presidential election and the 2020 COVID-19 pandemic. The source code is available at https://github.com/Nikkei/semantic-shift-stability.
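
As a rough illustration of the kind of computation involved (a minimal sketch, not the authors' implementation; the actual code is in the repository linked above), one common way to quantify diachronic semantic shift is to train a word2vec model per corpus period, align the two embedding spaces with orthogonal Procrustes, and average the cosine similarity of shared word vectors, so that a lower average indicates a larger shift between periods. The model file names and the function name below are hypothetical.

```python
"""Sketch: a stability-style score between two corpus periods, assuming
gensim word2vec models trained separately on each period."""
import numpy as np
from gensim.models import Word2Vec


def stability_score(model_a: Word2Vec, model_b: Word2Vec) -> float:
    # Vocabulary shared by both periods
    shared = [w for w in model_a.wv.key_to_index if w in model_b.wv.key_to_index]

    # Stack and L2-normalize the shared word vectors
    A = np.stack([model_a.wv[w] for w in shared])
    B = np.stack([model_b.wv[w] for w in shared])
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    B /= np.linalg.norm(B, axis=1, keepdims=True)

    # Orthogonal Procrustes: rotate B's space onto A's space
    u, _, vt = np.linalg.svd(B.T @ A)
    B_aligned = B @ (u @ vt)

    # Mean per-word cosine similarity after alignment
    return float(np.mean(np.sum(A * B_aligned, axis=1)))


if __name__ == "__main__":
    # Hypothetical models trained on articles from two different years
    m2019 = Word2Vec.load("w2v_2019.model")
    m2020 = Word2Vec.load("w2v_2020.model")
    print("stability:", stability_score(m2019, m2020))
```

In such a sketch, the words with the lowest post-alignment cosine similarity are natural candidates for the cause analysis described above, e.g., terms whose usage changed around the 2016 U.S. presidential election or the COVID-19 pandemic.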