Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
Semantic Shift Stability: Auditing Time-Series Performance Degradation of Pre-trained Models via Semantic Shift of Words in Training Corpus
Shotaro Ishihara, Hiromu Takahashi, Hono Shirai

2024 Volume 31 Issue 4 Pages 1563-1597

Abstract

Auditing time-series performance degradation has become a challenge as researchers and practitioners increasingly rely on pre-trained models. Pre-trained language models typically incur large training and inference costs; therefore, efficient auditing and retraining schemes are important. This study proposes a framework for auditing the time-series performance degradation of pre-trained language models and word embeddings by calculating the semantic shift of words in the training corpus, thereby supporting decision-making about retraining. First, we constructed RoBERTa and word2vec models with training corpora from different periods, using Japanese and English news articles from 2011 to 2021, and observed time-series performance degradation. Semantic Shift Stability, a metric computed from the diachronic semantic shift of words in the training corpus, was smaller when the performance of the pre-trained models degraded significantly over time, confirming that the metric is useful for monitoring applications. The proposed framework has the further advantage of helping infer the cause of degradation from the words whose meanings changed most. The experiments suggested effects of the 2016 U.S. presidential election and the 2020 COVID-19 pandemic. The source code is available at https://github.com/Nikkei/semantic-shift-stability.
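As a rough illustration of the idea summarized in the abstract, the sketch below derives a stability-style score from two word2vec models trained on corpora from different periods. It is a minimal sketch based only on the abstract, not the paper's implementation (see the repository above): the orthogonal Procrustes alignment, the gensim KeyedVectors interface, and all function names are assumptions made for illustration.

```python
# Minimal sketch: compare word2vec spaces from two corpus periods by aligning
# them and averaging per-word cosine similarity. All names here are
# illustrative assumptions, not the paper's actual API.
import numpy as np
from gensim.models import KeyedVectors


def semantic_shift_scores(kv_old: KeyedVectors, kv_new: KeyedVectors) -> dict:
    """Return {word: cosine similarity} between two aligned embedding spaces."""
    shared = [w for w in kv_old.index_to_key if w in kv_new.key_to_index]
    a = np.stack([kv_old[w] for w in shared])  # vectors from the older period
    b = np.stack([kv_new[w] for w in shared])  # vectors from the newer period

    # Orthogonal Procrustes: rotate the old space onto the new one so that
    # word vectors from the two periods become directly comparable.
    u, _, vt = np.linalg.svd(a.T @ b)
    a_aligned = a @ (u @ vt)

    # Per-word cosine similarity after alignment; low values flag words whose
    # usage shifted the most between the two periods.
    cos = np.sum(a_aligned * b, axis=1) / (
        np.linalg.norm(a_aligned, axis=1) * np.linalg.norm(b, axis=1)
    )
    return dict(zip(shared, cos))


def stability_score(scores: dict) -> float:
    """A corpus-level stability value: the mean per-word similarity."""
    return float(np.mean(list(scores.values())))
```

In a monitoring setting along these lines, one would train embeddings on consecutive periods (e.g., yearly news corpora), track the corpus-level score over time as a retraining signal, and inspect the lowest-scoring words to help explain drops, such as vocabulary affected by an election or a pandemic.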

© 2024 The Association for Natural Language Processing