Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
37th (2023)
Session ID : 3E5-GS-2-02

Proposal for a method to predict growth of classification performance in response to increasing the amount of data for fine-tuning of Pre-Trained language models
*Toshiki KURAMOTO, Jun SUZUKI
Abstract

Recently, pre-trained models based on large corpora have been developed and released, and opportunities are expanding to use them, via fine-tuning with training data specific to the problem to be solved, to analyze linguistic data such as product reviews for business purposes. In business settings, however, available datasets are not always plentiful due to various constraints, and it is not easy to determine how much data is needed to achieve a target performance. This paper proposes a method to estimate the amount of data required to reach a target performance by predicting how classification performance grows as the amount of additional training data increases, based on the classification performance of a model fine-tuned on the few hundred to one thousand examples initially available. Specifically, we show that when a pre-trained model is fine-tuned, classification performance improves with a similar trend as the number of epochs increases, regardless of the original dataset size. We then verify that an approximate formula based on this tendency can estimate the classification performance obtained when the model is trained with ten times or more training data, even when the initial additional training data is limited.
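The abstract does not specify the form of the approximate formula, so as an illustrative assumption the sketch below models classification error as a power law in dataset size, error(n) = b * n^(-c), a common shape for learning curves. It fits b and c by linear least squares in log-log space from a few small-data measurements, then inverts the curve to estimate how many examples would be needed to reach a target error. All function names and the synthetic data are hypothetical, not from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Fit error(n) = b * n^(-c) by least squares on
    log(error) = log(b) - c * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # equals -c
    intercept = my - slope * mx       # equals log(b)
    return math.exp(intercept), -slope

def predict_error(n, b, c):
    """Extrapolated classification error at dataset size n."""
    return b * n ** (-c)

def data_needed(target_error, b, c):
    """Smallest n whose predicted error is at or below the target."""
    return math.ceil((b / target_error) ** (1.0 / c))

# Hypothetical early measurements: error rates of models fine-tuned
# on 100 to 1000 examples (synthetic values following b=2, c=0.5).
sizes = [100, 200, 500, 1000]
errors = [2 * n ** -0.5 for n in sizes]
b, c = fit_power_law(sizes, errors)
print(b, c)                       # fitted curve parameters
print(predict_error(10000, b, c)) # extrapolate to 10x more data
print(data_needed(0.02, b, c))    # examples needed for 2% error
```

A power law is only one plausible choice; the paper's finding that performance grows with a similar trend across dataset sizes is what justifies fitting any fixed-form curve on small-data results and extrapolating it to ten times more data.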

© 2023 The Japanese Society for Artificial Intelligence