ChatGPT-Based Adaptive Data Augmentation for Japanese Multi-Label Document Classification in the Medical Domain

坪田 匡史

doi:10.11517/pjsai.JSAI2024.0_2G6GS603

Abstract

Multi-label text classification is a common task in the medical domain. However, preparing an annotated training dataset is costly because manual annotation is laborious and requires extensive domain-specific knowledge. Here we introduce an automated data augmentation method that uses ChatGPT to generate new training data from the ground-truth data (the NTCIR-13 MedWeb Japanese corpus). The method is adaptive because it uses a baseline BERT model, fine-tuned on the ground-truth dataset, to actively filter the generated training data. The final model, trained on the merged ground-truth and augmented data, showed a 2.4% improvement in F1 score over the baseline model. The proposed algorithms can help solve multi-label classification problems in the medical domain.
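The filter-and-merge step described above can be illustrated with a minimal sketch. The abstract does not state the filtering criterion, so the rule used here (keep a generated example only when the baseline model's thresholded predictions match the labels the example was generated for) is an assumption, as are the names `filter_generated`, `merge_datasets`, `toy_predict_proba`, and the 0.5 threshold. The toy keyword predictor merely stands in for the fine-tuned BERT baseline.

```python
from typing import Callable, List, Sequence, Tuple

# (text, multi-hot label vector); hypothetical representation for illustration
Example = Tuple[str, Tuple[int, ...]]


def filter_generated(
    generated: Sequence[Example],
    predict_proba: Callable[[str], Sequence[float]],
    threshold: float = 0.5,
) -> List[Example]:
    """Keep generated examples whose intended labels agree with the baseline.

    ASSUMPTION: exact agreement is an illustrative criterion, not
    necessarily the one used in the paper.
    """
    kept: List[Example] = []
    for text, labels in generated:
        probs = predict_proba(text)
        predicted = tuple(int(p >= threshold) for p in probs)
        if predicted == tuple(labels):
            kept.append((text, tuple(labels)))
    return kept


def merge_datasets(
    ground_truth: Sequence[Example], augmented: Sequence[Example]
) -> List[Example]:
    """Concatenate ground truth and augmented data, skipping duplicate texts."""
    seen = set()
    merged: List[Example] = []
    for text, labels in list(ground_truth) + list(augmented):
        if text not in seen:
            seen.add(text)
            merged.append((text, tuple(labels)))
    return merged


def toy_predict_proba(text: str) -> List[float]:
    """Stand-in for the fine-tuned baseline: two labels (fever, cough)."""
    return [
        0.9 if "fever" in text else 0.1,
        0.9 if "cough" in text else 0.1,
    ]


ground_truth = [("I have a fever", (1, 0))]
generated = [
    ("My fever will not go down", (1, 0)),  # baseline agrees with labels
    ("The weather is nice today", (0, 1)),  # baseline disagrees with labels
    ("I have a fever", (1, 0)),             # duplicates a ground-truth text
]

filtered = filter_generated(generated, toy_predict_proba)
merged = merge_datasets(ground_truth, filtered)
```

In a real pipeline, `predict_proba` would wrap the baseline model's per-label sigmoid outputs, and the merged dataset would be used to fine-tune the final model.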
