Construction of a Training Dataset for Japanese Medical Text Simplification

堀口 航輝; 梶原 智之; 二宮 崇; 若宮 翔子; 荒牧 英治

doi:10.11517/pjsai.JSAI2024.0_3S1OS7b04

38th (2024)

Session ID : 3S1-OS-7b-04

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 38th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 38

Date : May 28, 2024 - May 31, 2024

Training Dataset for Japanese Simplification in Medical Domain

*Koki HORIGUCHI, Tomoyuki KAJIWARA, Takashi NINOMIYA, Shoko WAKAMIYA, Eiji ARAMAKI

Keywords: Medical NLP, Text Simplification, Parallel Corpus Mining

CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

We release a large-scale parallel corpus for medical text simplification in Japanese. The corpus can be used to train text simplification models that paraphrase medical terms into expressions that patients can understand without effort. To address the low-resource problem for this task in Japanese, we automatically extracted 17,300 semantically equivalent sentence pairs from the professional and consumer versions of articles in online medical dictionaries. We compared several Japanese sentence embedding models and extracted simplification sentence pairs from article pairs by embedding-based bipartite graph matching. Experimental results on Japanese text simplification tasks in four domains showed that models trained on our medical text simplification corpus achieved high performance in medical domains.
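
The embedding-based bipartite graph matching mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matching algorithm, the similarity measure, or any threshold, so everything below is an assumption made for illustration: the function name `match_sentences`, the use of cosine similarity, the `threshold` value of 0.8, and the toy three-dimensional vectors that stand in for real sentence embeddings. The matching is found by exhaustive search, which is only practical for a handful of sentences; a realistic pipeline would more likely use an assignment solver such as `scipy.optimize.linear_sum_assignment`.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_sentences(pro_vecs, con_vecs, threshold=0.8):
    """One-to-one maximum-weight matching between two sets of sentence vectors.

    Returns sorted (professional_index, consumer_index, similarity) triples,
    keeping only matched pairs whose similarity reaches `threshold`
    (a hypothetical cut-off, not taken from the paper).
    """
    # Put the shorter side on the rows so every row receives a distinct column.
    swap = len(pro_vecs) > len(con_vecs)
    rows, cols = (con_vecs, pro_vecs) if swap else (pro_vecs, con_vecs)
    sim = [[cosine(r, c) for c in cols] for r in rows]

    # Exhaustive search over all one-to-one assignments (toy-sized inputs only).
    best_total, best_perm = -math.inf, ()
    for perm in permutations(range(len(cols)), len(rows)):
        total = sum(sim[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm

    pairs = []
    for i, j in enumerate(best_perm):
        if sim[i][j] >= threshold:
            # Always report indices as (professional, consumer).
            pairs.append((j, i, sim[i][j]) if swap else (i, j, sim[i][j]))
    return sorted(pairs)


# Toy vectors standing in for sentence embeddings of one article pair.
professional = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
consumer = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]

pairs = match_sentences(professional, consumer)
for pro_idx, con_idx, score in pairs:
    print(f"professional[{pro_idx}] <-> consumer[{con_idx}] (cosine {score:.3f})")
```

In this toy example the third professional sentence has no counterpart in the consumer article and stays unmatched, which is the behaviour one would want when the two versions of an article differ in length.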
