Journal of Advanced Computational Intelligence and Intelligent Informatics
Online ISSN : 1883-8014
Print ISSN : 1343-0130
ISSN-L : 1883-8014
Regular Papers
Improving Domain-Specific NER in the Indonesian Language Through Domain Transfer and Data Augmentation
Siti Oryza Khairunnisa, Zhousi Chen, Mamoru Komachi
JOURNAL OPEN ACCESS

2024 Volume 28 Issue 6 Pages 1299-1312

Abstract

Named entity recognition (NER) research usually focuses on general domains, and specific domains in languages other than English have rarely been explored. In Indonesian NER, the available resources for specific domains are scarce and small in scale. Building a large dataset is time-consuming and costly, whereas building a small one is practical. Motivated by this, we contribute to specific-domain NER in the Indonesian language by providing a small-scale specific-domain NER dataset, IDCrossNER, which is created semi-automatically via automatic translation and projection from English, with manual correction for realistic Indonesian localization. With this dataset, we perform the following analyses: (1) cross-domain transfer learning from general domains and specific-domain augmentation utilizing GPT models to improve performance on small-scale datasets, and (2) an evaluation of supervised approaches (i.e., in- and cross-domain learning) vs. GPT-4o on IDCrossNER. Our findings are as follows. (1) Cross-domain transfer learning is effective. However, on the general-domain side, performance is more sensitive to the size of the pretrained language model (PLM) than to the size and quality of the source dataset; on the specific-domain side, the improvement from GPT-based data augmentation becomes significant when only limited source data and a small PLM are available. (2) The evaluation of GPT-4o on IDCrossNER demonstrates that it is a powerful tool for specific-domain Indonesian NER in a few-shot setting, although it underperforms in a zero-shot setting. Our dataset is publicly available at https://github.com/khairunnisaor/idcrossner.


© 2024 Fuji Technology Press Ltd.

This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license (https://creativecommons.org/licenses/by-nd/4.0/).
The journal is fully Open Access under Creative Commons licenses, and all articles are free to access on the JACIII official website.
https://www.fujipress.jp/jaciii/jc-about/#https://creativecommons.org/licenses/by-nd