2024 Volume 31 Issue 3 Pages 984-1014
Named entity recognition (NER) is a fundamental task in natural language processing. However, traditional NER methods require large amounts of supervised data, so they cannot respond flexibly to real-world demands such as extracting categories at whatever granularity a user requires. Weakly supervised NER, which uses the contexts in which known words occur as pseudo-data, can meet the demand for a variety of categories when combined with a large-scale thesaurus. Previous studies on weakly supervised NER have proposed learning methods that are robust to errors in the pseudo-supervised data. However, models trained with such methods suffer from a side effect: their predictions cross the boundary between categories of interest and categories that are not of interest. To mitigate this shortcoming, we propose a method that uses all categories in the thesaurus, not just those demanded by users, for pseudo-data generation, and we empirically demonstrate the usefulness of the holistic knowledge contained in the thesaurus.
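To make the idea of thesaurus-based pseudo-data concrete, the following is a minimal sketch of dictionary matching that labels a sentence either with the user's target categories only or with every category in the thesaurus. It is an illustration under stated assumptions, not the method evaluated in the paper: the toy `THESAURUS`, the function `pseudo_label`, the flag `use_all_categories`, the greedy longest-match strategy, and the BIO tag scheme are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set

# Hypothetical toy thesaurus mapping a lowercased term to its category.
# These entries are illustrative and do not come from the paper.
THESAURUS: Dict[str, str] = {
    "aspirin": "Drug",
    "ibuprofen": "Drug",
    "headache": "Symptom",
    "new york": "Location",
}


def pseudo_label(
    tokens: List[str],
    thesaurus: Dict[str, str],
    target_categories: Set[str],
    use_all_categories: bool,
) -> List[str]:
    """Assign BIO pseudo-labels by greedy longest-match dictionary lookup.

    With use_all_categories=False, only terms in target_categories are
    labeled and every other thesaurus term is left as "O".
    With use_all_categories=True, every thesaurus term is labeled with
    its own category, so non-target entities stay distinguishable.
    """
    max_len = max(len(term.split()) for term in thesaurus)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n]).lower()
            category = thesaurus.get(span)
            if category is None:
                continue
            if use_all_categories or category in target_categories:
                labels[i] = f"B-{category}"
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{category}"
            i += n  # skip past the matched span either way
            matched = True
            break
        if not matched:
            i += 1
    return labels


tokens = "Aspirin relieved her headache in New York".split()
targets = {"Drug"}

target_only = pseudo_label(tokens, THESAURUS, targets, use_all_categories=False)
all_categories = pseudo_label(tokens, THESAURUS, targets, use_all_categories=True)

for token, a, b in zip(tokens, target_only, all_categories):
    print(f"{token:10s} {a:12s} {b}")
```

In the target-only labeling, "headache" and "New York" are tagged `O`, exactly like ordinary words, so a model trained on it sees no signal separating them from the target category. Labeling every thesaurus category gives such mentions their own tags, which is one plausible way the holistic knowledge in a thesaurus could discourage predictions that cross into categories outside the user's interest.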