Morphological Analysis of Historical Japanese Text

小木曽 智信; 小町 守; 松本 裕治

doi:10.5715/jnlp.20.727

Abstract

Constructing a richly annotated diachronic corpus of Japanese requires morphological analysis of historical Japanese text. However, conventional analyzers cannot process old Japanese texts with adequate accuracy. To make such analysis possible, we extended the dictionary entries of UniDic for Contemporary Japanese and prepared training corpora covering articles in the literary style of the Meiji Era and literature of the Heian Era, thus creating two new dictionaries: “UniDic-MLJ (Modern Literary Japanese)” and “UniDic-EMJ (Early Middle Japanese).” These dictionaries achieve high accuracy (96–97%), which is sufficient for constructing a diachronic corpus of Japanese. Moreover, we investigated the optimal size of the training corpus for morphological analysis of historical Japanese text on the basis of learning curves obtained with these dictionaries. We confirmed that a 50,000-word corpus achieves an adequate accuracy of over 95%, and that even a small corpus of only 5,000 words is effective as long as it is constructed specifically for the target domain.

Licensed under CC BY 4.0
https://creativecommons.org/licenses/by/4.0/
