Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Paper
Morphological Analysis of Historical Japanese Text
Toshinobu Ogiso, Mamoru Komachi, Yuji Matsumoto
JOURNAL FREE ACCESS

2013 Volume 20 Issue 5 Pages 727-748

Abstract

To construct a richly annotated diachronic corpus of Japanese, morphological analysis of historical Japanese text is required. However, conventional morphological analysis does not reach adequate accuracy on old Japanese texts. To make such analysis feasible, we extended the dictionary entries of UniDic for Contemporary Japanese and prepared training corpora including articles written in the literary style of the Meiji Era and literature of the Heian Era, thereby creating two new dictionaries: “UniDic-MLJ (Modern Literary Japanese)” and “UniDic-EMJ (Early Middle Japanese).” These dictionaries achieve the high accuracy (96–97%) required for constructing a diachronic corpus of Japanese. Moreover, we investigated the optimal size of the training corpus for the morphological analysis of historical Japanese text on the basis of learning curves obtained with these dictionaries. We confirmed that a 50,000-word corpus achieves an adequate accuracy of over 95%, and that even a small corpus of only 5,000 words is effective as long as it is constructed specifically for the target domain.

© 2013 The Association for Natural Language Processing