文書走査を用いた複合名詞解析

久光 徹; 新田 義彦

doi:10.5715/jnlp.5.4_35

Abstract

Compound nouns tend to be important words because a compound noun conveys a lot of information which can even summarize a document. Therefore the analysis of compound nouns can contribute to machine translation, information extraction, or information retrieval. Since compound nouns lack syntactic clues, existing methods have utilized manually written rules and thesauri in order to analyze word dependency structure in compound nouns. Consequently the methods lack robustness in treating open corpora such as newspaper articles which contain a number of unregistered words. This paper presents a thesaurus-free corpus-based approach which scans a corpus with a set of templates and extracts co-occurrence data of the nouns which construct the compound noun. Unregistered words such as abbreviations and short compound nouns are detected in the process of template-matching and the co-occurrence data of the newly found words are additionally extracted, which leads to the robustness and high accuracy of the analysis. The accuracy of the methodwas evaluated using 400 compound nouns of length 5, 6, 7, and 8. The numbers of the correct analysis were 90, 86, 84, and 84 in 100 compound nouns of length 5, 6, 7, and 8 respectively.

Content from these authors

Favorites & Alerts

Corresponding author

Register with J-STAGE for free!