Extraction of named entitiy classes and their relationships from large corpora often involves morphological analysis of target sentences and tends to suffer from out-of-vocabulary words. In this paper we propose a semantic category extraction algorithm called Monaka and its graph-based extention g-Monaka, both of which use character n-gram based patterns as context to directly extract semantically related instances from unsegmented Japanese text. These algorithms also use ``bidirectional adjacent constraints,'' which states that reliable instances should be placed in between reliable left and right context patterns, in order to improve proper segmentation. Monaka algorithms uses iterative induction of instaces and pattens similarly to the bootstrapping algorithm Espresso. The g-Monaka algorithm further formalizes the adjacency relation of character n-grams as a directed graph and applies von Neumann kernel and Laplacian kernel so that the negative effect of semantic draft, i.e., a phenomenon of semantically unrelated general instances being extracted, is reduced. The experiments show that g-Monaka substantially increases the performance of semantic category acquisition compared to conventional methods, including distributional similarity, bootstrapping-based Espresso, and its graph-based extension g-Espresso, in terms of F-value of the NE category task from unsegmented Japanese newspaper articles.
2011 JSAI (The Japanese Society for Artificial Intelligence)