Transactions of the Japanese Society for Artificial Intelligence
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
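Original Paper
Incremental Acquisition of Lexical Knowledge from Unsegmented Text Using Graph Kernels
Masato Hagiwara, Yasuhiro Ogawa, Katsuhiko Toyama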
Journal, free access

2011, Volume 26, Issue 3, pp. 440-450

Abstract

Extraction of named entity classes and their relationships from large corpora often involves morphological analysis of target sentences and tends to suffer from out-of-vocabulary words. In this paper we propose a semantic category extraction algorithm called Monaka and its graph-based extension g-Monaka, both of which use character n-gram based patterns as context to directly extract semantically related instances from unsegmented Japanese text. These algorithms also use "bidirectional adjacent constraints," which state that reliable instances should be placed between reliable left and right context patterns, in order to improve segmentation. The Monaka algorithms use iterative induction of instances and patterns, similarly to the bootstrapping algorithm Espresso. The g-Monaka algorithm further formalizes the adjacency relation of character n-grams as a directed graph and applies the von Neumann kernel and the Laplacian kernel so that the negative effect of semantic drift, i.e., the phenomenon of semantically unrelated general instances being extracted, is reduced. The experiments show that g-Monaka substantially improves the performance of semantic category acquisition compared to conventional methods, including distributional similarity, bootstrapping-based Espresso, and its graph-based extension g-Espresso, in terms of F-value on the NE category task from unsegmented Japanese newspaper articles.
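
The idea behind the bidirectional adjacent constraints can be illustrated with a small sketch. This is not the authors' implementation: the function name `extract_between`, the `max_len` limit, the toy sentence, and the hand-picked patterns are all assumptions made for illustration, and the reliability scoring and iterative induction described in the abstract are omitted. The sketch only shows how requiring a known pattern on both sides fixes the boundaries of a candidate instance in unsegmented text.

```python
def extract_between(text, left_patterns, right_patterns, max_len=4):
    """Collect substrings that sit directly between a left and a right pattern.

    Hypothetical helper for illustration only; patterns are plain character
    n-grams and no reliability scores are computed.
    """
    found = set()
    for left in left_patterns:
        start = text.find(left)
        while start != -1:
            begin = start + len(left)  # candidate starts right after the left pattern
            for length in range(1, max_len + 1):
                end = begin + length
                # Keep the candidate only if some right pattern follows it immediately.
                if any(text.startswith(right, end) for right in right_patterns):
                    found.add(text[begin:end])
            start = text.find(left, start + 1)
    return found


# Toy unsegmented sentence (an assumption, not data from the paper).
text = "私は東京に行き、彼は大阪に行き、猫は京都に行った"
candidates = extract_between(text, left_patterns={"は"}, right_patterns={"に行"})
print(sorted(candidates))
```

With only the left pattern "は", any of "東", "東京", or "東京に" would be an admissible candidate; the right pattern "に行" is what pins the right-hand boundary, which is the segmentation effect the abstract attributes to the constraint.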

© 2011 JSAI (The Japanese Society for Artificial Intelligence)