Information and Media Technologies
Online ISSN : 1881-0896
ISSN-L : 1881-0896
Information Systems and Applications
Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources
Oanh Thi TranCuong Anh LeThuy Quang Ha
著者情報
ジャーナル フリー

2010 年 5 巻 2 号 p. 890-909

詳細
抄録

Word segmentation and POS tagging are two important problems included in many NLP tasks. They, however, have not drawn much attention of Vietnamese researchers all over the world. In this paper, we focus on the integration of advantages from several resourses to improve the accuracy of Vietnamese word segmentation as well as POS tagging task. For word segmentation, we propose a solution in which we try to utilize multiple knowledge resources including dictionary-based model, N-gram model, and named entity recognition model and then integrate them into a Maximum Entropy model. The result of experiments on a public corpus has shown its effectiveness in comparison with the best current models. We got 95.30% F1 measure. For POS tagging, motivated from Chinese research and Vietnamese characteristics, we present a new kind of features based on the idea of word composition. We call it morpheme-based features. Our experiments based on two POS-tagged corpora showed that morpheme-based features always give promising results. In the best case, we got 89.64% precision on a Vietnamese POS-tagged corpus when using Maximum Entropy model.

著者関連情報
© 2010 by The Association for Natural Language Processing
前の記事 次の記事
feedback
Top