Journal of Natural Language Processing (自然言語処理)
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
Paper
Towards a Consistent Segmentation Level across Multiple Chinese Word Segmentation Corpora
Fei Cheng, Kevin Duh, Yuji Matsumoto
Journal, free access

2017, Volume 24, Issue 5, pp. 669-686

Abstract

One of the crucial problems facing current Chinese natural language processing (NLP) is the ambiguity of word boundaries, which raises many further issues, such as different word segmentation standards and the prevalence of out-of-vocabulary (OOV) words. We assume that such issues can be better handled if a consistent segmentation level is created among multiple corpora. In this paper, we propose a simple strategy to transform two different Chinese word segmentation (CWS) corpora into a new consistent segmentation level, which enables easy extension of the training data size. The extended data is verified to be highly consistent by 10-fold cross-validation. In addition, we use a synthetic word parser to analyze the internal structure information of the words in the extended training data to convert the data into a more fine-grained standard. Then we use two-stage Conditional Random Fields (CRFs) to perform fine-grained segmentation and chunk the segments back to the original Peking University (PKU) or Microsoft Research (MSR) standard. Due to the extension of the training data and reduction of the OOV rate in the new fine-grained level, the proposed system achieves state-of-the-art segmentation recall and F-score on the PKU and MSR corpora.
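
The chunking step mentioned in the abstract, in which fine-grained segments are grouped back into words of a coarser standard, can be illustrated with a minimal sketch. This is not the authors' implementation: the B/I label scheme, the function names `chunk_labels` and `merge_chunks`, and the example sentence are assumptions made for illustration, and the sketch further assumes that every coarse word boundary is also a fine-grained boundary. In the paper the labels would be predicted by the second-stage CRF; here they are derived directly from a reference segmentation.

```python
from typing import List


def chunk_labels(fine: List[str], coarse: List[str]) -> List[str]:
    """Label each fine-grained segment as B (begins a coarse word) or I (continues one).

    Assumes `fine` is a refinement of `coarse`: both cover the same characters
    and every coarse boundary is also a fine boundary.
    """
    if "".join(fine) != "".join(coarse):
        raise ValueError("segmentations must cover the same character sequence")

    # Character offsets at which a coarse word starts.
    starts = set()
    pos = 0
    for word in coarse:
        starts.add(pos)
        pos += len(word)

    labels = []
    pos = 0
    for seg in fine:
        labels.append("B" if pos in starts else "I")
        pos += len(seg)
    return labels


def merge_chunks(fine: List[str], labels: List[str]) -> List[str]:
    """Rebuild coarse words by concatenating each B segment with the I segments after it."""
    words: List[str] = []
    for seg, label in zip(fine, labels):
        if label == "B" or not words:
            words.append(seg)
        else:
            words[-1] += seg
    return words


# Hypothetical example: a fine-grained split of a single coarse word plus one more word.
fine_segments = ["中国", "人民", "银行", "行长"]
coarse_words = ["中国人民银行", "行长"]

labels = chunk_labels(fine_segments, coarse_words)
restored = merge_chunks(fine_segments, labels)
```

Because the labels are attached to fine-grained segments rather than to characters, a word that is out-of-vocabulary at the coarse level can still be assembled from segments seen in training, which is the intuition behind the OOV reduction the abstract reports.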

© 2017 The Association for Natural Language Processing