2020 Volume 27 Issue 3 Pages 573-598
Recent work has explored various neural network-based methods for word segmentation and has achieved substantial progress, mainly in in-domain scenarios. There remains, however, the problem of performance degradation on target domains for which labeled data is not available. A key issue in overcoming this problem is how to use linguistic resources in target domains, such as unlabeled data and lexicons, which can be collected or constructed more easily than fully labeled data. In this work, we propose a novel method that uses unlabeled data and lexicons for cross-domain word segmentation. We introduce an auxiliary prediction task, Lexicon Word Prediction, into a character-based segmenter to identify occurrences of lexical entries in unlabeled sentences. Experiments demonstrate that the proposed method achieves accurate segmentation for various Japanese and Chinese domains.
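
To make the idea of identifying lexical entries in unlabeled sentences concrete, the following is a minimal sketch of how character-level targets for such an auxiliary task might be derived by matching a lexicon against raw text. The abstract does not specify the label scheme, so the B/I/E/S position tags, the function name lexicon_match_labels, and the max_len cutoff are all illustrative assumptions rather than the method as published.

```python
def lexicon_match_labels(sentence, lexicon, max_len=4):
    """Tag each character with the positions it occupies in matched lexicon words.

    Assumed (hypothetical) label scheme: 'B' = first character of a match,
    'I' = interior character, 'E' = last character, 'S' = single-character match.
    A character may receive several tags when matches overlap.
    """
    labels = [set() for _ in sentence]
    n = len(sentence)
    for start in range(n):
        for length in range(1, max_len + 1):
            end = start + length
            if end > n:
                break  # substring would run past the end of the sentence
            if sentence[start:end] not in lexicon:
                continue
            if length == 1:
                labels[start].add("S")
            else:
                labels[start].add("B")
                labels[end - 1].add("E")
                for i in range(start + 1, end - 1):
                    labels[i].add("I")
    return labels


# Toy example with Latin characters standing in for Japanese or Chinese text.
toy_lexicon = {"ab", "bcd", "d"}
toy_labels = lexicon_match_labels("abcd", toy_lexicon)
```

Because the labels come only from raw sentences and a lexicon, they can be produced for a target domain without any manual segmentation, which is what would let a character-based segmenter train on them alongside labeled source-domain data.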