Word Segmentation for Classical Chinese Buddhist Literature

Yu-Chun Wang

doi:10.17928/jjadh.5.2_154

Abstract

With the growth of digital humanities, information technologies are taking on more important roles in humanities research, including the study of religion. Many text analysis tools treat the word as the basic unit of processing. Chinese, however, has no word boundary markers, so word segmentation is required before Chinese texts can be processed. Although several word segmentation tools are available for modern Chinese, there is still no practical word segmentation tool for Classical Chinese, and in particular none for Classical Chinese Buddhist literature. In this paper, we adopt unsupervised and supervised learning techniques to build Classical Chinese word segmentation approaches for processing Buddhist literature. Normalized variation of branching entropy (nVBE) is adopted for unsupervised word segmentation, and conditional random fields (CRF) are used to generate supervised models for Classical Chinese word segmentation. Our word segmentation approach achieves an F-score of up to 0.9396. The experimental results show that the proposed method correctly segments most Classical Chinese sentences in Buddhist literature. Our word segmentation method can serve as a fundamental tool for further text analysis and processing research, such as word embedding, syntactic parsing, and semantic labeling.
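To make the reported F-score concrete, the following is a minimal sketch of how a word-level F-score for segmentation is commonly computed: a predicted word counts as correct only when both of its boundaries match the reference. The abstract does not state the exact scoring procedure used, so this is an assumption, and the function names (`word_spans`, `segmentation_prf`) and the example segmentation are illustrative only.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold, predicted):
    """Return word-level (precision, recall, F-score) for one sentence.

    Assumption: a predicted word is correct only if the same span
    appears in the gold segmentation (both boundaries match).
    """
    if "".join(gold) != "".join(predicted):
        raise ValueError("gold and predicted must cover the same characters")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Illustrative segmentations only; not taken from the paper's data.
gold = ["如是", "我", "聞"]
predicted = ["如是", "我聞"]
precision, recall, f_score = segmentation_prf(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F={f_score:.4f}")
```

In this toy case only one of the two predicted words matches a reference span, so precision is 1/2, recall is 1/3, and the F-score is 0.4. Corpus-level scores would normally pool the span counts over all sentences before computing the ratios.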
