Journal of the Japanese Association for Digital Humanities
Online ISSN : 2188-7276
Word Segmentation for Classical Chinese Buddhist Literature
Yu-Chun Wang

2020 Volume 5 Issue 2 Pages 154-172

Abstract

With the growth of digital humanities, information technologies take on more important roles in humanities research, including the study of religion. To analyze text for further processing, many text analysis tools treat a word as a unit. However, in Chinese, there are no word boundary markers. Word segmentation is required for processing Chinese texts. Although several word segmentation tools are available for modern Chinese, there is still no practical word segmentation tool for Classical Chinese, especially for Classical Chinese Buddhist literature. In this paper, we adopt unsupervised and supervised learning techniques to build Classical Chinese word segmentation approaches for processing Buddhist literature. Normalized variation of branching entropy (nVBE) is adopted for unsupervised word segmentation. Conditional random fields (CRF) are used to generate supervised models for Classical Chinese word segmentation. The performance of our word segmentation approach achieves an F-score of up to 0.9396. The experimental results show that our proposed method is effective for correctly segmenting most Classical Chinese sentences in Buddhist literature. Our word segmentation method can be a fundamental tool for further text analysis and processing research, such as word embedding, syntactic parsing, and semantic labeling.

1 Introduction

Word segmentation has been an essential research topic for Chinese text processing since the late 1990s. Because the Chinese writing system, unlike Western languages, does not use spaces to separate words, word segmentation is required for processing Chinese texts. With the growth of machine learning, numerous researchers have proposed algorithms or approaches to deal with Chinese word segmentation (CWS) and achieved satisfying results. However, these approaches are mostly designed for modern Mandarin Chinese (Huang and Zhao 2007). Most of the Chinese Buddhist scriptures are written in Classical Chinese. The vocabulary and syntax of Classical and modern Chinese are very different. Thus, current Chinese word segmentation tools do not achieve good results on Classical Chinese texts. For the study of Chinese Buddhism, to deal with Chinese Buddhist texts, we need a Classical Chinese word segmentation method that is calibrated for Chinese Buddhist texts. Some current research addresses the demand for Classical Chinese word segmentation (Tsai et al. 2017).

After decades of digitization efforts, the Chinese Buddhist Electronic Text Association (CBETA) comprises the largest open-access collection of Chinese Buddhist texts. CBETA continues to digitize Chinese Buddhist scriptures and texts. We can now analyze these high-quality Chinese digital texts and use computational means to add value to the texts.

Our first goal is to provide word-segmented versions of the Chinese Buddhist texts from CBETA. Word segmentation is the necessary and fundamental basis for advanced analysis of Chinese Buddhist literature, which underlies other forms of analysis such as word embedding, dependency parsing, syntactic parsing, and semantic role labeling. Therefore, in this paper, we propose a practical approach to performing word segmentation for Classical Chinese Buddhist literature.

2 Related Work

Word segmentation has been a highly researched topic in the Chinese natural language processing community in the past two decades. Statistical approaches have dominated this domain in the past decade. The most widely adopted statistical approach is the character-based tagging method that formulates word segmentation as a sequential tagging problem. Character-based tagging was first proposed by Xue (2003). For a given sequence of Chinese characters, Xue applied a Maxent tagger to assign each character with one of four positions-of-character (POC) tags, such as “left boundary,” “right boundary,” “middle,” and “single.” Once the given sequence is tagged, the boundaries of words are also determined. Peng, Feng, and McCallum (2004) first applied a linear-chain conditional random fields (CRF) model to Chinese word segmentation. CRF has been shown to be the optimal algorithm for sequence classification (Rosenfeld, Feldman, and Fresko 2006). This approach has been followed by many later researchers (Tseng et al. 2005; Zhao et al. 2006, 2010; Sun, Wang, and Li 2012; Zhang et al. 2013). With the growth of deep learning, instead of extracting discrete features, many researchers started to use various neural networks for automatic feature learning and discrimination. Zheng, Chen, and Xu (2013) utilized a Convolutional Neural Network (CNN) model for Chinese word segmentation to get a performance comparable to the CRF model. Chen et al. (2015) adopted a Long Short-Term Memory (LSTM) model that can keep longer-distance dependencies to achieve better results in sequence tagging.

In addition to supervised approaches, unsupervised word segmentation is also a burgeoning research topic with various methods. One of the major unsupervised methods is based on “goodness measurement.” Zhao and Kit (2008) adopted several goodness measures in unsupervised models to compare their performance. The goodness measures they used include description length gain (DLG), accessor variety (AV), and boundary entropy (BE). Wang et al. (2011) proposed an iterative model, namely ESA (“Evaluation, Selection, and Adjustment”), with a new goodness measure algorithm which uses a local maximum strategy. However, ESA requires a manually segmented training corpus to find the best values of parameters. Magistry and Sagot (2012) proposed a new model, namely normalized variation of branching entropy (nVBE), based on the variation of branching entropy proposed by Jin and Tanaka-Ishii (2006). They added normalization and Viterbi decoding to remove most of the parameters and thresholds from the model and improve performance over the previous method.

Thus, numerous approaches for word segmentation for modern Mandarin have been proposed, but Classical Chinese word segmentation has received much less attention. Qiu and Huang (2008) proposed a heuristic hybrid Classical Chinese word segmentation method based on a maximum matching algorithm with a Chinese dictionary Hanyu Da Cidian (漢語大辭典). Shi, Li, and Chen (2010) proposed a unified word segmentation and part-of-speech tagging approach based on CRFs for pre-Qin documents. They manually annotated the Zuo Zhuan (左傳) to build the CRF models with predefined features. Lee and Kong (2014) and Wong and Lee (2016) describe the first attempts to tackle the word segmentation problem for Chinese Buddhist literature. Their aim is to build dependency treebanks of the Chinese Buddhist canon based on Tripiṭaka Koreana. Word segmentation is one of the preprocessing steps to construct dependency parsing trees. They adopted CRF models with predefined features based on external dictionaries. The training dataset was compiled from only four sutras with about fifty thousand characters. Although their method achieved satisfactory results, the smaller size of their dataset and the fact that it is not publicly accessible may limit usage by others. In the research described in this paper, we constructed a dataset with wider and more diverse coverage and aim to provide word-segmented Chinese Buddhist texts based on CBETA.

3 Word Segmentation Method

Below we adopt two different methods to develop a suitable word segmentation approach for Classical Chinese Buddhist literature: normalized variation of branching entropy (nVBE) and conditional random fields (CRF). nVBE is an unsupervised word segmentation method which does not require labeled data for training. The CRF method is supervised, which requires human-labeled data to build models. These two approaches have both been proven to achieve good performance on modern Mandarin word segmentation. Building on these approaches, we attempt to develop our own Classical Chinese word segmentation method.

3.1 Model 1: Unsupervised Learning Approach

nVBE is an unsupervised method derived from the method based on branching entropy (Magistry and Sagot 2012). The major idea of branching entropy is based on the hypothesis that if sequences produced by human language were random, we would expect the branching entropy of a sequence (n-grams in a corpus) to decrease as we increase the length of the sequence. Thus, the variation of the branching entropy (VBE) should be negative. If the entropy of successive tokens increases, the location is at a word border. The branching entropy can be defined as follows.

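Given an n-gram $x_{0..n} = x_{0..1}\,x_{1..2} \ldots x_{n-1..n}$ and its right context $\chi_{\rightarrow}$ (the set of characters that can follow it), the right branching entropy (RBE) can be defined as

$$h_{\rightarrow}(x_{0..n}) = H(\chi_{\rightarrow} \mid x_{0..n}) = -\sum_{x \in \chi_{\rightarrow}} P(x \mid x_{0..n}) \log P(x \mid x_{0..n}) \tag{1}$$

Similarly, the left branching entropy (LBE) is defined as

$$h_{\leftarrow}(x_{0..n}) = H(\chi_{\leftarrow} \mid x_{0..n}) \tag{2}$$

where $\chi_{\leftarrow}$ is the left context of $x_{0..n}$ (the set of characters that can precede it).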
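As an illustration of the branching entropy above, the following minimal sketch estimates the right branching entropy of every n-gram of a fixed length from raw character counts. The function name and the toy corpus are assumptions made for this example and are not taken from the paper's implementation; the left branching entropy can be obtained in the same way by reversing every line first.

```python
import math
from collections import Counter, defaultdict


def right_branching_entropy(corpus, n):
    """Map each n-gram to the entropy of the character that follows it."""
    followers = defaultdict(Counter)
    for line in corpus:
        # Stop one position early so every n-gram has a following character.
        for i in range(len(line) - n):
            followers[line[i:i + n]][line[i + n]] += 1

    entropy = {}
    for gram, counts in followers.items():
        total = sum(counts.values())
        entropy[gram] = -sum(
            (c / total) * math.log(c / total) for c in counts.values()
        )
    return entropy


# Hypothetical toy corpus: "a" is followed by "b" three times and "c" once,
# while "b" is always followed by "a".
toy_entropy = right_branching_entropy(["abab", "abac"], 1)
```

In this toy corpus the entropy after "b" is zero because its successor is fully predictable, whereas the entropy after "a" is positive.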

Next, the variation of branching entropy (VBE) is defined in both the right and left directions as follows:

$$\delta h_{\rightarrow}(x_{0..n}) = h_{\rightarrow}(x_{0..n}) - h_{\rightarrow}(x_{0..n-1}), \qquad \delta h_{\leftarrow}(x_{0..n}) = h_{\leftarrow}(x_{0..n}) - h_{\leftarrow}(x_{1..n}) \tag{3}$$

The VBEs are not directly comparable for strings of different lengths and need to be normalized. Following Magistry and Sagot (2012), VBEs are normalized by subtracting the mean of the VBEs of strings of the same length. Then, for different lengths of n-grams, the distributions of the VBEs at different positions inside the n-gram are compared to determine its boundaries. Therefore, for a sequence, the word segmentation problem can be formulated as a maximization problem: find the segmentation that yields the maximal nVBE. For a character sequence $s$, if we call $\mathrm{Seg}(s)$ the set of all possible segmentations, then we are looking for

$$\underset{W \in \mathrm{Seg}(s)}{\arg\max} \sum_{w_i \in W} a(w_i) \cdot \mathrm{len}(w_i) \tag{4}$$

where $W$ is the segmentation corresponding to the sequence of words $w_0 w_1 \ldots w_m$, and $\mathrm{len}(w_i)$ is the length of a word $w_i$, used here to allow us to compare segmentations that result in a different number of words. Then $a(w_i)$ is defined as

$$a(x) = \tilde{\delta} h_{\rightarrow}(x) + \tilde{\delta} h_{\leftarrow}(x) \tag{5}$$

where $\tilde{\delta}$ is the normalized VBE.
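The maximization in equation (4) can be solved with dynamic programming over the end positions of the sequence. The sketch below is one possible way to do this, not the paper's code: the function name, the `max_len` cap on word length, and the toy autonomy scores are all hypothetical. In practice the autonomy function would return the sum of normalized VBEs from equation (5).

```python
def best_segmentation(s, autonomy, max_len=4):
    """Return the segmentation of s that maximizes sum(a(w) * len(w))."""
    # best[i] holds (score, words) for the best segmentation of s[:i].
    best = [(0.0, [])] + [(float("-inf"), [])] * len(s)
    for end in range(1, len(s) + 1):
        for start in range(max(0, end - max_len), end):
            word = s[start:end]
            score = best[start][0] + autonomy(word) * len(word)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[len(s)][1]


# Hypothetical autonomy scores; strings not listed get a low default score.
toy_scores = {"ab": 1.0, "c": 0.5, "d": 0.5, "a": -0.2, "b": -0.2}


def toy_autonomy(word):
    return toy_scores.get(word, -1.0)


toy_result = best_segmentation("abcd", toy_autonomy)
```

Multiplying each score by the word length, as in equation (4), keeps a segmentation into a few long words comparable with one into many short words.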

3.2 Model 2: Supervised Learning Approach

For the supervised learning approach, we adopt the conditional random field (CRF) method as our learning model. CRFs are undirected graphical models trained to maximize a conditional probability (Lafferty, McCallum, and Pereira 2001). A linear-chain CRF with parameters $\Lambda = \{\lambda_1, \lambda_2, \ldots\}$ defines the conditional probability of a state sequence $y = y_1 \ldots y_T$, given an input sequence $x = x_1 \ldots x_T$, as

$$P_{\Lambda}(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \right) \tag{6}$$

where $Z_x$ is the normalization factor that makes the probabilities of all state sequences sum to one; $f_k(y_{t-1}, y_t, x, t)$ is often a binary-valued feature function, and $\lambda_k$ is its weight. The feature functions can measure any aspect of a state transition, $y_{t-1} \rightarrow y_t$, and the entire observation sequence, $x$, centered at the current time step, $t$.
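To make equation (6) concrete, the following sketch evaluates the conditional probability of a label sequence for a very short input by enumerating every possible label sequence to obtain $Z_x$. This brute-force normalization is only practical for toy inputs (real toolkits use dynamic programming), and the two feature functions, their weights, and the `START` symbol are hypothetical choices made for illustration.

```python
import itertools
import math


def sequence_score(y, x, weights, features, start="START"):
    """Weighted sum of feature functions over all positions (exponent of Eq. 6)."""
    total, prev = 0.0, start
    for t, label in enumerate(y):
        total += sum(w * f(prev, label, x, t) for w, f in zip(weights, features))
        prev = label
    return total


def crf_probability(y, x, weights, features, labels):
    """P(y | x) with Z_x computed by enumerating all label sequences."""
    z = sum(
        math.exp(sequence_score(candidate, x, weights, features))
        for candidate in itertools.product(labels, repeat=len(x))
    )
    return math.exp(sequence_score(y, x, weights, features)) / z


# Hypothetical binary features: a B->E transition, and tagging "佛" as S.
toy_features = [
    lambda prev, cur, x, t: 1.0 if (prev, cur) == ("B", "E") else 0.0,
    lambda prev, cur, x, t: 1.0 if cur == "S" and x[t] == "佛" else 0.0,
]
toy_weights = [1.5, 2.0]
toy_labels = ["B", "I", "E", "S"]

p = crf_probability(("B", "E", "S"), "菩提佛", toy_weights, toy_features, toy_labels)
```

Because of the division by $Z_x$, the probabilities of all label sequences for the same input sum to one.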

3.2.1 CRF Features

As shown in equation (6), to output the probability of the label sequences, CRFs sum up all the feature functions to compute the output probability. For Classical Chinese word segmentation, we define three features as follows:

  1.    Input character
  2.    Character type
  3.    Maximum matching with Buddhist dictionaries

The first feature is the input character itself. The second feature is the type of input characters. Input characters are categorized into three types: Chinese character, numeric character, and punctuation character. The numeric characters are all Arabic numerals. The punctuation characters are Chinese punctuation marks, such as ,。、?!「」, etc.

Classical Chinese Buddhist literature is mainly written in Classical Chinese. However, unlike most Classical Chinese literature, which tends to use one- or two-syllable words, Classical Chinese Buddhist literature contains many longer multisyllabic words, especially transliterations from Indic languages. Therefore, we employ an array of Buddhist dictionaries as features for the CRF models. Each input sentence is segmented by the maximum matching algorithm (Wong and Chan 1996) with a Buddhist dictionary to obtain tokens. Tokens shorter than two characters are discarded. For each character in the input sentence, if the character belongs to one of the remaining tokens, the feature value is ‘B’ if it is the first character of the token, ‘E’ if it is the last character, and ‘I’ otherwise. If the character does not belong to any token, the feature value is ‘O’. We use the following dictionaries:

  1.    Fo Guang Online Dictionary (佛光大辭典, in Chinese) [2]
  2.    Mandarin Dictionary of Taiwan’s Ministry of Education (教育部重編國語辭典, in Chinese) [3]
  3.    Chinese Buddhist Encyclopedia (中華佛典百科全書, in Chinese) [4]
  4.    DILA Glossary [5]

We treat each dictionary as a distinct feature of our CRF model.
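The dictionary feature described above can be sketched as follows. This is a minimal illustration that assumes forward (left-to-right) maximum matching and a hypothetical `max_len` limit on token length; the two-entry dictionary is a toy stand-in for the real dictionaries, and the function names are invented for this example.

```python
def forward_maximum_matching(sentence, dictionary, max_len=8):
    """Greedily take the longest dictionary entry at each position."""
    tokens, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def dictionary_feature(sentence, dictionary):
    """Return one B/I/E/O feature value per character of the sentence."""
    values = []
    for token in forward_maximum_matching(sentence, dictionary):
        if len(token) < 2:
            values.append("O")  # tokens shorter than two characters are dropped
        else:
            values.extend(["B"] + ["I"] * (len(token) - 2) + ["E"])
    return values


# Toy dictionary for illustration only.
toy_dictionary = {"毘舍離", "菩提"}
toy_values = dictionary_feature("毘舍離有菩提", toy_dictionary)
```

Running the same function once per dictionary yields one feature column per dictionary, which matches the idea of treating each dictionary as a separate feature.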

3.2.2 CRF Model Training

We formulate the word segmentation problem as a sequence-tagging problem, which aims to tag each input character with a predefined label. The Classical Buddhist texts are split into sentences at Chinese punctuation marks. Each character in a sentence is then taken as one data row for the CRF model. We adopt the {B, I, E, S} tagging scheme, which is widely used in Chinese word segmentation. A character is tagged B if it is the first character of a multi-character word, I if it is inside a word but is neither the first nor the last character, and E if it is the last character of a multi-character word. A character that forms a single-character word is tagged S.
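The {B, I, E, S} scheme amounts to a reversible mapping between a list of words and a list of per-character tags. The sketch below shows both directions; the function names are hypothetical, and the segmented example is illustrative rather than taken from the annotated dataset.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one B/I/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from characters and their B/I/E/S tags."""
    words, current = [], ""
    for char, tag in zip(chars, tags):
        current += char
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


# Illustrative segmentation, not an excerpt from the training data.
toy_words = ["毘舍離", "城", "菩提"]
toy_tags = words_to_tags(toy_words)
```

The second function is what turns the tag sequence predicted by the model back into segmented text.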

We adopt the CRF++ open-source toolkit [6]. We train our CRF models with the unigram and bigram features over the input Chinese character sequences. The features are:

  1.    Unigram: $s_{-2}, s_{-1}, s_{0}, s_{1}, s_{2}$
  2.    Bigram: $s_{-1}s_{0}, s_{0}s_{1}$
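The context window behind these features can be illustrated with a small function that collects the five unigrams and two bigrams around a position. This sketch assumes that $s_0$ is the current character and that negative and positive indices refer to preceding and following characters. The `<PAD>` placeholder for positions outside the sentence and the feature key names are assumptions made here; a toolkit such as CRF++ generates these features from a template file instead.

```python
def window_features(chars, t, pad="<PAD>"):
    """Collect unigram and bigram context features for position t."""

    def at(offset):
        i = t + offset
        return chars[i] if 0 <= i < len(chars) else pad

    features = {f"uni[{o}]": at(o) for o in (-2, -1, 0, 1, 2)}
    features["bi[-1,0]"] = at(-1) + "/" + at(0)
    features["bi[0,1]"] = at(0) + "/" + at(1)
    return features


# Features for the first character of a four-character sentence.
toy_window = window_features(list("如是我聞"), 0)
```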

4 Evaluation

4.1 Dataset

For the unsupervised learning approach, because it does not require annotated data, we use the whole plain text of Taishō Tripiṭaka from CBETA as the training corpus. By contrast, the supervised learning approach based on CRF is in need of human-annotated data for training. Therefore, we compile a dataset with segmentation annotations from two main resources, Middle Chinese Texts from Academia Sinica and a corpus from Beijing Longquan Temple. These two resources are both based on the Taishō Tripiṭaka text in CBETA with annotations indicating word segmentation. Table 1 shows statistics about these two resources.

Since the word segmentation criteria used by these two resources are not consistent, human experts from our team at the Dharma Drum Institute of Liberal Arts (DILA) recheck and unify both to assemble one final dataset. Figure 1 shows an example of the human-annotated texts in our dataset.

4.2 Evaluation Metrics

To measure the performance of our methods on Classical Chinese word segmentation, we adopt the measurements of the SIGHAN Chinese Word Segmentation Bakeoff (Sproat and Emerson 2003). The measurements used by SIGHAN include precision, recall, and F-score, which are defined as follows:

  
$$\mathrm{Precision} = \frac{|\text{correctly segmented words}|}{|\text{total words the system generates}|} \tag{7}$$

$$\mathrm{Recall} = \frac{|\text{correctly segmented words}|}{|\text{total words in the corpus}|} \tag{8}$$

$$F\text{-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$

Precision is the proportion of the words generated by the system that are correct; recall is the proportion of the words in the corpus that the system segments correctly; and F-score is the harmonic mean of precision and recall.
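These metrics can be computed by comparing word spans, so that a predicted word counts as correct only when both of its boundaries match the reference. The sketch below is a minimal version for a single sentence, with hypothetical function names; the gold segmentation in the example is illustrative, and the predicted one mirrors the “毘舍/婆” split discussed later in the paper.

```python
def to_spans(words):
    """Turn a list of words into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(gold_words, predicted_words):
    """Word-level precision, recall, and F-score for one sentence."""
    gold, predicted = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# One of three predicted words is correct; one of two gold words is found.
toy_p, toy_r, toy_f = precision_recall_f(["毘舍婆", "佛"], ["毘舍", "婆", "佛"])
```

For a whole corpus, the counts of correct, predicted, and gold words would be accumulated over all sentences before the ratios are taken.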

4.3 Evaluation Results

For evaluating the supervised model, we adopt the commonly used method of tenfold cross-validation. The whole dataset is randomly partitioned into ten subsets of equal size. Each subset is sequentially used as a test dataset for evaluating the model, and the remaining nine subsets are used as a training dataset. Although the unsupervised model does not require training data, to compare its performance with the supervised model, we also used ten segments for the unsupervised model.
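The tenfold partitioning can be sketched with the standard library alone. The function name, the fixed random seed, and the use of integer placeholders for sentences are assumptions for this example; the paper does not state how its random partition was generated.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs in which each fold is the test set once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    # Dealing items round-robin gives folds whose sizes differ by at most one.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


# 100 placeholder "sentences" give ten folds of ten items each.
toy_splits = list(k_fold_splits(range(100), k=10))
```

The reported score is then the average of the ten per-fold scores.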

Table 2 shows the performance of the unsupervised model (nVBE) on the test dataset. The overall average F-score is 0.7303. Table 3 shows the performance of the supervised model (CRF) on the same test dataset. The CRF model clearly outperforms the nVBE model: its average F-score is up to 0.9396. The CRF model thus achieves better results in word segmentation for Chinese Buddhist texts.

To emphasize the need for a word segmentation method designed for Classical Chinese, we perform another experiment. Two popular traditional Chinese word segmentation tools, the CKIP segmenter and Jieba, are used to segment the test dataset. The CKIP segmenter [7] is a Chinese word segmentation tool developed by the Chinese Knowledge and Information Processing (CKIP) lab in Academia Sinica. Jieba [8] is an open-source Chinese word segmentation module which builds directed acyclic graphs for all possible word combinations and then applies dynamic programming techniques to search for the best segmentation based on word frequency. Jieba can take external user-defined dictionaries for segmentation. Thus, we also add the Buddhist dictionaries used in the CRF features to Jieba. In addition, we apply the forward maximum matching (FMM) algorithm with dictionaries as a baseline for word segmentation. Table 4 shows the performance comparison between the CKIP segmenter, Jieba, FMM, and our two models. The results show that the general modern Chinese word segmentation tools cannot achieve good results on Classical Chinese Buddhist texts. Our supervised model is more suitable and effective for Classical Chinese word segmentation.
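The idea of a frequency-based search over a word graph can be sketched in a few lines. The following is a simplified illustration of the general technique and not Jieba's actual code: the function name and the toy frequency table are hypothetical, unseen single characters are given a count of one as an assumption, and any special handling of unknown words that a real tool may perform is ignored.

```python
import math


def dag_segment(sentence, freq):
    """Pick the path through the word graph with the highest log-probability."""
    total = sum(freq.values())
    n = len(sentence)
    # dag[i] lists every end index j such that sentence[i:j] is a candidate
    # word; a single character is always allowed.
    dag = {
        i: [j for j in range(i + 1, n + 1) if j == i + 1 or sentence[i:j] in freq]
        for i in range(n)
    }
    # route[i] = (best log-probability of sentence[i:], end of its first word)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j], 1) / total) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j])
        i = j
    return words


# Hypothetical word frequencies for illustration only.
toy_freq = {"菩提": 50, "菩": 2, "提": 2, "心": 10}
toy_words = dag_segment("菩提心", toy_freq)
```

Adding dictionary entries to the frequency table plays the same role as supplying a user-defined dictionary: it creates new edges in the graph for the search to choose from.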

4.4 Comparison between Model 1 and Model 2

Although model 1 takes the full texts of Taishō Tripiṭaka to build the nVBE model, the performance of nVBE is still not comparable to our model 2, even though model 2 is trained on a smaller dataset. After examining the segmentation results of the nVBE model, we found that the nVBE model tends to combine some fixed expressions into a word. For example, the sentence “如是我聞” (Thus have I heard) is regarded as a word by the nVBE model since these four characters appear together so frequently in Chinese Buddhist texts. By contrast, if some terms, especially transliterations, share the same prefix or suffix, the nVBE model usually partitions the prefix and the following characters. For instance, “毘舍婆” (Viśvabhu) is segmented as “毘舍/婆” because there are several terms sharing the same prefix, such as “毘舍離” (Vaiśālī) and “毘舍闍” (Piśāca). The term “阿耨多羅三藐三菩提” (anuttarā-samyak-sambodhi) partitioned as “阿耨多羅/三藐三/菩提” by the nVBE model also belongs to this case since “菩提” (bodhi) is a high-frequency term in Chinese Buddhist texts. The drawback of the unsupervised nVBE model is that it only takes the variety of the left and right context of n-grams into account. It cannot effectively be aware of differences between morphemes and actual words. The supervised model, model 2 based on CRF, can overcome this problem because it can learn to distinguish them from the annotated data.

4.5 Error Cases of Model 2

Although our model 2 achieves good performance in all three metrics, there are still some cases that cannot be correctly segmented. Table 5 shows the top 20 error cases generated by model 2.

After a detailed examination of the segmentation results, most of these errors can be divided into two types. The first type occurs because segmentation standards in our training dataset were not consistent. For example, the most common and twentieth-most-common term are both “不可” (cannot). In the training data, some of them are annotated as two words “不/可”, but some of them are labeled as a single word “不可”. Despite the different annotations, the meaning of the term “不可” is the same in all of the texts. This type of error is the biggest problem because so many segmentation errors are of this type, such as “不能”, “乃至”, “佛說”, “說法”, and “一切眾生” (shown in table 5). To solve this problem, a consistent segmentation standard and human correction are required. With our segmentation method, these inconsistent annotations can be extracted automatically to reduce the human efforts needed for correction.

The other case is caused by a fundamental ambiguity in word segmentation. Take the thirteenth-most-common term “天人” as an example. In some sentences, it represents a single word “天人” which means “deva”; however, in some other sentences, it should be taken as two words “天/人” because it means “devas and humans.” It is difficult for a supervised learning model to distinguish these two cases without syntactic and semantic clues.

5 Conclusion

Word segmentation is required for many forms of computational analysis. In this paper, we experiment with both unsupervised and supervised approaches to implementing word segmentation for Classical Chinese Buddhist texts. We employ the normalized variation of branching entropy (nVBE) as an example of an unsupervised model and the conditional random fields (CRF) as a supervised model. To evaluate our methods, we constructed a training set from annotated corpora created by Academia Sinica and Beijing Longquan Temple, which are both based on CBETA texts. The F-scores of our unsupervised and supervised approaches average 0.7307 and 0.9396, respectively. The results show that the supervised approach based on the CRF model is more effective for segmenting words in Classical Chinese Buddhist texts and outperforms the nVBE approach, as well as the widely used modern Chinese word segmentation methods CKIP and Jieba, which can be regarded as baseline tests. In the near future, we will continue addressing the annotation inconsistency in the dataset to improve the performance of the supervised approach and apply our method to segment all the texts in CBETA. The word segmentation service and the annotated corpora will also be open access. With this word segmentation method, more advanced natural language processing techniques can be developed for application to Classical Chinese Buddhist texts.

Footnotes

1. Department of Buddhist Studies, Dharma Drum Institute of Liberal Arts, Taiwan

2. Accessed May 2019, https://www.fgs.org.tw/fgs_book/fgs_drser.aspx.

3. Accessed May 2019, http://dict.revised.moe.edu.tw/cbdic/search.htm.

4. Accessed May 2019, http://buddhism.lib.ntu.edu.tw/DLMBS/search/search_detail.jsp?seq=269284.

5. Accessed May 2019, http://glossaries.dila.edu.tw.

6. Version 0.58, https://taku910.github.io/crfpp/.

7. Accessed May 2019, http://ckipsvr.iis.sinica.edu.tw.

8. Accessed May 2019, https://github.com/fxsjy/jieba.

References
 
© Yu-Chun Wang