Character-to-Word Attention for Word Segmentation

Although little effort has been devoted to exploring neural models for Japanese word segmentation, such models have been actively applied to Chinese word segmentation because they minimize the need for feature engineering. In this work, we propose a character-based neural model that makes joint use of word information useful for disambiguating word boundaries. For each character in a sentence, our model uses an attention mechanism to estimate the importance of multiple candidate words that contain the character. Experimental results show that learning attention to proper words leads to accurate segmentations and that our model achieves better performance than existing statistical and neural models on both in-domain and cross-domain Japanese word segmentation datasets.

Fig. 1
An example of candidate words w_1:8 retrieved from a vocabulary for the sentence x_1:5. Strings in angle brackets "⟨⟩" and in parentheses "()" respectively indicate words' (typical) readings and English translations. The value in each cell (i, j) indicates whether the i-th character is contained in the j-th word, i.e., δ_ij in Eq. (4).
Limited effort has been devoted to neural models for Japanese word segmentation. Instead, popular approaches are based on statistical learning algorithms, such as conditional random fields (CRFs) (Kudo, Yamamoto, and Matsumoto 2004) and logistic regression (Neubig, Nakata, and Mori 2011). More recently, Morita, Kawahara, and Kurohashi (2015) achieved improved performance by integrating a word-level language model, based on a recurrent neural network (RNN), into their statistical baseline model. In contrast, Kitagawa and Komachi (2018) demonstrated performance better than that of an existing statistical segmenter (Neubig et al. 2011) by applying a pure neural model based on a general long short-term memory (LSTM) architecture for sequence labeling. Even so, more specialized neural architectures for word segmentation leave room for further research into neural models' potential. We therefore investigate a novel neural word segmentation model that utilizes word information expected to be useful for resolving ambiguity in word segmentation.
Within a sentence, a character has multiple candidate words that contain it, but each candidate word's plausibility differs depending on the target character's context. For example, more than three candidate words exist for the characters x_3, x_4, and x_5 in the sentence x_1:5 in Fig. 1, so the proper word w_8 must be identified from among the candidates. Motivated by this consideration, we propose a character-based word segmentation model that incorporates word information into a BiLSTM-CRF architecture (Huang, Xu, and Yu 2015; Chen, Qiu, Zhu, Liu, and Huang 2015b). Different from similar work in Chinese word segmentation (Wang and Xu 2017; Yang, Zhang, and Liang 2019), we apply an attention mechanism (Bahdanau, Cho, and Bengio 2015; Luong, Pham, and Manning 2015) to learn and distinguish all possible candidate words' importance for a character within a context. Note that although we report experimental results only on Japanese datasets, we use no features specific to Japanese, so our model can be applied to other unsegmented languages such as Chinese.
Our contributions are as follows:
• To distinguish and leverage the importance of possible candidate words in different contexts, we introduce word information and an attention mechanism into a character-based word segmentation framework.
• We empirically demonstrate that learning accurate attention to proper candidate words leads to correct segmentations.
• Compared with existing Japanese word segmentation models, our model achieves better performance on Japanese datasets in both in-domain and cross-domain scenarios.

Task Definition
Word segmentation can be regarded as a character-level sequence labeling task. Given a sentence x = x_1:n = (x_1, · · · , x_n) of length n, each character x_i is assigned a segmentation label y_i from a tag set T, and a label sequence y = y_1:n = (y_1, · · · , y_n) is predicted. We employ the tag set T = {B, I, E, S}, where B, I, and E represent the beginning, inside, and end of a multi-character word, and S represents a single-character word (Xue 2003).
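As a concrete illustration (ours, not from the paper), the following Python sketch converts a word-segmented sentence into the character-level {B, I, E, S} labels described above; the helper name to_bies is hypothetical.

```python
def to_bies(words):
    """Convert a list of words into character-level B/I/E/S segmentation labels."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")                   # single-character word
        else:
            labels.append("B")                   # first character of a multi-character word
            labels.extend("I" * (len(w) - 2))    # inside characters
            labels.append("E")                   # last character
    return labels

# e.g., a segmented sentence ["今日", "は", "晴れ"] -> ['B', 'E', 'S', 'B', 'E']
print(to_bies(["今日", "は", "晴れ"]))
```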

Baseline Model
For our baseline model, we use a BiLSTM-CRF architecture that has been successfully applied to sequence labeling tasks, including word segmentation (Chen et al. 2015b). The model consists of a character embedding layer, recurrent layers, and a CRF layer, as illustrated in Fig. 2.

Character Embedding Layer
Let V_c be a character vocabulary. In a given sentence, each character is transformed into a d_c-dimensional character embedding e^c by a lookup operation that retrieves the corresponding column of the embedding matrix E^c ∈ R^{d_c×|V_c|}.

Fig. 2
Architectures of our baseline and proposed models, with common components (light gray) and additional components for the proposed model (dark gray).

Recurrent Layers
A sequence of character embeddings e^c_1:n = (e^c_1, · · · , e^c_n) is fed into an RNN to derive contextualized representations h_1:n = (h_1, · · · , h_n), which we call character context vectors. We adopt a stacked (multi-layer) and bidirectional variant of an LSTM (Hochreiter and Schmidhuber 1997) network, which addresses the difficulty of learning long-term dependencies and the vanishing gradient problem.
Hidden vectors h^(l)_1:n of the l-th bidirectional LSTM (BiLSTM) layer are calculated by a forward LSTM (LSTM_f) and a backward LSTM (LSTM_b):

→h^(l)_i = LSTM_f(h^(l−1)_1, · · · , h^(l−1)_i),
←h^(l)_i = LSTM_b(h^(l−1)_i, · · · , h^(l−1)_n),
h^(l)_i = →h^(l)_i ⊕ ←h^(l)_i,

where ⊕ denotes a concatenation operation and h^(0)_i = e^c_i. More concretely, each forward LSTM calculates forward hidden vectors →h_1:n from an input sequence v_1:n = (v_1, · · · , v_n) of d_v-dimensional vectors as follows:

i_t = σ(W^(i) v_t + U^(i) →h_{t−1} + b^(i)),
f_t = σ(W^(f) v_t + U^(f) →h_{t−1} + b^(f)),
o_t = σ(W^(o) v_t + U^(o) →h_{t−1} + b^(o)),
c̃_t = tanh(W^(c) v_t + U^(c) →h_{t−1} + b^(c)),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,
→h_t = o_t ⊙ tanh(c_t),

where ⊙ denotes element-wise multiplication, σ is the sigmoid function, and i, f, and o indicate an input gate, a forget gate, and an output gate, respectively.
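As a rough illustration of the character embedding and recurrent layers, here is a minimal PyTorch sketch (ours, not the authors' implementation; the dimension values are arbitrary placeholders rather than the paper's hyperparameters):

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character embedding layer + stacked BiLSTM producing character context vectors h_1:n."""
    def __init__(self, char_vocab_size, d_c=128, d_r=300, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(char_vocab_size, d_c)
        # bidirectional=True concatenates forward and backward hidden states (2 * d_r dims).
        self.bilstm = nn.LSTM(d_c, d_r, num_layers=num_layers,
                              batch_first=True, bidirectional=True)

    def forward(self, char_ids):             # char_ids: (batch, n)
        e = self.embed(char_ids)             # (batch, n, d_c)
        h, _ = self.bilstm(e)                # (batch, n, 2 * d_r)
        return h

# Example: a batch containing one 5-character sentence encoded as ids.
enc = CharEncoder(char_vocab_size=1000)
h = enc(torch.randint(0, 1000, (1, 5)))
print(h.shape)  # torch.Size([1, 5, 600])
```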

CRF Layer
A character context vector h_i is mapped into a |T|-dimensional vector representing scores of segmentation labels:

s_i = W^s h_i + b^s,

where W^s ∈ R^{|T|×2d_r} and b^s ∈ R^{|T|} are trainable parameters. Following previous sequence labeling work (Collobert, Weston, Bottou, Karlen, Kavukcuoglu, and Kuksa 2011), we introduce a CRF (Lafferty, McCallum, and Pereira 2001) layer, which has a transition matrix A ∈ R^{|T|×|T|} that gives transition scores between adjacent labels. Thus, the score of a label sequence y = y_1:n for a sentence x = x_1:n is calculated as follows:

score(x, y; θ) = Σ_{i=1}^{n} (A[y_{i−1}, y_i] + s_i[y_i]),

where θ denotes all parameters and s[y] indicates the dimension of a vector s corresponding to a label y. We can find the best label sequence y* by maximizing the sentence score:

y* = argmax_{y ∈ T^n} score(x, y; θ).
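A minimal sketch (ours) of the sentence score score(x, y; θ) = Σ_i (A[y_{i−1}, y_i] + s_i[y_i]) given emission scores from the previous layer; the explicit start-transition term is a common convention we assume here, not something stated in the paper.

```python
import torch

def sequence_score(emissions, transitions, start_transitions, labels):
    """
    emissions:         (n, |T|) label scores s_i for each character
    transitions:       (|T|, |T|) transition matrix A
    start_transitions: (|T|,) scores for the first label (one common convention)
    labels:            (n,) gold label indices y_1:n
    """
    score = start_transitions[labels[0]] + emissions[0, labels[0]]
    for i in range(1, len(labels)):
        score = score + transitions[labels[i - 1], labels[i]] + emissions[i, labels[i]]
    return score

# Toy example: 5 characters, tag set {B, I, E, S}.
n, T = 5, 4
print(sequence_score(torch.randn(n, T), torch.randn(T, T), torch.randn(T),
                     torch.tensor([0, 2, 3, 0, 2])))
```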

Training Objective
During training, parameters θ of the network are learned by minimizing the negative log likelihood over all sentences in the training data D with respect to θ:

L(θ) = − Σ_{(x,y)∈D} log p(y | x; θ),  where  p(y | x; θ) = exp(score(x, y; θ)) / Σ_{y'∈T^n} exp(score(x, y'; θ)).

Note that the probability of a label sequence can be calculated efficiently with dynamic programming, and the best label sequence can be found with the Viterbi algorithm.
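Continuing the previous sketch (again ours, not the authors' code), the following computes the negative log likelihood with the forward algorithm and decodes the best label sequence with the Viterbi algorithm:

```python
import torch

def crf_neg_log_likelihood(emissions, transitions, start_transitions, labels):
    """-log p(y | x) for one sentence: log partition (forward algorithm) minus gold score."""
    alpha = start_transitions + emissions[0]                       # (|T|,)
    for i in range(1, emissions.size(0)):
        # alpha[j] = logsumexp_k (alpha[k] + A[k, j]) + s_i[j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[i]
    log_partition = torch.logsumexp(alpha, dim=0)
    return log_partition - sequence_score(emissions, transitions, start_transitions, labels)

def viterbi_decode(emissions, transitions, start_transitions):
    """Best label sequence y* = argmax_y score(x, y)."""
    score = start_transitions + emissions[0]
    backpointers = []
    for i in range(1, emissions.size(0)):
        total = score.unsqueeze(1) + transitions       # total[k, j] = score[k] + A[k, j]
        backpointers.append(total.argmax(dim=0))       # best previous label for each current label
        score = total.max(dim=0).values + emissions[i]
    path = [int(score.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return list(reversed(path))
```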

Proposed Model
To disambiguate word boundaries more effectively than earlier models, we integrate word information into the character-based framework. More specifically, we transform embeddings of multiple candidate words for each character into a fixed-size word vector, which we call a word summary vector, by a word feature composition function (see Fig. 2 for our model's architecture).
In addition to the baseline model's layers, the model comprises a word embedding layer and a word feature composition function.

Word Embedding Layer
Given a character sequence x = x_1:n, we search a word vocabulary V_w for all words, up to a maximum word length, that correspond to subsequences of the input sequence. We then obtain a set W_x = {w_1, · · · , w_m} of all candidate words. For example, for the sentence x_1:5 in Fig. 1, we obtain the candidate word set W_x = {w_1, · · · , w_8}.
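A small sketch (ours) of candidate word retrieval: enumerate all substrings of the sentence up to a maximum word length and keep those found in the vocabulary V_w. The sample vocabulary and sentence are purely illustrative.

```python
def candidate_words(sentence, vocab, max_len=4):
    """Return the set W_x of vocabulary words appearing as substrings of the sentence."""
    found = set()
    n = len(sentence)
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            sub = sentence[i:i + k]      # substring of length k starting at position i
            if sub in vocab:
                found.add(sub)
    return found

# Hypothetical example: vocabulary and sentence are toy inputs.
vocab = {"外", "国", "外国", "国人", "外国人", "参政", "権", "参政権"}
print(candidate_words("外国人参政権", vocab, max_len=4))
```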

Composition Functions of Word Features
For a character x_i, a composition function aggregates the embeddings of all candidate words that contain the character into a word summary vector a_i. We introduce two attention-based composition functions, weighted average (WAVG) and weighted concatenation (WCON), that enable the model to pay more or less attention to candidate words according to their importance.
Both functions calculate an importance score u_ij from character x_i to word w_j in W_x through a bilinear transformation that captures the interaction between the character context vector h_i and the word embedding e^w_j. The scores are then normalized by a softmax operation to obtain the weight α_ij ∈ [0, 1]:

u_ij = h_i^⊤ W^a e^w_j, (3)
α_ij = δ_ij exp(u_ij) / Σ_{k=1}^{m} δ_ik exp(u_ik), (4)

where W^a ∈ R^{2d_r×d_w} is a trainable parameter and δ_ij ∈ {0, 1} is an indicator variable, introduced to simplify the equations, that indicates whether the character x_i is included in the word w_j (Fig. 1).
Next, WAVG and WCON calculate a word summary vector a_i as the weighted average and the weighted concatenation of word embeddings, respectively:

a_i = Σ_{j=1}^{m} α_ij e^w_j, (5)
a_i = ⊕_{l=1}^{L} α_{i,i_l} e^w_{i_l}, (6)

where {w_j}_{j=1}^{m} = W_x and ⊕(·) indicates the concatenation of the given arguments. Let K be the maximum word length and L = Σ_{k=1}^{K} k; then i_l for the character x_i denotes the index of the candidate word at the l-th of the L possible word positions containing x_i. We also use two more variants of the composition functions without the attention mechanism, the average function (AVG) and the concatenation function (CON). AVG is a special case of WAVG, where α_ij = δ_ij / Σ_k δ_ik for all (i, j) in Eq. (5). CON is equivalent to the word features used in Wang and Xu (2017) and is a special case of WCON, where α_{i,i_l} = 1 for all (i, i_l) in Eq. (6).
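The following sketch (ours) illustrates the attention-based composition for a single character: the bilinear scores of Eq. (3), the candidate-restricted softmax of Eq. (4), and the WAVG / WCON summaries of Eqs. (5) and (6). The WCON branch simply concatenates the weighted embeddings of the words containing x_i; the paper's fixed slot layout with zero vectors for missing words is simplified away here.

```python
import torch

def compose(h_i, word_embs, delta_i, W_a, mode="wavg"):
    """
    h_i:       (2*d_r,)     character context vector
    word_embs: (m, d_w)     embeddings of all candidate words w_1..w_m in the sentence
    delta_i:   (m,)         1.0 if word j contains character x_i, else 0.0
    W_a:       (2*d_r, d_w) bilinear attention parameter
    """
    u = word_embs @ (W_a.t() @ h_i)                    # bilinear scores u_ij, shape (m,)
    u = u.masked_fill(delta_i == 0, float("-inf"))     # restrict to words containing x_i
    alpha = torch.softmax(u, dim=0)                    # weights alpha_ij
    if mode == "wavg":                                 # weighted average (Eq. 5)
        return alpha @ word_embs
    # Simplified WCON (Eq. 6): concatenate weighted embeddings of the candidates containing x_i.
    slots = [alpha[j] * word_embs[j] for j in range(len(word_embs)) if delta_i[j] > 0]
    return torch.cat(slots)

d_r, d_w, m = 300, 300, 8
a_i = compose(torch.randn(2 * d_r), torch.randn(m, d_w),
              torch.tensor([0., 1., 1., 0., 1., 0., 0., 1.]),
              torch.randn(2 * d_r, d_w), mode="wavg")
print(a_i.shape)  # torch.Size([300])
```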
Note that our importance score function in Eq. (3) has the same form as the bilinear variant of the global attention model in Luong et al. (2015), used to calculate alignments between source and target hidden vectors in machine translation. They further evaluated an input-feeding approach that uses previous alignment information in subsequent time steps. A similar approach that takes into account the words attended to by previous characters might also be useful for word segmentation; we leave this for future work.

Settings
Datasets  Using three Japanese datasets, we evaluated our model in both in-domain and cross-domain settings. The first is the Balanced Corpus of Contemporary Written Japanese (BCCWJ); the other two datasets are referred to as JDC and JMC below.

Word vocabulary construction  Apart from the given training and development sets for each dataset, we assumed that no annotated information, including external dictionaries and third-party segmenters, was available in our experiments. Therefore, we used the training set and large unlabeled texts to obtain a word vocabulary for our proposed model.
First, we trained a baseline model from each training set and applied it to unlabeled texts.
Then, we regarded the union of auto-segmented words from the texts and gold words from the training set as the word vocabulary. From the auto-segmented texts, we discarded words occurring fewer times than the minimum word frequency threshold f of five, the default value in the Word2Vec toolkit used for pre-training word embeddings, as described later in this subsection. We used the non-core section of BCCWJ (BCCWJ-NC), which contains about 5.9 million sentences, as common unlabeled texts for the three datasets.
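A sketch (ours) of the vocabulary construction described above: count words in the auto-segmented texts, discard those below the minimum frequency threshold f, and take the union with the gold words from the training set. The inputs are toy placeholders.

```python
from collections import Counter

def build_vocab(auto_segmented_sents, gold_words, min_freq=5):
    """
    auto_segmented_sents: iterable of word lists produced by the baseline segmenter
    gold_words:           set of words observed in the annotated training set
    """
    counts = Counter(w for sent in auto_segmented_sents for w in sent)
    auto_vocab = {w for w, c in counts.items() if c >= min_freq}
    return auto_vocab | set(gold_words)    # V_auto ∪ V_train

# Hypothetical toy inputs for illustration only.
auto = [["外国", "人", "参政", "権"]] * 5 + [["珍", "語"]]
print(build_vocab(auto, {"参政権"}, min_freq=5))
```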
In the implementation of our proposed model, a word dictionary manages all words in a given word vocabulary. In the training phase, however, the model holds embedding parameters only for the subset of those words that correspond to gold words or substrings in the training sentences.
In the test phase, the model searches the dictionary for substrings of the test sentences and dynamically loads their word embeddings from the external word embedding model used for initialization. This strategy reduces the model size while handling hundreds of thousands of words in the dictionary, as shown later in §5.3.2.
Pre-training of embedding parameters  Following previous work (Collobert et al. 2011), we pre-trained word embeddings from large texts and used them to initialize the word embedding matrix in our proposed segmenter. To pre-train word embeddings, we applied the gensim (Řehůřek and Sojka 2010) implementation of Word2Vec (Mikolov, Chen, Corrado, and Dean 2013) to the same texts as those used to construct the word vocabularies, i.e., BCCWJ-NC sentences auto-segmented by the baseline segmenters. We used the toolkit with the skip-gram algorithm, an embedding size of 300, a single training iteration, and the other default parameters, including a minimum frequency of 5. For words occurring only in a training set, we randomly initialized their embeddings. We fine-tuned all word embeddings during training of the proposed segmenter.
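A sketch (ours) of the skip-gram pre-training step with gensim. The keyword for the embedding size is vector_size in gensim 4.x (size in the older 3.x API), the output path is hypothetical, and the toy corpus stands in for the auto-segmented BCCWJ-NC sentences.

```python
from gensim.models import Word2Vec

# Toy stand-in for the auto-segmented BCCWJ-NC sentences (lists of words).
auto_segmented_sents = [["外国", "人", "参政", "権"]] * 10

model = Word2Vec(
    sentences=auto_segmented_sents,
    vector_size=300,   # embedding size used in the paper
    sg=1,              # skip-gram algorithm
    min_count=5,       # minimum word frequency (default)
    epochs=1,          # a single iteration over the corpus
    # remaining hyperparameters left at gensim defaults
)
model.save("w2v_bccwj_nc.model")   # hypothetical output path
```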
In contrast, we randomly initialized all character embeddings, since pre-trained character embeddings did not improve performance in our preliminary experiments. Table 2 lists the proposed model's hyperparameters. We set the maximum word length K to 4 because this value covered 99% of the words in the BCCWJ and JDC development sets and because larger values did not further improve performance in our preliminary experiments. The same dropout strategy as in Zaremba, Sutskever, and Vinyals (2015) was applied to the non-recurrent connections of the recurrent layers. In addition, we used word vector dropout, which randomly replaces a word embedding with a zero vector when calculating a word summary vector in Eq. (5) or (6). We used mini-batch stochastic gradient descent to optimize the parameters and decayed the learning rate with a fixed decay rate every epoch after the first five; we trained models for up to 20 epochs and selected the best model on the development set.
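A sketch (ours) of the word vector dropout described above: during training, each candidate word embedding is independently replaced with a zero vector with some probability before the summary vector is composed. The dropout rate and the absence of rescaling are our assumptions, since the exact setting is given only in Table 2.

```python
import torch

def word_vector_dropout(word_embs, p=0.2, training=True):
    """Randomly zero out entire word embeddings (rows) with probability p."""
    if not training or p == 0.0:
        return word_embs
    keep = (torch.rand(word_embs.size(0), 1) >= p).to(word_embs.dtype)  # (m, 1) row mask
    # NOTE: no 1/(1-p) rescaling here; whether the paper rescales is not stated (assumption).
    return word_embs * keep

embs = torch.randn(8, 300)                 # embeddings of 8 candidate words
print(word_vector_dropout(embs, p=0.2))    # some rows become all zeros
```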

Comparison of proposed model variants
We evaluated our baseline and proposed model variants on the three datasets. Second, the attention-based variants achieved performance equivalent to or better than that of their counterparts without attention. According to McNemar's tests, the improvement of WAVG over AVG on the JDC news set and that of WCON over CON on the JMC Web set were statistically significant. Third, the concatenation-based variants performed better than their average-based counterparts in all cases, probably because CON and WCON retain word length and character position information. For example, the (d_w+1)-th to 2d_w-th dimensions of a summary vector always represent a word of length two ending with the target character (namely x_{i−1:i} for x_i), whereas AVG and WAVG lose such information.
In Table 4, we show F1 scores and out-of-vocabulary (OOV) recalls, and in Table 5, the OOV rates of the datasets. Although this domain in particular has many OOV words that are not in the training vocabulary V_train, a large portion (77%) were covered by the auto-segmented word vocabulary V_auto (with f = 5 and K = 4), and the model also greatly improved OOV recall. Thus, our model exploited word information not covered by the training data while effectively reducing the number of unknown words. ("Random" and "pre-trained" indicate models initialized with randomly initialized word embeddings and with pre-trained word embeddings, respectively.)
As shown in Table 4, we also compared our model with other Japanese word segmenters that use no additional annotated data: the popular statistical segmenter KyTea (Neubig et al. 2011) and a recent LSTM-based model by Kitagawa and Komachi (2018). Our model achieved better performance than both in all domains, presumably because it makes effective use of candidate word information, whereas neither of the other segmenters uses direct word information beyond word indicator features.

Detailed Evaluation
In this section, we provide a detailed evaluation and analysis of the proposed model. We use the same experimental settings as in §5.1 and report the mean score of three runs for each model unless otherwise specified.

Effect of semi-supervised learning
Our proposed model is a semi-supervised learning method that uses unlabeled texts for pre-training the word embedding parameters. To investigate the contribution of the unlabeled texts, we evaluated both a purely supervised version of the proposed model, which starts from randomly initialized word embedding parameters, and a semi-supervised version of the baseline model, which additionally uses auto-segmented texts through self-training.
Because the task predicts segmentation labels from character representations combined with word vectors, learning meaningful word representations from the segmentation objective alone is difficult. Rather than using an external method such as the skip-gram algorithm, another effective approach might be to train the model with an auxiliary task, such as word-level language modeling, alongside the segmentation task. Either way, considering the sparse distribution of words, large amounts of text are probably necessary. From the comparison between the baseline and proposed models in the semi-supervised setting, we observed only limited performance improvements through self-training, indicating that our proposed model utilizes unlabeled texts more effectively.

Effect of word frequency and length thresholds
We analyzed how the WCON model's performance changed in various domains according to word vocabularies built with different minimum word frequency and maximum word length thresholds. For the minimum frequency threshold f, the model's vocabulary excludes words occurring fewer times than the threshold value in the auto-segmented texts. For the maximum length threshold K, the model ignores words whose length exceeds the threshold value. In other words, a smaller frequency threshold and a larger length threshold lead to a larger vocabulary.

Minimum word frequency threshold

Maximum word length threshold  Second, we fixed the frequency threshold to 5, varied the word length threshold K over {1, 2, · · · , 6}, and evaluated performance for each length of gold words in the evaluation sentences. We selected several test sets from the JDC and JMC data, namely source domain data and target domain data with higher OOV rates (Table 8). We also show the performance of the baseline as the model with "K = 0" for reference.
OOV rates for the model with the largest length threshold, K = 6, decreased by up to 7 points compared with the model with K = 1, and performance also varied greatly. For each length k of gold words, the model using words of that length (i.e., the model with K ≥ k) tended to outperform the model not using those words (i.e., the model with K < k), as highlighted by the gray background in Table 8; values in "()" in Table 8 denote the percentages of words of length k in the data. Moreover, the model with the larger threshold value often improved performance for shorter words as well, and therefore improved overall performance. These results suggest that information on words of a particular length is effectively used to disambiguate character sequences of the same or shorter length.
For each data domain, performance saturated at K = 5 for the news domain (where the rate of words of length k ≥ 6 is 0.35%) and the dining domain (0.6%). In contrast, better performance was obtained with K = 6 for the patent domain (0.9%) and the Web domain (1%).
Especially for domains with many long words, such as loanwords written in katakana, we can expect a model with a larger maximum word length to achieve robust segmentations.

Effect of attention for segmentation performance
To analyze how the attention mechanism affects segmentation performance, in Fig. 3, we show the WCON model's segmentation and attention accuracies for the BCCWJ development set. Segmentation accuracy indicates character-level accuracy of segmentation label prediction.
Attention accuracy is defined as the rate of characters that correctly attend to their gold words. The case in which almost perfect segmentation accuracy was achieved might be the one whose correct labels are the most easily identified owing to low ambiguity. In case (d), the model successfully paid attention at a rate of more than 93% and achieved much higher segmentation accuracy than in case (b).
In Fig. 3 (ii), we investigate detailed performance in case (d); we divided all character examples into intervals from [0, 0.1) to [0.9, 1.0] on the basis of the maximum value of the attention weights to (one of) the candidate words and evaluated both accuracies for each interval. (We omitted the intervals from [0, 0.1) to [0.3, 0.4) from the figure because they contained infrequent examples: only 21 examples in total, 0.1% of all, with segmentation accuracies close to 90%. As an example of how attention accuracy is computed, if one character attends to its gold word with a weight of 0.95 and another attends to an incorrect word with a weight of 0.95, then the attention accuracy for [0.9, 1.0] is 1/2.) As Fig. 3 illustrates, the distribution of the maximum weight α_{i,j*} was biased toward higher values; the case in which α_{i,j*} ≥ 0.9 accounted for about 89% of all cases. Both attention and segmentation accuracy improved as α_{i,j*} increased, and therefore this confidence score properly reflects the model's certainty of prediction. We obtained high segmentation accuracy (99.7%) in the most confident case, in which α_{i,j*} ≥ 0.9.
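A sketch (ours) of the binning analysis described above: characters are grouped by the maximum attention weight over their candidate words, and attention and segmentation accuracies are computed per interval. Array names and shapes are our own conventions.

```python
import numpy as np

def accuracy_by_max_weight(alpha, gold_word_idx, pred_labels, gold_labels, n_bins=10):
    """
    alpha:         (n_chars, m) attention weights over candidate words (non-candidates = 0)
    gold_word_idx: (n_chars,)   index of the gold word for each character
    pred_labels, gold_labels: (n_chars,) predicted / gold segmentation label ids
    Returns per-interval (attention accuracy, segmentation accuracy, count).
    """
    max_w = alpha.max(axis=1)
    attended = alpha.argmax(axis=1)
    results = {}
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # the last interval is closed: [0.9, 1.0]
        mask = (max_w >= lo) & ((max_w < hi) if b < n_bins - 1 else (max_w <= hi))
        if mask.sum() == 0:
            continue
        att_acc = (attended[mask] == gold_word_idx[mask]).mean()
        seg_acc = (pred_labels[mask] == gold_labels[mask]).mean()
        results[(lo, hi)] = (att_acc, seg_acc, int(mask.sum()))
    return results

# Toy usage with random data.
rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(5), size=100)
print(accuracy_by_max_weight(alpha, rng.integers(0, 5, 100),
                             rng.integers(0, 4, 100), rng.integers(0, 4, 100)))
```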
To examine whether a direct relationship exists between attention and segmentation accuracies, we controlled the correctness of attention by artificially changing the attention weight values of the trained model and evaluated segmentation accuracy for each "correct attention probability".
Specifically, on the basis of a correct attention probability threshold p_t ∈ [0, 1], a random variable p ∼ Uniform(0, 1), and the gold labels, we changed the weight values α_ij of the trained WCON model for a character x_i and its candidate words as follows: if p ≤ p_t, the weight is concentrated on the gold word w_g (with index g); otherwise, it is concentrated on a randomly chosen candidate word w_{j_c} other than the gold word; all remaining candidate words receive small weights. Here, m denotes the number of candidate words {w_j}_{j=1}^{m} for the character x_i, and L = Σ_{k=1}^{K} k = 10 indicates the maximum number of candidate words. Namely, given the threshold value p_t, the model (forcibly) pays correct attention with probability p_t while assigning small weights to the other candidate words. As shown in Fig. 3 (iii), segmentation accuracy monotonically improved as the correct attention probability increased, indicating that our model tends to adopt candidate word information emphasized by attention weights for segmentation decisions and that learning accurate attention to proper words leads to correct segmentations. Overall segmentation performance can therefore possibly be further improved by learning more accurate attention or by discarding words with low confidence.
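A sketch (ours) of the attention-forcing procedure: with probability p_t the weight mass goes to the gold word, otherwise to a randomly chosen non-gold candidate. Assigning exactly 1.0 to the chosen word and 0.0 elsewhere is our assumption; the paper only states that non-chosen candidates receive small weights.

```python
import random

def force_attention(num_candidates, gold_index, p_t):
    """Return forced attention weights over a character's candidate words."""
    weights = [0.0] * num_candidates            # small weights for all candidates (assumed 0 here)
    if random.random() <= p_t:
        chosen = gold_index                     # pay correct attention with probability p_t
    else:
        others = [j for j in range(num_candidates) if j != gold_index]
        chosen = random.choice(others) if others else gold_index  # random non-gold candidate
    weights[chosen] = 1.0
    return weights

# e.g., 4 candidate words, gold word at index 2, correct-attention probability 0.8
print(force_attention(4, 2, p_t=0.8))
```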

Effect of additional word embeddings from target domains
Aiming to improve cross-domain performance, we evaluated a simple method of enhancing our model with unlabeled texts from the target domains. Specifically, for each of the JDC and JMC datasets,
we merged the unlabeled texts of the source and all target domains, obtained auto-segmented texts by applying the same baseline segmenter described in §5.1, and then trained word embeddings on the auto-segmented texts with the Word2Vec toolkit. Finally, we trained the WCON model from scratch, initialized with the learned word embeddings. As unlabeled texts for the target domains, we used the resources in Table 9 in addition to the BCCWJ-NC texts used for the source domains in the previous experiments. For comparison, we also evaluated the baseline model enhanced with the same source and target unlabeled texts through self-training. (Methods denoted with "S" or "S&T" used unlabeled texts in the source domains or in both the source and target domains, respectively.) A promising direction for constructing a more reliable vocabulary is to combine annotated resources, such as a lexicon, with unlabeled texts in a target domain; we leave this for future work, however.

Segmentation examples
To examine segmentation results for actual sentences produced by the different methods, we picked sentence segments (a)-(l) from the JDC's target domain test sets and show them in Fig. 4.

Related Work
Word segmentation  For both Chinese and Japanese, word segmentation has traditionally been addressed by applying statistical learning algorithms, such as maximum entropy (Uchimoto, Sekine, and Isahara 2001; Xue 2003), CRFs (Kudo et al. 2004; Peng, Feng, and McCallum 2004; Zhao and Kit 2008), and logistic regression (Neubig et al. 2011). A more recent study (2017) proposed a gap-based model that predicts whether or not to segment two consecutive characters.
Recent works have utilized word information in a character-based framework. Using word boundary information from auto-segmented texts, for instance, Zhou, Yu, Zhang, Huang, Dai, and Chen (2017) pre-trained character embeddings. Wang and Xu (2017) explicitly introduced word information into their CNN-based model and concatenated the embeddings of a character and of the multiple words corresponding to n-grams (n ranging from 1 to 4) that include the target character.
Moreover, Yang et al. (2019) proposed a lattice LSTM model with subsequence (i.e., word or subword) information. Their model integrates information on a character and the word ending with that character into an LSTM cell vector for the character using a gate mechanism.
In Japanese word segmentation, popular approaches are based on statistical learning algorithms. For example, Kudo et al. (2004) used CRFs for Japanese morphological analysis that simultaneously predicts words and parts of speech (POS) by searching an optimal word and POS sequence over a word lattice that enumerates possible candidate sequences for a sentence. Neubig et al. (2011) used logistic regression for their pointwise segmentation method that makes an independent segmentation decision at each pair of characters in a sentence. Other recent work employed neural models. Morita et al. (2015) integrated an RNN language model into a statistical Japanese morphological analysis framework, while Kitagawa and Komachi (2018) applied a pure neural model based on LSTM to word segmentation treated as character-level sequence labeling.
Semi-supervised learning for word segmentation  Especially to improve performance on OOV words, semi-supervised learning with unlabeled data has been explored for word segmentation. Typical approaches include self-training (Liu and Zhang 2012), co-training (Zhang, Wang, Sun, and Mansur 2013), and statistical features such as accessor variety (Sun and Xu 2011) and frequent substrings (Shen, Kawahara, and Kurohashi 2016). As a common practice in recent neural models, large unlabeled texts have been used to pre-train character/subword/word embeddings (Zheng et al. 2013; Chen et al. 2015b; Zhang et al. 2016; Yang et al. 2019).
LSTM-CRF  LSTM-CRF is a popular neural architecture that has been applied to various tagging tasks, including word segmentation (Chen et al. 2015b), POS tagging, and NER (Huang et al. 2015; Ma and Hovy 2016; Rei, Crichton, and Pyysalo 2016). In contrast to our work, which introduces candidate word information for characters in a character-level labeling task, Ma and Hovy (2016) and Rei et al. (2016) introduced words' internal character information in word-level labeling tasks.

Attention mechanism  An attention mechanism (Bahdanau et al. 2015; Luong et al. 2015) was first introduced in machine translation to focus on appropriate parts of a source sentence during decoding. This mechanism has been widely applied to various NLP tasks, including question answering (Sukhbaatar, Szlam, Weston, and Fergus 2015), relation extraction (Lin, Shen, Liu, Luan, and Sun 2016), and natural language inference (Parikh, Täckström, Das, and Uszkoreit 2016). To determine the relative importance of a word itself and the word's internal characters, Rei et al. (2016) introduced a gate-like attention mechanism in their word-based sequence labeling model.

Conclusion
Aiming to better disambiguate word boundaries, we proposed a word segmentation model that integrates word-level information into a character-based framework. Experimental results show that our model with an attention-based composition function achieved better performance than the model variants without attention, as well as existing Japanese segmentation models, on Japanese datasets.
The main findings from our analysis are as follows. First, word information from auto-segmented texts alleviated the unknown word problem and contributed to robust performance in cross-domain segmentation. Second, the attention mechanism learned appropriate weights for words, leading to accurate segmentation. Third, owing to the learned attention weights, our model can generate intuitively interpretable segmentation results. In future work, we will explore a more robust method for texts from various domains by exploiting resources available in target domains, such as lexicons and unlabeled or partially labeled data.