Effectiveness of Syntactic Dependency Information for Higher-Order Syntactic Attention Network

Recently, as a replacement for syntactic tree-based approaches such as tree trimming, Long Short-Term Memory (LSTM)-based methods have commonly been used to compress sentences because LSTMs can generate fluent compressed sentences. However, the performance of these methods degrades significantly when compressing long sentences because they do not explicitly handle long-distance dependencies between words. To solve this problem, we proposed a higher-order syntactic attention network (HiSAN) that can handle higher-order dependency features as an attention distribution over LSTM hidden states. Furthermore, to avoid the influence of incorrect parse results, we trained HiSAN by maximizing the probability of a correct output together with the attention distribution. Experiments on the Google sentence compression dataset show that our method improved performance over the baselines in terms of F1 as well as ROUGE-1, -2, and -L scores. In subjective evaluations, HiSAN outperformed the baseline methods in both readability and informativeness. In addition, in this study we investigated the performance of HiSAN trained without any syntactic dependency tree information. The results of this investigation show that HiSAN can compress sentences without relying on any syntactic dependency information while maintaining accurate compression rates, and they also show the effectiveness of syntactic dependency information in compressing long sentences with higher F1 scores.

input and output sides.
To deal with sentences that have deep dependency trees, we focused on chains of dependency relationships. Fig. 5 shows an example of a compressed sentence with its dependency tree. The topic of this sentence is an import agreement related to electricity. Thus, to generate an informative compression, the compressed sentence must retain the country names; in this example, it should retain the phrase "from Kyrgyz Republic and Tajikistan".
Consequently, the compressed sentence must also retain the dependency chain comprising the words "import", "resolution", and "signed", because the phrase is a child of this chain. By considering such higher-order dependency chains, the system can produce informative compressions. The example in Fig. 5 demonstrates that tracking a higher-order dependency chain for each word helps in compressing long sentences. In this paper, we refer to such dependency relationships as "d-length dependency chains".
To handle d-length dependency chains in sentence compression with LSTMs, we previously proposed the higher-order syntactic attention network (HiSAN) (Kamigaito et al. 2018). HiSAN computes the deletion probability of a given word based on the d-length dependency chain starting from that word. The d-length dependency chain is represented as an attention distribution learned from automatic parse trees. The attention distribution of HiSAN is calculated from both the input and the compressed sentences. To alleviate the influence of parse errors in automatic parse trees, we learned the attention distribution jointly with the output labels. Evaluation results on the Google sentence compression dataset (Filippova and Altun 2013) demonstrated that HiSAN improved sentence compression performance over the baselines in terms of F1, ROUGE-1, -2, and -L scores. In particular, HiSAN attained remarkable compression performance on long sentences. HiSAN outperformed the baseline methods even in human evaluations.
Although the attention module of HiSAN can learn context information without relying on external syntactic dependency information, our preliminary study (Kamigaito et al. 2018) evaluated HiSAN only in a setting where the syntactic dependency information was utilized in a supervised manner. Thus, it remains uncertain whether the network structure or the syntactic dependency information contributed to the improvement in performance. Identifying this point will help in constructing models for more accurate sentence compression. In this study, we additionally investigated the performance of HiSAN trained without any syntactic dependency tree information. Our investigation indicates that HiSAN can compress sentences with high accuracy without relying on any syntactic dependency information. It also shows the effectiveness of syntactic dependency information in compressing long sentences, for which it yields higher F1 scores while maintaining accurate compression rates.
The remainder of this paper is structured as follows: Section 2 introduces the baseline Seq2Seq method used for constructing HiSAN; Section 3 describes the concept of the d-length dependency chain and the HiSAN network structure; Section 4 compares the baselines and the variants of HiSAN using automatic and human evaluations; Section 5 analyzes the evaluation results with actual generated sentences and attention distributions; Section 6 reviews related work; Section 7 concludes the paper.

Baseline Seq2Seq Method
Sentence compression can be regarded as a tagging task on a given sequence of input tokens x = (x_0, x_1, ..., x_n), where x_0 represents the root node, in which the system assigns one of three label types ("keep", "delete", or "end of sentence") as the output label y_t to each input token. LSTM-based approaches for sentence compression are mostly based on either a bidirectional-LSTM (bi-LSTM)-based tagging method (Tagger) (Klerke et al. 2016; Wang et al. 2017; Chen and Pan 2017) or Seq2Seq (Filippova et al. 2015; Tran et al. 2016). Tagger independently predicts the labels in a point-estimation manner, whereas Seq2Seq predicts them by considering previously predicted labels. Because Seq2Seq is more expressive than Tagger, we developed HiSAN on top of the baseline Seq2Seq model.
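As a toy illustration of this tagging view (the tokens and labels below are ours, not taken from the dataset), keeping the tokens labeled "keep" yields the compression:

```python
# Toy illustration of sentence compression as per-token tagging.
# A third "end of sentence" label closes the output sequence in the actual task.
tokens = ["ROOT", "Pakistan", "signed", "a", "resolution", "on", "Monday"]
labels = ["keep", "keep", "keep", "keep", "keep", "delete", "delete"]

compressed = [t for t, y in zip(tokens, labels) if y == "keep" and t != "ROOT"]
print(" ".join(compressed))  # -> Pakistan signed a resolution
```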
Our baseline Seq2Seq is a variant of the model proposed by Filippova et al. (2015), to which we added a bi-LSTM, an input-feeding approach (Luong et al. 2015), and a monotonic hard attention method (Yao and Zweig 2015; Tran et al. 2016). As described in the evaluation section, this baseline achieved comparable or even better scores than those reported for the state-of-the-art method of Filippova et al. (2015). Our baseline Seq2Seq model comprises the embedding, encoder, decoder, and output layers.
The embedding layer converts the input tokens x into embeddings e. As reported in Wang et al. (2017), syntactic features are important for learning generalizable embeddings for sentence compression. Following their results, we also introduced syntactic features into the embedding layer.
Specifically, we combined the surface token embedding w_i, the Part-Of-Speech (POS) tag embedding p_i, and the dependency relation label embedding r_i into a single vector, e_i = [w_i; p_i; r_i] (Eq. (1)), where [;] denotes vector concatenation and e_i is the embedding of token x_i.
The encoder layer converts the embeddings e into a sequence of hidden states h = (h_0, ..., h_n) using a stacked bi-LSTM, where LSTM_θ^→ and LSTM_θ^← represent the forward and backward LSTM functions, respectively. The final state of the backward LSTM, h_0^←, is inherited by the decoder as its initial state.
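The following is a minimal sketch of the embedding and encoder layers. It is our own illustration: the use of PyTorch, the layer names, and the POS/dependency-label vocabulary sizes are assumptions (the paper's implementation is in C++ on DyNet); the stated dimensions follow the training details reported later.

```python
import torch
import torch.nn as nn

W_DIM, P_DIM, R_DIM, HID = 100, 40, 40, 100   # dimensions reported in the training details

word_emb = nn.Embedding(23168, W_DIM)  # filtered vocabulary size reported in the paper
pos_emb = nn.Embedding(50, P_DIM)      # assumed POS-tag inventory size
rel_emb = nn.Embedding(50, R_DIM)      # assumed dependency-label inventory size

def embed(word_ids, pos_ids, rel_ids):
    # Eq. (1): e_i = [w_i; p_i; r_i] (vector concatenation)
    return torch.cat([word_emb(word_ids), pos_emb(pos_ids), rel_emb(rel_ids)], dim=-1)

# Two-layer (stacked) bi-LSTM encoder producing hidden states h_0 ... h_n
encoder = nn.LSTM(W_DIM + P_DIM + R_DIM, HID, num_layers=2,
                  bidirectional=True, batch_first=True)

ids = torch.zeros(1, 8, dtype=torch.long)   # dummy token ids for one 8-token sentence
h, _ = encoder(embed(ids, ids, ids))        # h: (1, 8, 2 * HID)
```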
In the decoder layer, the concatenation of a three-bit one-hot vector determined by the previously predicted label y_{t-1}, the previous final hidden state d_{t-1} (explained later), and the input embedding of x_t is encoded into the decoder hidden state s_t^→ using stacked forward LSTMs.
Contrary to the original softmax attention method, we can deterministically focus on one encoder hidden state h_t (Yao and Zweig 2015) to predict y_t in the sentence compression task (Tran et al. 2016). In the output layer, the label probability is calculated with a softmax, where W_o is the weight matrix of the softmax layer and δ_{y_t} is a binary vector whose y_t-th element is set to 1 and whose other elements are set to 0.
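A rough numpy sketch of this output layer with monotonic hard attention follows. The exact composition of the attentional state d_t is not given in the extracted text, so the tanh composition of the decoder state s_t and the hard-attended encoder state h_t below is an assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def label_probability(W_o, W_d, s_t, h_t, y_t):
    """P(y_t | y_<t, x) with hard attention on the t-th encoder state h_t (sketch)."""
    d_t = np.tanh(W_d @ np.concatenate([s_t, h_t]))  # assumed form of the attentional state d_t
    p = softmax(W_o @ d_t)                           # distribution over {keep, delete, eos}
    delta = np.zeros_like(p)
    delta[y_t] = 1.0                                 # binary vector delta_{y_t}
    return float(p @ delta)

# Dummy usage with random parameters; dimensions are illustrative only.
rng = np.random.default_rng(0)
s_t, h_t = rng.normal(size=100), rng.normal(size=200)
W_d, W_o = rng.normal(size=(100, 300)), rng.normal(size=(3, 100))
print(label_probability(W_o, W_d, s_t, h_t, y_t=0))
```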

Higher-order Syntactic Attention Network
The key component of HiSAN is its attention module. Unlike the baseline Seq2Seq, HiSAN employs a packed d-length dependency chain as distributions in the attention module. Section 3.1 explains the packed d-length dependency chain, Section 3.2 describes the network structure of our attention module, and Section 3.3 explains the learning method of HiSAN.

Packed d-length Dependency Chain
The probability of a packed d-length dependency chain is obtained from a dependency graph, which is an edge-factored dependency score matrix (Hashimoto and Tsuruoka 2017; Zhang et al. 2017). First, we explain the dependency graph. Fig. 6(a) shows an example of the dependency graph used to explain the parent attention module. HiSAN represents the dependency graph as an attention distribution generated by the attention module. A probability for each dependency edge is obtained from the attention distribution.

Network Architecture
(1) Parent Attention module calculates P_parent(x_j | x_t, x), the probability of x_j being the parent of x_t, using h_j and h_t. This probability is calculated for all pairs of x_j and x_t. The arcs in Fig. 7 show the most probable dependency parent for each child token.
(2) Recursive Attention module calculates α_{d,t,j}, the probability of x_j being the d-th order parent (d denotes the chain length) of x_t, by recursively using P_parent(x_j | x_t, x). α_{d,t,j} is also treated as an attention distribution and is used to calculate γ_{d,t}, the weighted sum of h for each length d. For example, the three-length dependency chain of word x_7 with the highest probability is x_6 - x_5 - x_2. The encoder hidden states h_6, h_5, and h_2, which correspond to this dependency chain, are weighted by the calculated parent probabilities α_{1,7,6}, α_{2,7,5}, and α_{3,7,2}, respectively, and then fed to the selective attention module.
(3) Selective Attention module calculates the weight β_{d,t} for each γ_{d,t} from its chain length d ∈ d, where d represents the group of chain lengths. β_{d,t} is calculated from the encoder and decoder hidden states. Each β_{d,t} · γ_{d,t} is then summed into Ω_t, the output of the selective attention module.
(4) Finally, the calculated Ω_t is concatenated and input into the output layer.
Each module is explained in detail in the following subsections. Zhang et al. (2017) formalized dependency parsing as the problem of independently selecting the parent of each word in a sentence; they produced a distribution over the possible parents of each child word by applying an attention layer over bi-LSTM hidden states.

Parent Attention Module
In a dependency tree, a parent may have more than one child, whereas each token has exactly one parent. Under this constraint, dependency parsing is represented as follows. Given a sentence S = (x_0, x_1, ..., x_n), the parent of x_j is selected from S \ {x_j} for each token x_j ∈ S \ {x_0}. Notably, x_0 denotes the root node. The probability of token x_j being the parent of token x_t in sentence x is calculated by normalizing the scores g(h_j, h_t) over all candidate parents, where j′ ranges over all possible indices of a word in S, v_a is a weight vector, and U_a and W_a are the weight matrices of g.
Different from the attention-based dependency parser, P_parent(x_j | x_t, x) is jointly learned with the output label probability P(y | x) in the training phase. The training details are explained in Section 3.3.
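The following numpy sketch shows one plausible realization of the parent attention module. The additive scoring form g(h_j, h_t) = v_a · tanh(U_a h_j + W_a h_t) is our assumption based on the named parameters v_a, U_a, and W_a, since the equation itself did not survive extraction.

```python
import numpy as np

def parent_attention(H, v_a, U_a, W_a):
    """P_parent(x_j | x_t, x) for all pairs (t, j) (sketch).

    H: (n + 1, dim) encoder hidden states h_0 ... h_n (index 0 is the root).
    Returns A with A[t, j] = probability that x_j is the parent of x_t,
    normalized over all candidate indices j' with a softmax.
    """
    n1 = H.shape[0]
    A = np.zeros((n1, n1))
    for t in range(n1):
        scores = np.array([v_a @ np.tanh(U_a @ H[j] + W_a @ H[t]) for j in range(n1)])
        scores -= scores.max()
        A[t] = np.exp(scores) / np.exp(scores).sum()
    return A
```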

Recursive Attention Module
The recursive attention module recursively calculates α_{d,t,j}, the probability of x_j being the d-th order parent of x_t, as given in Eq. (9). Furthermore, in a dependency parse tree, the root should not have any parent, and a token should not depend on itself. To satisfy these rules, we impose the constraints of Eq. (10) on α_{1,t,j}. The first and second lines of Eq. (10) represent the case where the parent of root is also root.
These constraints imply that root does not have any parent. The third line of Eq. (10) prevents a token from depending on itself. Because the first line of Eq. (9) is a matrix multiplication, Eq. (9) can be computed efficiently on both CPUs and GPUs. By recursively using a single attention distribution, it is no longer necessary to prepare additional attention distributions for each order when computing the probability of higher-order parents. Furthermore, because multiple attention distributions need not be learned, no hyper-parameters are required to adjust the weight of each distribution during training.
Finally, this method avoids the problem of sparse higher-order dependency relations in the training dataset.
The α_{d,t,j} calculated above is used to weight the bi-LSTM hidden states h, yielding the weighted sum γ_{d,t}. Notably, γ_{d,t} is passed on to the selective attention module, as explained in the next section.
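A numpy sketch of the recursive attention module under our reading of Eq. (9), the constraints of Eq. (10), and the weighted sum γ_{d,t} follows. The exact equations are not in the extracted text, so this reconstruction is hedged: the d-th order distribution is obtained by repeated multiplication with the first-order attention matrix, and the constraints are applied beforehand.

```python
import numpy as np

def apply_root_constraints(A):
    """Eq. (10) constraints as we read them: the parent of the root is the root
    itself (so the root effectively has no parent), and no other token may
    select itself as its parent."""
    A = A.copy()
    A[0, :] = 0.0
    A[0, 0] = 1.0
    np.fill_diagonal(A[1:, 1:], 0.0)
    A[1:] /= A[1:].sum(axis=1, keepdims=True)   # renormalize the remaining mass
    return A

def recursive_attention(A1, H, chain_lengths):
    """alpha_{d,t,j} and gamma_{d,t} for each chain length d (sketch).

    A1: (n + 1, n + 1) first-order parent attention matrix (alpha_1).
    H:  (n + 1, dim) encoder hidden states.
    Our reading of Eq. (9): alpha_d = alpha_{d-1} @ alpha_1, i.e. one matrix
    multiplication per additional hop along the dependency chain.
    """
    A1 = apply_root_constraints(A1)
    alphas = {1: A1}
    for d in range(2, max(chain_lengths) + 1):
        alphas[d] = alphas[d - 1] @ A1
    gammas = {d: alphas[d] @ H for d in chain_lengths}  # gamma_{d,t} = sum_j alpha_{d,t,j} h_j
    return alphas, gammas
```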

Selective Attention Module
To select suitable dependency orders for the input sentence, the selective attention module weights and sums the hidden states γ_{d,t} into Ω_t using the weighting parameter β_{d,t}, according to the current context, where W_c is the weight matrix of the softmax layer, d is the group of chain lengths, c_t is a vector representing the current context, γ_{0,t} is a zero vector, and β_{0,t} indicates the weight used when the method does not use the dependency features. The context vector c_t is calculated from the current encoder and decoder hidden states. The calculated Ω_t is concatenated and input into the output layer; in detail, d_t in Eq. (5) is also fed to the input of the decoder LSTM at time step t + 1 (input feeding). It should be noted that considering a higher-order dependency structure helps in dealing with long input sentences. However, when the number of order types increases, estimating their importance with selective attention becomes difficult. Thus, an appropriate combination of orders is required.
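A numpy sketch of the selective attention module under our reading of the text follows; the scoring form (a softmax over scores computed from the concatenation of c_t and each γ_{d,t}) is an assumption, since only the parameter name W_c and its inputs are given.

```python
import numpy as np

def selective_attention(gammas, c_t, W_c):
    """Omega_t = sum_{d in d} beta_{d,t} * gamma_{d,t} (sketch).

    gammas: dict mapping chain length d -> gamma_{d,t}; d = 0 maps to a zero
            vector, i.e. the option of using no dependency features.
    c_t:    current-context vector built from encoder and decoder states.
    W_c:    weight vector scoring each [c_t; gamma_{d,t}] (assumed shape).
    """
    ds = sorted(gammas.keys())
    scores = np.array([W_c @ np.concatenate([c_t, gammas[d]]) for d in ds])
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()                                  # softmax over chain lengths
    omega_t = sum(b * gammas[d] for b, d in zip(beta, ds))
    return omega_t, dict(zip(ds, beta))

# Example: chain lengths d = {1, 2, 4} plus the d = 0 "no syntax" option
dim = 200
rng = np.random.default_rng(0)
gammas = {0: np.zeros(dim), 1: rng.normal(size=dim),
          2: rng.normal(size=dim), 4: rng.normal(size=dim)}
omega, beta = selective_attention(gammas, c_t=rng.normal(size=dim),
                                  W_c=rng.normal(size=2 * dim))
```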

Objective Function
To alleviate the influence of parse errors, we jointly update the first-order attention distribution α_{1,t,j} and the label probability P(y|x) (Kamigaito et al. 2017). The first-order attention distribution is learned from dependency parse trees. If a_{t,j} = 1 denotes an edge between parent word w_j and child word w_t on a dependency tree (and a_{t,j} = 0 denotes that w_j is not the parent of w_t), the objective function of our method can be defined as a weighted combination of the label and attention log-likelihoods, where λ is a hyper-parameter that controls the importance of the output labels and the parse trees in the training dataset. To investigate the effectiveness of the information from syntactic dependency trees, we set λ = 1.0 for the with-syntax (w/ syn) setting and λ = 0.0 for the without-syntax (w/o syn) setting.
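A small sketch of the joint objective as we read it (the equation itself is missing from the extracted text): the negative log-likelihood of the gold labels plus λ times the negative log-likelihood of the gold dependency arcs under the first-order attention α_1.

```python
import numpy as np

def hisan_loss(gold_label_probs, alpha1, gold_parent, lam):
    """Joint training loss for one sentence (our hedged reconstruction).

    gold_label_probs: P(y_t | y_<t, x) for each gold label y_t.
    alpha1:           (n + 1, n + 1) first-order attention matrix alpha_{1,t,j}.
    gold_parent:      dict mapping child index t -> parent index j with a_{t,j} = 1.
    lam:              lambda = 1.0 for the "w/ syn" setting, 0.0 for "w/o syn".
    """
    label_term = -sum(np.log(p) for p in gold_label_probs)
    tree_term = -sum(np.log(alpha1[t, j]) for t, j in gold_parent.items())
    return label_term + lam * tree_term
```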

Dataset
For evaluation, we used the Google sentence compression dataset (Filippova and Altun 2013), available at https://github.com/google-research-datasets/sentence-compression. This dataset contains the compression labels, part-of-speech (POS) tags, dependency parents, and dependency relation labels of each sentence. We used the first and last 1,000 sentences of comp-data.eval.json as our test and development datasets, respectively.
Notably, our test dataset is compatible with that used in previous studies (Filippova et al. 2015; Tran et al. 2016; Klerke et al. 2016; Wang et al. 2017). In this study, we trained the baselines and HiSAN on all the sentences in the files sent-comp.train*.json (200,000 sentences in total). Note that Filippova et al. (2015) used 2,000,000 sentences for training their method, but those datasets are not publicly available. In addition, because the large training dataset lacks periods at the end of compressed sentences, we added periods to the end of compressed sentences in the large training dataset to unify the form of compressed sentences in the small and large settings. In our experiments, we replaced rare words that appear fewer than 10 times in our training dataset with a special token ⟨UNK⟩. The resulting filtered input vocabulary size was 23,168.

Comparison Methods
For a fair comparison with HiSAN, we used the input features described in Eq. (1) in the following baseline methods:
• Tagger: This method regards sentence compression as a tagging task based on bi-LSTM (Klerke et al. 2016; Wang et al. 2017).
• Tagger+ILP: This is an extension of Tagger that integrates integer linear programming (ILP)-based dependency tree trimming (Wang et al. 2017). Here, we set the positive parameter λ to 0.2.
• Bi-LSTM: This method, proposed by Filippova et al. (2015), regards sentence compression as a Seq2Seq translation task. For a fair comparison, we replaced their one-directional LSTM with the more expressive bi-LSTM in the encoder part. The initial state of the decoder was set to the sum of the final states of the forward and backward LSTMs.
• Bi-LSTM-Dep: This is an extension of Bi-LSTM that exploits the features obtained from a dependency tree (named LSTM-PAR-PRES in Filippova et al. (2015)). Following their work, we fed the word embedding and predicted label of the dependency parent word to the current decoder input of Bi-LSTM.
• Base: This is our baseline Seq2Seq method described in Section 2.
• Attn: This method extends the softmax-based attention method (Luong et al. 2015). We replaced h_t in Eq. (6) with the weighted sum calculated with the commonly used concat attention (Luong et al. 2015).
• HiSAN w/ syn: As explained in Section 3.3, this setting considers syntactic dependency information for training HiSAN by setting λ to 1.0.
• HiSAN w/o syn: Different from HiSAN w/ syn, this setting does not consider syntactic dependency information for training HiSAN. λ is set to 0.0 for excluding any influence of syntactic dependency information.
We list the number of weight parameters of the comparison methods in Table 1.

Training Details
Following the previous work (Wang et al. 2017), the dimensions of the word embeddings, LSTM layers, and attention layer were set to 100. For the Tagger-style and Seq2Seq-style methods, the depth of the LSTM layer was set to three and two, respectively. In this setting, all methods have a total of six LSTM-layers. The dimensions of POS and dependency-relation label embeddings were set to 40. All parameters were initialized as per the method described by Glorot and Bengio (2010). For all methods, we applied Dropout (Srivastava et al. 2014) to the input of LSTM layers. All the dropout rates were set to 0.3.
During training, the learning rate was tuned with Adam (Kingma and Ba 2014). The initial learning rate was set to 0.001. The maximum number of training epochs was set to 30. All the gradients were averaged within each mini-batch. The maximum mini-batch size was set to 16. The order of mini-batches was shuffled at the end of each training epoch. The gradient clipping threshold was set to 5.0. We selected trained models with early stopping by maximizing the per-sentence accuracy (i.e., how many compressions could be fully reproduced) on the development dataset.
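For reference, the training settings described above can be summarized as a small configuration sketch (the key names are ours; the values are the ones stated in the text):

```python
TRAIN_CONFIG = {
    "word_emb_dim": 100, "lstm_dim": 100, "attention_dim": 100,
    "pos_emb_dim": 40, "dep_label_emb_dim": 40,
    "lstm_depth": {"tagger_style": 3, "seq2seq_style": 2},
    "dropout": 0.3, "optimizer": "Adam", "initial_lr": 0.001,
    "max_epochs": 30, "max_minibatch_size": 16, "grad_clip": 5.0,
    "early_stopping_metric": "per-sentence accuracy on the development set",
}
```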
To obtain a compressed sentence, we used greedy decoding instead of beam search decoding, because the latter did not demonstrate better performance on the development dataset. This may be related to the small search space of this task, because the number of labels is only three. All the methods were implemented in C++ on DyNet (Neubig et al. 2017).

Automatic Evaluation
For automatic evaluation, we used the token-level F1-measure (F1) as well as the recall scores of ROUGE-1, -2, and -L (Lin and Och 2004). We used the ROUGE-1.5.5 script for calculating the ROUGE scores. When calculating the ROUGE-1 score, we excluded stop words by using the options "-n 1 -m -d -s -a" to consider the informativeness of the compressed sentences. When calculating the ROUGE-2 and ROUGE-L scores, we used the options "-n 2 -m -d -a" to avoid meaningless collocations. For a fair comparison when calculating ROUGE scores, if a system output exceeded the reference summary byte length, we truncated the excess tokens.
We used ∆C = system compression ratio − gold compression ratio to evaluate the closeness of the compression ratio of system outputs to that of the gold compressed sentences. Notably, we calculated ∆C in this study after rounding the compression ratios for each method. We used micro-averages for the F1-measure and compression ratio, and macro-averages for the ROUGE scores.
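A minimal sketch of the token-level F1 and ∆C computations on kept-token index sets follows; this is our own simplification, whereas the actual evaluation uses micro-averaged F1 and rounded compression ratios as described above.

```python
def token_f1(system_kept, gold_kept):
    """Token-level F1 between the sets of kept token indices (sketch)."""
    tp = len(set(system_kept) & set(gold_kept))
    precision = tp / max(len(system_kept), 1)
    recall = tp / max(len(gold_kept), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

def delta_c(system_kept, gold_kept, source_length):
    """Delta C = system compression ratio - gold compression ratio."""
    return len(system_kept) / source_length - len(gold_kept) / source_length
```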
To verify the superiority of our methods on long sentences, we additionally reported the scores on sentences longer than the average sentence length (= 27.04) in the test set.
All the results are reported as the average scores and their standard deviation of five trials.

HiSAN (d = {1}) outperformed HiSAN-Dep in F1 scores in both the ALL and LONG settings.
This result shows the effectiveness of jointly learning the dependency parse tree and the output labels.

Human Evaluation
For human evaluation, we compared the baselines with our method that achieved the highest F1 score in the automatic evaluations. Sentences that merely list a large number of entities or that contain long quotations are unsuitable for manual evaluation owing to their various possible interpretations. Thus, we first removed such sentences from the test dataset. Then, we selected the first 100 sentences longer than the average sentence length (= 27.04) of the filtered test set, in order of appearance, for human evaluation. When removing such sentences, we identified them by the following rules: the sentence includes more than ten commas, or the sentence includes a quotation symbol ("). Notably, we determined the comma threshold by verifying that the human evaluation dataset did not contain any removable sentences. We show the statistics of the selected sentences in Figs. 8 and 9.
Similar to Filippova et al. (2015), a compressed sentence was rated by five annotators who were asked to select a rating on a five-point Likert scale, ranging from one to five for readability (Read) and for informativeness (Info). We report the average of these scores from the five raters.
To investigate the differences between the methods, we also compared the baseline methods and HiSAN using those sentences for which the methods produced different compressions.

Fig. 8: Frequencies for each sentence length. Fig. 9: Frequencies for each dependency depth.

Analysis
Fig. 10 presents the F1 scores of each method for each sentence length. Notably, the HiSANs presented in this figure are the models that achieved the best F1 scores on the validation dataset. In the results for sentences longer than 45 words, we can clearly observe that the syntactic information compensates for the performance degradation of the Seq2Seq models. Tagger is also effective for such sentences: because it does not have a decoder that must memorize previously predicted labels for a correct prediction, it can deal with long sentences. However, the overall compression performance of Tagger is lower than that of the Seq2Seq-based methods. For sentences with 45 words or fewer, the HiSANs achieved high scores regardless of syntactic information. These results show that the network structure of HiSAN itself contributes to the improvement in F1 scores. Table 5 shows examples of source sentences and their compressed variants generated by the baselines and the HiSANs. We chose the model with the highest F1 on the test set after five trials.
For both examples, the compressed sentences output by Base are grammatically correct.
However, their informativeness is inferior to that attained by HiSAN w/ syn (d = {1, 2, 4}). The compressed sentence output by HiSAN-Dep in the second example lacks both readability and informativeness. Because HiSAN-Dep employs features obtained from the dependency tree in a pipeline procedure, it is influenced by the parsed tree. Based on this information, we checked the actual parsed tree and observed that the parent of "US" was wrongly predicted as "whistler" instead of "Manning". Zewei et al. (2020) revealed that Seq2Seq produces output related to the words indicated by attention. In the case of HiSAN-Dep, we can observe that the content of the wrongly retained parts of the compression is actually related to "whistler". From this observation, we believe that the compression failure in this case was caused by the incorrect parse result.

Table 5 (first example). Source sentence and compressions generated by each method:
Input: Pakistan signed a resolution on Monday to import 1,300 MW of electricity from Kyrgyz Republic and Tajikistan to overcome power shortage in summer season, said an official press release.
Gold: Pakistan signed a resolution to import 1,300 MW of electricity from Kyrgyz Republic and Tajikistan.
Tagger: Pakistan signed a resolution to import 1,300 MW of electricity Tajikistan to overcome shortage.
Tagger+ILP: Pakistan signed resolution to import MW said.
Base: Pakistan signed a resolution to import 1,300 MW of electricity.
HiSAN-Dep (d = {1}): Pakistan signed a resolution to import 1,300 MW of electricity.
As reported in previous papers (Klerke et al. 2016; Wang et al. 2017), the F1 scores of Tagger match or exceed those of the Seq2Seq-based methods. The compressed sentence generated by Tagger in the first example in Table 5 is ungrammatical. We believe that this is mainly because Tagger cannot consider the predicted labels of the previous words.
Considering that the training dataset does not contain such dependency relationships, we can estimate that these arcs are learned to support sentence compression. This result meets our expectation that dependency chain information is necessary for compressing sentences accurately.
We present a matrix-style visualization of the attention distribution for HiSAN w/o syn in Fig. 12. Different from HiSAN w/ syn, HiSAN w/o syn displays smooth distributions owing to the lack of supervised dependency tree information. Therefore, visualizing them in the dependency style shown in Fig. 11 is unsuitable for their interpretation.

Table 6: Results of automatic evaluation using sentences with deep dependency trees (deeper than the average depth, 8). The notations are the same as in Table 2.

Fig. 11: An example compressed sentence and its dependency graph with HiSAN (d = {1, 2, 4}). The gray-colored words represent deleted words. The number on each arc represents the probabilistic weight of the relationship between the parent and child words. The arcs contained in the parsed dependency tree are drawn on the top side; the arcs not contained in the parsed dependency tree are drawn at the bottom.

Related Work

Several previous methods construct a dependency parse jointly with the output sequence during decoding to improve output quality. However, these methods fail to provide an entire parse tree until the decoding phase is finished; thus, they cannot track all the possible parents of each word within the decoding process. Similar to HiSAN, Hashimoto and Tsuruoka (2017) used dependency features as attention distributions; however, different from HiSAN, they used pre-trained dependency relations and did not take the chains of dependencies into account. They reported an improvement in BLEU scores when their Seq2Seq model was trained with syntactic dependency information. Marcheggiani and Titov (2017) and Bastings et al. (2017) considered higher-order dependency relationships in Seq2Seq by incorporating a graph convolution technique (Kipf and Welling 2016) into the encoder. However, the dependency information for the graph convolution technique was still provided in a pipeline manner, and thus their method cannot work without syntactic tree information. This weak point restricts the applicability of their model when a target dataset is not accompanied by syntactic trees, as is the case with out-of-domain datasets and low-resource languages.
Unlike the above methods, HiSAN can capture higher-order dependency features using d-length dependency chains without relying on pipeline processing. In addition, it can continue learning even in the absence of syntactic dependency information. Recently, Zhao et al. (2018) also proposed a method that can avoid the effect of parse failures by incorporating a syntax-based language model into a sequential tagger as a reward for reinforcement learning. HiSAN itself has since been extended (Kamigaito and Okumura 2020); this extension included a new decoding method not considered in the original HiSAN, which can take into account words that will be decoded in the future to compress sentences more accurately. Furthermore, HiSAN was also extended to single-document summarization (Ishigaki et al. 2019). The success of these extensions shows the versatility of HiSAN.

Conclusion
In this study, we investigated the performance of HiSAN, our proposed model that incorporates higher-order dependency features into Seq2Seq to compress sentences of all lengths.
Experiments on the Google sentence compression test data showed that HiSAN achieved better results than the baseline methods in terms of F1 as well as ROUGE-1, -2, and -L scores (83.2, 78.6, 71.6, and 78.3, respectively). In particular, when challenged with longer-than-average sentences, HiSAN outperformed the baseline methods in terms of F1 and ROUGE-1, -2, and -L scores.
HiSAN also outperformed the previous methods in both readability and informativeness during the human evaluations.