Heterogeneous Graph Based Extractive Summarization Considering Discourse and Coreference Relations

Modeling the relations between text spans in a document is a crucial yet challenging problem for extractive summarization. Various kinds of relations exist among text spans of different granularity, such as discourse relations between elementary discourse units and coreference relations between phrase mentions. In this paper, we utilize heterogeneous graphs that contain multiple edge/node types to model the input document as well as the various relations among text spans in it. Also, we propose a heterogeneous graph based model for extractive summarization that considers the heterogeneity of the document graph. Experimental results on a benchmark summarization dataset verify the effectiveness of our proposed method.

Both sentence (a) and sentence (b) contain important information about the salient events (highlighted in bold font in the above example). However, there are also plenty of peripheral details that could safely be excluded from the summary without harming the overall understanding of the events (colored in gray in the above example).
To avoid the sentence-level redundancy shown in the examples above, we perform extraction at a smaller granularity, the Elementary Discourse Unit (EDU), in this work. The EDU is a sub-sentence unit that originated in discourse analysis (Mann and Thompson 1988). In the examples above, sentence (a) can be further segmented into two EDUs, and sentence (b) can be segmented into five EDUs. Among them, only EDU a1, EDU b1, EDU b3, and EDU b4 present the central concepts of their sentences. By performing EDU-level extractive summarization, we can eliminate unwanted trivial details like EDU a2, EDU b2, and EDU b5.

The goal of extractive summarization is to identify salient text spans that represent the central ideas of the input document. Thus, it is crucial to model the overall document structure and the various relations between text spans across the document. Natural language documents have a hierarchical nature, with each level corresponding to a different granularity: document, sentences, EDUs, words, and phrases (Figure 1(a)). Between text spans of different granularity, there exist many different kinds of relations. For example, discourse relations exist between EDUs within a document, and coreference relations exist between mention phrases that refer to the same real-world entity (Figure 1(b)).

The discourse relations between EDUs provide important clues for extractive summarization. The discourse structure describes the high-level linguistic structure of the document, and the concept of nucleus and satellite discourse units in Rhetorical Structure Theory (Mann and Thompson 1988) defines the relative saliency of discourse units. Thus, discourse relations are helpful in identifying the salient EDUs of an input document. Also, since salient entities are often mentioned multiple times in a given document, the coreference relations between mention phrases can implicitly capture the narrative structure of the document.
Figure 2 shows a document where the protagonist 'Yahya Rashid' is mentioned multiple times. By observing each mention of the entity 'Yahya Rashid' as well as its context, we can observe the narrative revolving around the given protagonist.
Due to its complex nature, modeling the various relations among text spans of a document remains an open challenge. Some recent works capture inter-sentential relations by utilizing recurrent neural networks (RNNs) or Transformer-based (Vaswani et al. 2017) encoders on top of the acquired sentence representations (Cheng and Lapata 2016; Nallapati et al. 2017; Liu and Lapata 2019). However, empirical observations show that these sentence-level encoders do not bring much performance gain (Liu and Lapata 2019). Graph structure is an intuitive way to model long-range relations among text spans throughout a document. Early works build connectivity graphs based on content similarity between sentences (Erkan and Radev 2004; Mihalcea and Tarau 2004). Some recent works incorporate discourse or coreference relations into the graph structure and utilize graph neural networks (GNNs) to obtain a high-level representation of text spans (Yasunaga et al. 2017; Xu and Durrett 2019; Xu et al. 2020). Most of these works operate on homogeneous graphs such as the Approximate Discourse Graph (ADG) (Christensen et al. 2013) or the Rhetorical Structure Theory (RST) (Mann and Thompson 1988) dependency graph. By definition, a homogeneous graph contains only one type of node as well as only one type of edge.
On the other hand, a heterogeneous graph consists of more than one type of node or more than one type of edge. As illustrated in Figure 1, various types of relations exist between text spans of different granularity. To model the various text spans of different granularity (nodes) as well as the various types of relations (edges) among them, a heterogeneous graph is a more natural choice than a homogeneous graph.
In this paper, we propose a novel heterogeneous graph based model for extractive summarization. Our main contribution is threefold: (1) We propose a heterogeneous document graph that incorporates multiple types of relations simultaneously for extractive summarization.
(2) We propose a GAT-based graph encoder that considers the heterogeneity of the document graph. (3) We conducted experiments on summarization benchmark corpora CNN/DailyMail (CNNDM) (Hermann et al. 2015) and New York Times (Sandhaus 2008) and verified the effectiveness of our proposed method.

Document Graph Construction
In this section, we introduce the construction of document graphs. Section 2.1 gives details of the pre-processing steps to obtain the discourse and coreference information required for the construction of document graphs. Section 2.2 gives a brief introduction to the homogeneous document graphs used in previous works as well as their limitations. Finally, Sections 2.3 and 2.4 introduce two heterogeneous document graphs that tackle the limitations of the homogeneous document graphs.

Pre-processing
Both discourse information and coreference information are helpful in deciding content saliency, and both have been incorporated as external knowledge into extractive summarization systems.
Given an input document D, we first perform the following pre-processing steps and construct the document graphs based on the discourse information and coreference information acquired.
Further, we perform RST discourse parsing to identify the rhetorical relations between the EDUs. In the RST framework, the discourse structure of document D is represented by a constituency tree. For example, Figure 3 illustrates an RST parse tree consisting of five EDUs. The RST parse tree is constructed by recursively merging adjacent discourse units into larger discourse units. In addition, the merged discourse units are tagged as either nucleus (N) or satellite (S), which indicates their relative nuclearity/saliency: nucleus units are considered more salient, while satellite units are less important in content. RST defines two types of rhetorical relations between discourse units:
• Mononuclear relation, which links a satellite unit and a nucleus unit, such as the relation between EDU 1 and EDU 2.
• Multinuclear relation, which links two nucleus units, such as the relation between EDU 1-2 and EDU 3-5.
Similar to Xu et al. (2020), we convert the RST tree to its dependency form based on the rhetorical relations in the tree. The RST dependency graph consists of EDU nodes. For a mononuclear relation, the dependency graph contains a directed edge from the satellite node to the corresponding nucleus node. For a multinuclear relation, we link the two nucleus nodes in both directions.
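The conversion above can be sketched as follows. This is a minimal illustration, under the simplifying assumptions that the RST tree is binary and that the head EDU of a subtree is the head of its (leftmost) nucleus child; the tree encoding and function names are our own.

```python
def head(node):
    """Return the head EDU index of a subtree (a leaf is its own head)."""
    if "edu" in node:
        return node["edu"]
    nuclei = [c for c in node["children"] if c["nuclearity"] == "N"]
    return head(nuclei[0])

def rst_to_dependency(node, edges):
    """Collect directed dependency edges (source, target) from an RST tree."""
    if "edu" in node:
        return
    left, right = node["children"]
    if left["nuclearity"] == "N" and right["nuclearity"] == "N":
        # multinuclear: link the two nucleus heads in both directions
        edges.append((head(left), head(right)))
        edges.append((head(right), head(left)))
    else:
        sat = left if left["nuclearity"] == "S" else right
        nuc = right if sat is left else left
        # mononuclear: directed edge from the satellite head to the nucleus head
        edges.append((head(sat), head(nuc)))
    for child in (left, right):
        rst_to_dependency(child, edges)

# Toy tree: EDU 1 (N) merged with EDU 2 (S); the result (N) merged with EDU 3 (N)
tree = {"nuclearity": "N", "children": [
    {"nuclearity": "N", "children": [
        {"edu": 1, "nuclearity": "N"},
        {"edu": 2, "nuclearity": "S"}]},
    {"edu": 3, "nuclearity": "N"}]}
edges = []
rst_to_dependency(tree, edges)
print(edges)  # [(1, 3), (3, 1), (2, 1)]
```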

Coreference Resolution
In addition, we perform coreference resolution to identify coreferent relations among mention phrases in the input document D. The goal is to identify mention phrases that refer to the same real-world entity, such as the mentions 'Boston native Mark Wahlberg' and 'Wahlberg' (highlighted in red) in Figure 1. The mentions in document D are clustered into k entities {e 1 , ..., e k }, with each entity e i representing a cluster of mentions among which coreference relations hold.

Homogeneous Document Graphs
In many works on graph-based extractive summarization, discourse information and coreference information are used to build homogeneous document graphs with sentence/EDU nodes.
By definition, homogeneous graphs contain only one type of node and one type of edge.
For example, we can embed discourse relations with the RST dependency graph ( Figure 3).
Also, we can build a homogeneous coreference graph (Figure 4(a)) with EDU nodes in which EDU nodes containing a common entity are linked. However, it is not straightforward to incorporate discourse relations and coreference relations together in a single homogeneous graph.
Some previous works like DiscoBERT (Xu et al. 2020) utilize two different graphs to embed discourse and coreference relations separately. However, this method neglects the interaction between different relation types. Some other works use weighted graphs (Figure 4(b)) like ADG to embed multiple types of textual relations, with the edge weights representing the overall 'strength' of relations between text nodes (Christensen et al. 2013). Although it is possible to combine multiple different relation types within the same graph in this manner, it is difficult to design the proper edge weights.

Fig. 4
Homogeneous and heterogeneous document graphs.

Heterogeneous Document Graph with Multiple Edge Types
To tackle the limitations of homogeneous document graphs, we consider a heterogeneous document graph with multiple edge types. As shown in Figure 4(c), we represent each input document D with a heterogeneous graph G 1 = {V, E}, with V and E being the set of nodes and edges, respectively.
In the pre-processing step, we perform discourse parsing and coreference resolution on D (Section 2.1). Based on the pre-processing results, we construct the heterogeneous document graph G 1 .
E consists of three types of edges:
• Discourse edge: Based on the discourse parsing results, we add the RST dependency edges between EDU nodes.
• Coreference edge: Based on the coreference resolution results, two EDUs that contain mentions of the same entity are connected.
• Same-sentence edge: EDU nodes that belong to the same sentence are connected.
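As a concrete sketch, the edge sets of G 1 could be assembled as below. The input formats (a dependency edge list, coreference clusters as sets of EDU indices, and a sentence-to-EDUs mapping) are illustrative assumptions, not the paper's exact data structures.

```python
from itertools import combinations

def build_g1(rst_edges, coref_clusters, sent_to_edus):
    """Collect typed edges (src, tgt, type) over EDU nodes."""
    edges = set()
    for src, tgt in rst_edges:                    # discourse edges (directed)
        edges.add((src, tgt, "discourse"))
    for cluster in coref_clusters:                # coreference edges (undirected)
        for u, v in combinations(sorted(cluster), 2):
            edges.add((u, v, "coref"))
            edges.add((v, u, "coref"))
    for edus in sent_to_edus.values():            # same-sentence edges (undirected)
        for u, v in combinations(sorted(edus), 2):
            edges.add((u, v, "same-sent"))
            edges.add((v, u, "same-sent"))
    return edges

g1 = build_g1(rst_edges=[(2, 1), (1, 3), (3, 1)],
              coref_clusters=[{1, 3}],
              sent_to_edus={0: [1, 2], 1: [3]})
print(sorted(g1))
```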
In this work, we use the above-mentioned document graph G 1 as a baseline to compare with our proposed heterogeneous document graph with multiple node types (introduced in Section 2.4).

Heterogeneous Document Graph with Multiple Node Types
In Figure 4(a)-(c), the coreference relations are represented by the edges between EDU nodes.
However, the important information about the coreferent entities is overlooked. For example, in Figure 4(c), a coreference edge only indicates that two EDUs share one or more common entities, but does not indicate which entities are actually shared by the EDUs.
To tackle the above limitations, we propose a heterogeneous document graph with multiple node types. As shown in Figure 4, V contains three types of nodes:
• EDU nodes V_d = {d_1, ..., d_n}, with d_i representing the i-th EDU in D.
• Sentence nodes V_s = {s_1, ..., s_m}, with s_i representing the i-th sentence in D.
• Entity nodes V_e = {e_1, ..., e_k}, with each e_i representing an entity mentioned in D.
We utilize the pre-processing results acquired in Section 2.1 to construct the heterogeneous document graph G_2 as follows:
• Discourse edge: We add the RST dependency edges between EDU nodes to our document graph G_2 to model the discourse structure of the document.
• Coreference edge: We use edges between EDU nodes and entity nodes to embed the coreference relations. If EDU d i contains a mention of entity e j , then we add an undirected edge (d i , e j ) to E. That is, each entity node indirectly connects all EDUs that contain mentions of the entity. In this way, the subgraph around a specific entity node implicitly models the narrative structure related to the entity.
• Same-sentence edge: Based on the discourse segmentation result, we connect each sentence node to its constituent EDU nodes.
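The construction of G 2 can be sketched as below: coreference is expressed with k EDU-entity edges for an entity with k mentions, rather than pairwise EDU-EDU edges. The input formats are illustrative assumptions.

```python
def build_g2(n_edus, rst_edges, entity_mentions, sent_to_edus):
    """Build typed nodes and edges for the multi-node-type graph G2."""
    nodes = [("edu", i) for i in range(1, n_edus + 1)]
    nodes += [("sent", s) for s in sent_to_edus]
    nodes += [("ent", e) for e in entity_mentions]
    edges = set()
    for src, tgt in rst_edges:                      # discourse edges
        edges.add((("edu", src), ("edu", tgt), "discourse"))
    for ent, edus in entity_mentions.items():       # coreference edges (undirected)
        for d in edus:
            edges.add((("edu", d), ("ent", ent), "coref"))
            edges.add((("ent", ent), ("edu", d), "coref"))
    for s, edus in sent_to_edus.items():            # sentence-EDU edges (undirected)
        for d in edus:
            edges.add((("sent", s), ("edu", d), "same-sent"))
            edges.add((("edu", d), ("sent", s), "same-sent"))
    return nodes, edges

nodes, edges = build_g2(3, [(2, 1)], {"e1": [1, 3]}, {0: [1, 2], 1: [3]})
print(len(nodes), len(edges))
```

Note that the entity "e1" with 2 mentions costs 2 undirected EDU-entity edges here, whereas a homogeneous coreference graph would connect the mention-bearing EDUs pairwise.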

Problem Formulation
Given an input document D with n EDUs {d_1, d_2, ..., d_n}, we formulate the EDU-based extractive summarization task as a sequence labeling problem. The model predicts a sequence of binary labels {ŷ_1, ŷ_2, ..., ŷ_n}, where ŷ_i ∈ {0, 1} indicates whether EDU d_i should be included in the summary. From the human-written summary, we heuristically obtain the oracle labels {y*_1, y*_2, ..., y*_n}, which are used to train our extractive summarization model; further details are given in Section 4.1. In the inference stage, the model predicts the binary labels for each EDU in the input document, and the EDUs with label ŷ_i = 1 are concatenated to form the summary.
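The oracle-label heuristic itself is specified in Section 4.1; as a hedged sketch of a common variant, one can greedily add the EDU that most improves word overlap with the reference summary (a crude stand-in for the ROUGE-based selection typically used). The function names and the overlap measure here are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def overlap(selected, reference_counts):
    """Clipped unigram overlap between selected EDUs and the reference."""
    cand = Counter(w for edu in selected for w in edu.split())
    return sum(min(c, reference_counts[w]) for w, c in cand.items())

def greedy_oracle(edus, reference, max_edus=3):
    """Greedily label EDUs 1/0 by marginal overlap gain with the reference."""
    ref = Counter(reference.split())
    chosen, labels = [], [0] * len(edus)
    for _ in range(max_edus):
        base = overlap(chosen, ref)
        gains = [(overlap(chosen + [e], ref) - base, i)
                 for i, e in enumerate(edus) if labels[i] == 0]
        if not gains:
            break
        best_gain, best_i = max(gains)
        if best_gain <= 0:
            break
        labels[best_i] = 1
        chosen.append(edus[best_i])
    return labels

edus = ["the plane landed safely", "according to reports",
        "all passengers were unharmed"]
labels = greedy_oracle(edus, "plane landed safely and passengers unharmed")
print(labels)  # [1, 0, 1]
```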

Model Overview
Figure 5 provides an overview of our proposed model. First, the input document D is fed into a Longformer (Beltagy et al. 2020) document encoder. With the self-attention based EDU and entity encoders, we acquire the initial node representations of the heterogeneous document graph (Section 3.2). We then apply a heterogeneous graph encoder to obtain high-level node representations that take the document graph structure into account (Section 3.3). Finally, we make predictions based on the learned EDU node representations (Section 3.4).

Graph Node Initialization
Following the settings of Liu and Lapata (2019), we utilize a pretrained Longformer (Beltagy et al. 2020) to encode the input document D. We insert the ⟨CLS⟩ and ⟨SEP⟩ special tokens at the beginning and the end of each sentence s_i, respectively. The ⟨CLS⟩ token was originally used to aggregate features from a single sentence or a pair of sentences.
With the output vectors of Longformer, we acquire the initial node representations. In the following, we describe how we obtain the initial representations of sentence nodes, EDU nodes, and entity nodes, respectively.

Sentence Node Representations
For each sentence node s_i in V_s, we take the Longformer output vector of the ⟨CLS⟩ token before s_i as the sentence node representation h_{s_i}. We expect the ⟨CLS⟩ symbol to aggregate the information of the tokens of the sentence that follows it.

EDU Node Representations
We use a self-attention based EDU encoder to encode each EDU node in V_d. Given an EDU d_i whose tokens have Longformer output vectors {x_1, ..., x_l}, the encoder computes an attention score for each token and takes the weighted sum as the EDU representation:

a_j = softmax_j(v_2^T tanh(W_1 x_j + b_1)),    h_{d_i} = Σ_{j=1}^{l} a_j x_j

where W_1, b_1, v_2 are trainable weights.
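As a sketch, the trainable weights W_1, b_1, v_2 named above suggest a standard additive self-attention pooling of the form score_j = v_2^T tanh(W_1 x_j + b_1); the numerical example below is one such instantiation with illustrative dimensions, not necessarily the paper's exact encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_attn, n_tokens = 8, 4, 5
X = rng.normal(size=(n_tokens, d_model))      # Longformer outputs for one EDU
W1 = rng.normal(size=(d_attn, d_model))
b1 = rng.normal(size=d_attn)
v2 = rng.normal(size=d_attn)

scores = np.tanh(X @ W1.T + b1) @ v2          # one additive-attention score per token
a = np.exp(scores - scores.max())
a /= a.sum()                                  # softmax attention weights
h_edu = a @ X                                 # weighted sum -> EDU node vector
print(h_edu.shape)  # (8,)
```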

Entity Node Representations
The structure of the entity encoder is identical to the EDU encoder. For each entity e i in V e , we consider all mentions of it. By taking self-attention among the Longformer output vectors which correspond to tokens of these mentions, we can acquire the entity representation h e i .

Heterogeneous Graph Encoder
We initialize the representation of each node in G with the sentence representations ({h s i }), EDU representations ({h d i }), and entity representations ({h e i }) acquired in Section 3.2. We then use the heterogeneous graph encoder to learn high-level node representations considering the structure of the document graph.

Transformer Sub-layers
We first feed the sentence vectors and EDU vectors to the sentence-level and the EDU-level Transformer (Vaswani et al. 2017) sub-layers.
At the center of the Transformer sub-layer is a multi-head self-attention layer followed by a feed-forward layer. The self-attention mechanism can be seen as a fully-connected version of a graph attention network. To model the interactions among nodes of the same granularity, we utilize two types of Transformer layers, which operate on the sentence level and the EDU level, respectively.

Graph Attention Sub-layer
The graph attention sub-layer consists of a multi-head GAT network followed by a feed-forward layer. Taking the document graph and the node representations as input, the purpose of the graph attention sub-layer is to learn a higher-level representation of each node by aggregating information from its neighboring nodes. Here, we introduce two types of GAT networks, the vanilla GAT network and the heterogeneous GAT network. Similar to the one proposed in (Veličković et al. 2018), the vanilla GAT network handles the document graph G as a homogeneous graph and treats all types of nodes in the same way. On the other hand, our proposed heterogeneous GAT network considers the heterogeneity of G and applies different processing to different node types.

Vanilla GAT network
We apply graph attention networks (GAT) (Veličković et al. 2018) to update the node representations in G. For the i-th node, we update its representation h_i with the representations of its neighbors {h_j}:

α_ij = softmax_{j∈N(i)}((W_q h_i)^T (W_k h_j)),    h_i' = W_a Σ_{j∈N(i)} α_ij W_v h_j

where W_a, W_q, W_k, W_v are trainable weights. Figure 7 illustrates an example of the graph attention mechanism. The subgraph centered around node EDU 1 is highlighted in Figure 7(a). EDU 1 is connected to a sentence node sent 1, two EDU nodes EDU 2 and EDU 3, and two entity nodes entity 1 and entity 2. With the vanilla GAT network, we calculate the attention weights across the five neighbors of EDU 1 and update its node representation h_{d_1} accordingly (Figure 7(b)). Although a single GAT layer only considers first-degree neighbors, by stacking several GAT layers we can obtain a higher-level representation for each node in G.
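One update step of this kind can be sketched numerically as below, using a dot-product attention over a node's neighbors with shared query/key/value/output projections W_q, W_k, W_v, W_a. Dimensions, the scaling factor, and the single-head setting are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d = 8
H = rng.normal(size=(6, d))               # representations of 6 nodes
Wq, Wk, Wv, Wa = (rng.normal(size=(d, d)) for _ in range(4))

def gat_update(i, neighbors):
    """Update node i by attending over its first-degree neighbors."""
    scores = np.array([(Wq @ H[i]) @ (Wk @ H[j]) / np.sqrt(d)
                       for j in neighbors])
    alpha = softmax(scores)               # attention weights over neighbors
    agg = sum(a * (Wv @ H[j]) for a, j in zip(alpha, neighbors))
    return Wa @ agg                       # updated representation of node i

h1_new = gat_update(0, neighbors=[1, 2, 3, 4, 5])
print(h1_new.shape)  # (8,)
```

Stacking this update N times lets information flow across N-hop neighborhoods, which matters for the meta-path discussion later in the paper.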

Heterogeneous GAT Network
The vanilla GAT network disregards the heterogeneity of the document graph and treats different types of nodes identically. Taking Figure 7 as an example, all three types of neighboring nodes of EDU 1 (sentence, EDU, and entity nodes) use the same equation for calculating the attention scores α (Figure 7(b)).
Considering the heterogeneous nature of the document graph G_2, we introduce a heterogeneous version of the GAT network (Figure 7(c)) that uses a separate key matrix for each node type. For example, the attention score α between sent 1 and EDU 1 is calculated with the key matrix for sentence nodes W_k^s, while the attention score α between entity 1 and EDU 1 is calculated with the key matrix for entity nodes W_k^e.

As for the baseline document graph G_1, we consider a heterogeneous GAT network that instead accounts for the multiple edge types. We introduce binary variables d_ij, c_ij, and s_ij to indicate the existence of a discourse edge, coreference edge, and same-sentence edge between node i and node j. For example, if there is a discourse edge between node i and node j, then d_ij = 1; otherwise, d_ij = 0. The attention score between node i and node j is then calculated with edge-type-specific key matrices:

e_ij = (W_q h_i)^T (d_ij W_k^d + c_ij W_k^c + s_ij W_k^s) h_j,    α_ij = softmax_{j∈N(i)}(e_ij)
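The node-type-aware variant can be sketched as below: the key projection depends on the neighbor's type, while the query and value projections are shared. The exact parameter sharing, dimensions, and single-head setting are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
d = 8
Wq, Wv, Wa = (rng.normal(size=(d, d)) for _ in range(3))
Wk = {"sent": rng.normal(size=(d, d)),   # key matrix per neighbor node type
      "edu": rng.normal(size=(d, d)),
      "ent": rng.normal(size=(d, d))}

# Node representations keyed by (type, index), mirroring Figure 7's subgraph
H = {("sent", 1): rng.normal(size=d), ("edu", 1): rng.normal(size=d),
     ("edu", 2): rng.normal(size=d), ("edu", 3): rng.normal(size=d),
     ("ent", 1): rng.normal(size=d), ("ent", 2): rng.normal(size=d)}

def het_gat_update(node, neighbors):
    """Update a node with type-specific key projections for each neighbor."""
    q = Wq @ H[node]
    scores = np.array([q @ (Wk[t] @ H[(t, j)]) / np.sqrt(d)
                       for t, j in neighbors])
    alpha = softmax(scores)
    agg = sum(a * (Wv @ H[n]) for a, n in zip(alpha, neighbors))
    return Wa @ agg

nbrs = [("sent", 1), ("edu", 2), ("edu", 3), ("ent", 1), ("ent", 2)]
h = het_gat_update(("edu", 1), nbrs)
print(h.shape)  # (8,)
```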

Prediction Layer
We feed the final representation of the EDU nodes h_{d_i} to the prediction layer with sigmoid activation to predict the binary labels:

ŷ_i = σ(W_o h_{d_i} + b_o)

The training loss of the model is the binary cross-entropy loss L against the oracle extraction labels:

L = -Σ_{i=1}^{n} (y*_i log ŷ_i + (1 - y*_i) log(1 - ŷ_i))

in which {y*_i} are the oracle labels and {ŷ_i} are the probabilities predicted by the model.
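A minimal numerical sketch of a sigmoid prediction layer with binary cross-entropy against oracle labels follows; the parameter names W_o, b_o and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 8, 4
H_edu = rng.normal(size=(n, d))                    # final EDU node representations
Wo, bo = rng.normal(size=d), 0.0                   # prediction-layer parameters
y_hat = 1.0 / (1.0 + np.exp(-(H_edu @ Wo + bo)))   # sigmoid scores per EDU
y_star = np.array([1, 0, 1, 0])                    # oracle labels
# mean binary cross-entropy between predictions and oracle labels
loss = -np.mean(y_star * np.log(y_hat) + (1 - y_star) * np.log(1 - y_hat))
print(y_hat.shape)  # (4,)
```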

Dataset
We evaluated our proposed model on the benchmark CNN/DailyMail dataset (Hermann et al. 2015) and the New York Times dataset (Sandhaus 2008).

Pre-processing
We used Stanford CoreNLP (Manning et al. 2014) to split sentences. Further, we used the RST discourse parser of Ji and Eisenstein (2014) for both discourse segmentation and discourse parsing. For coreference resolution, we used the SpanBERT-based (Joshi et al. 2020) version of the end-to-end coreference resolver of Lee et al. (2017).

Hyper-parameter Settings
We used the 'longformer-base-4096' version of Longformer to encode the input document.
The length of each document is truncated to 1024 BPEs. The hidden size of the EDU encoder and the entity encoder is 128. Based on the evaluation losses on the validation set, we set the number of stacked graph encoding layers to N = 2. For both Transformer sub-layers and the graph attention sub-layers, the number of attention heads is set to 8, with each head having a hidden size of 96.

Training and Evaluation
During training, we used a batch size of 20. We used the Adam optimizer with β1 = 0.9 and β2 = 0.999 and followed the learning rate (lr) scheduling of (Vaswani et al. 2017) with a warm-up of 4000 steps (n_warmup):

lr = 2e-3 · min(n_step^{-0.5}, n_step · n_warmup^{-1.5})

All models are trained for 60,000 steps. We selected the top-3 checkpoints based on the evaluation losses on the validation set and report their average scores on the test set.
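The schedule above can be written directly as:

```python
def lr(step, warmup=4000, scale=2e-3):
    """Noam-style schedule: linear warm-up, then inverse-sqrt decay."""
    return scale * min(step ** -0.5, step * warmup ** -1.5)

# The peak is reached exactly at the warm-up boundary.
print(lr(4000))  # 2e-3 * 4000**-0.5 ≈ 3.16e-5
```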
During the inference phase, the trained model is used to obtain a likelihood score for each EDU. The top-5 EDUs with the highest likelihood scores are concatenated to generate the final summary.
We also perform trigram blocking in the inference phase, which is a simple yet effective way to reduce redundancy in extractive summarization.
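Trigram blocking can be sketched as below: when traversing EDUs in descending score order, an EDU is skipped if it shares any trigram with the summary selected so far. The example sentences are illustrative.

```python
def trigrams(text):
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def select(edus_by_score, k=5):
    """Pick up to k EDUs, skipping any that repeats an already-seen trigram."""
    summary, seen = [], set()
    for edu in edus_by_score:          # EDUs sorted by likelihood, descending
        tg = trigrams(edu)
        if tg & seen:
            continue                   # blocked: would repeat a trigram
        summary.append(edu)
        seen |= tg
        if len(summary) == k:
            break
    return summary

ranked = ["the storm hit the coast at dawn",
          "officials said the storm hit the coast",   # shares a trigram: blocked
          "thousands were evacuated overnight"]
print(select(ranked, k=2))
```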
We adopt ROUGE (Lin 2004) as the evaluation metric and report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L for our proposed models and the baselines, including the state-of-the-art EDU-based model DiscoBERT (Xu et al. 2020).

Results on CNN/DailyMail Dataset
As Table 1 shows, all our proposed models outperform LEAD-3 and all sentence-based extractive baseline models. Compared to the BertSum(sent) baseline, our proposed model (hetGAT, 1024) achieved higher ROUGE scores on all three metrics (R-1/R-2/R-L). We conclude that EDU-based extraction is a promising direction in extractive summarization.
Our proposed model (hetGAT) also outperforms the BertSum(EDU) baseline by a significant margin on all three metrics (R-1/R-2/R-L). This result shows the effectiveness of our graph encoder module in capturing the complex relations among the text spans of the input documents.
Compared to the state-of-the-art EDU-extraction model DiscoBERT, our proposed model (hetGAT, 1024) achieved comparable performance on the R-1/R-2 metrics and outperformed it on the R-L metric by 0.45 F1. However, the proposed model with a 768-BPE maximum input size (hetGAT, 768) performs worse than DiscoBERT on the R-1/R-2 metrics. DiscoBERT incorporates a strict RST-based rule during oracle construction and post-processing to ensure discourse consistency. Since the purpose of this paper is to propose a heterogeneous graph based method for modeling text span relations, we leave the question of discourse consistency to future work.¹ Finally, for both input lengths (768 and 1024), we observe that the hetGAT model outperforms the vanillaGAT model. This shows the effectiveness of our proposed heterogeneous GAT networks in capturing and aggregating the various text relations in the heterogeneous document graphs.

Results on New York Times Dataset
The results on the New York Times dataset are also included in Table 1. Our proposed method (hetGAT) outperforms the BertSum(EDU) baseline on all R-1/R-2/R-L metrics by a significant margin. Compared to the state-of-the-art model DiscoBERT, our proposed model is inferior on the R-1/R-2 metrics but outperforms DiscoBERT on the R-L metric.

Compared to DiscoBERT, our proposed model can better adapt to long input documents. First, the homogeneous graph structure of their work is not efficient in embedding coreference relations. For an entity with k mentions, the homogeneous coreference graph (like Figure 4(a)) needs k(k-1)/2 edges, while our proposed heterogeneous graph needs only k edges to represent the same coreference relation. Since both GCN (used in DiscoBERT) and GAT have time complexity O(|E|), the larger number of edges makes it difficult to adapt to longer documents. Also, unlike their GCN-based method, the GAT-based method we adopt does not require eigendecompositions or other costly matrix operations.

¹ The SpanBERT-based end-to-end coreference resolver (F1 = 0.77 on the OntoNotes corpus) performs better than the Stanford CoreNLP coreference resolver (F1 = 0.69 on OntoNotes) used in DiscoBERT.
Similar to the results on the CNN/DailyMail dataset, the hetGAT models perform better than the vanillaGAT models for both maximum input size settings (768 and 1024).

Ablation Study
We conduct ablation studies on the CNN/Daily Mail dataset by removing components from our proposed document graph (heterogeneous document graph G 2 introduced in Section 2.4).
The results of the ablation studies are shown in Table 2.
The first part shows the ablation study of the proposed (hetGAT) model. First, we remove the RST dependency edges between EDU nodes (-discourse). Next, we remove the coreferential edges between EDU nodes and entity nodes (-coref). We can see that both discourse and coreference information contribute significantly to the model performance, with discourse information being slightly more important than coreference information. We also tried removing the edges between sentence nodes and their constituent EDU nodes (-sent); however, linking the sentence and EDU nodes does not appear to have a significant impact on model performance.
The second part of Table 2 shows the ablation study of the proposed (vanillaGAT) model.
The results of the ablation study show a similar tendency compared to the hetGAT model.
We can observe from the results that both discourse and coreference information contribute to the model performance separately, but the two types of information do not combine as well as they do in the hetGAT model. This result illustrates the effectiveness of our proposed heterogeneous GAT network in handling various types of text relations simultaneously, compared to the vanilla GAT networks proposed for homogeneous graphs.

Results of Different Graph Structure
We perform experiments on different graph structures with the CNN/Daily Mail dataset.
We compare the system performance on the baseline document graph G 1 (Section 2.3) and our proposed heterogeneous document graph G 2 (Section 2.4).
As shown in Table 3, for both vanillaGAT and hetGAT, using document graph G 2 gives better performance than using G 1. Although both G 1 and G 2 embed the discourse, coreference, and same-sentence information within the document graph, G 2 includes richer information about the coreferent entities and sentences. The experimental results show the effectiveness of introducing extra entity and sentence nodes in the document graph.

Results of Different Hyper-parameter Settings
We perform experiments with different hyper-parameter settings to observe how the model performance changes. As in the ablation study, we use the heterogeneous graph G 2 in the following experiments.

Number of stacked graph encoding layers
We modify the number of stacked graph encoding layers (N ) and observe how it affects the performance. We report the performance on the validation set from N = 1 to N = 3 in terms of validation loss, ROUGE-1, ROUGE-2 and ROUGE-L in Figure 8.
As Figure 8 shows, there is a significant performance gap between N = 1 and N = 2, while there is no significant difference in model performance between N = 2 and N = 3.
The phenomenon can be explained by the meta paths of the document graph. A meta path is a widely used concept for modeling the various types of relations between nodes in a heterogeneous graph; each meta path captures a specific type of relation within the given graph. Figure 9 illustrates the meta paths in our proposed heterogeneous document graph:
(a) The EDU-EDU meta path represents the discourse relation between discourse units.
(b) The EDU-entity-EDU meta path connects two EDU nodes sharing a coreferent entity. This 2-hop meta path describes the coreference relation of the entity in the center.
(c) The entity-EDU-entity meta path describes the collocation relation between two entities that appear in the same EDU.
(d) The EDU-sent-EDU meta path represents the hierarchical structure of a sentence and its constituent EDUs.
The N = 1 model cannot capture 2-hop meta paths like EDU-entity-EDU, entity-EDU-entity, and EDU-sent-EDU. We speculate that this accounts for the performance loss of the N = 1 model.

Maximum Input Length
We modify the maximum input length (in BPEs) and observe how it affects the performance. We report the ROUGE-1, ROUGE-2 and ROUGE-L scores on the validation set with the maximum input size set to 512, 768, 1024, 2048, 4096 BPEs in Figure 10.
Generally, we can observe a performance gain by increasing the maximum input length. The performance gain is significant for the maximum input size under 1024. However, the gain is less significant if we increase the input size further.
The average document length in the CNN/DailyMail dataset is around 864 BPEs, with a standard deviation of 443 BPEs. Also, summary-worthy, salient sentences tend to appear at the beginning of a document. Together, these two facts support that setting the maximum input length to 1024 BPEs gives satisfactory results.

Qualitative Analysis
We also conduct a qualitative analysis of the proposed model. The effectiveness of discourse relations is more straightforward and has been widely studied in previous research; thus, we focus on the role of coreference information in our proposed summarization model. In the heterogeneous document graph, EDUs containing mentions of the same entity are indirectly connected through the node of that entity. By analyzing the output of the full proposed model and the model without coreference information (-coref), we found that the two models rank the importance of coreferent EDUs differently. We argue that the model with coreference information is better at discriminating the important EDUs among all EDUs sharing the same entity.

Graph-based Summarization
Graph-based summarization models have been broadly explored. Early works build connectivity document graphs based on inter-sentential similarity (Erkan and Radev 2004; Mihalcea and Tarau 2004). Wang et al. (2020) also utilize a heterogeneous graph, with sentence and word nodes. However, neither of the above works incorporates external knowledge into the graph. Cui et al. (2020) perform sentence-based extractive summarization with a heterogeneous graph of sentence nodes and nodes representing latent topics. Li et al. (2016) illustrate the potential of using the EDU as the extraction unit for summarization. Xu et al. (2020) introduce an end-to-end EDU-based extractive summarization model; by using a heuristic based on the RST dependency structure, they enhance the grammaticality and discourse consistency of the extracted summary.

Conclusion
In this paper, we proposed a novel heterogeneous graph based model for extractive summarization. By introducing nodes of different granularity, the heterogeneous document graph has the capacity to embed various types of relations between text spans. In addition, we proposed a heterogeneous GAT network that considers the heterogeneous nature of the document graph.
Experiments on the benchmark datasets illustrated the effectiveness of our proposed method.