Recent Rhetorical Structure Theory (RST)-style discourse parsing methods are trained by supervised learning, which requires an annotated corpus of sufficient size and quality. However, the RST Discourse Treebank, the most extensive such corpus, consists of only 385 documents, which is insufficient for learning the long-tailed distribution of rhetorical-relation labels. To address this problem, we propose a novel approach to improving performance on low-frequency labels. Our approach utilizes a silver dataset obtained from multiple existing parsers acting as teachers: we extract agreement subtrees, i.e., subtrees on which the RST trees built by the teacher parsers agree, to obtain a more reliable silver dataset. As the student parser, we use a span-based top-down RST parser, a state-of-the-art neural model. In our training procedure, we first pre-train the student parser on the silver dataset and then fine-tune it on the gold, human-annotated dataset. Experimental results showed that our parser achieved excellent scores for nuclearity and relation, 64.7 and 54.1 respectively, on the Original Parseval.
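To make the agreement-subtree idea concrete, the following Python sketch shows one way to intersect the subtrees produced by several teacher parsers. It assumes a simple hashable tuple encoding of RST trees; the representation and all names here are hypothetical illustrations, not the authors' actual implementation.

    # Hypothetical sketch of agreement-subtree extraction (not the paper's code).
    # A tree node is a nested tuple: (label, start_edu, end_edu, children),
    # where `children` is a tuple of nodes (empty for a leaf EDU).

    def subtrees(tree):
        """Collect every subtree of `tree` as a set of hashable tuples."""
        _label, _start, _end, children = tree
        result = {tree}
        for child in children:
            result |= subtrees(child)
        return result

    def agreement_subtrees(teacher_trees):
        """Return subtrees that appear identically in every teacher's tree."""
        common = subtrees(teacher_trees[0])
        for tree in teacher_trees[1:]:
            common &= subtrees(tree)
        return common

Under this encoding, two teachers agree on a subtree only when the span boundaries, labels, and internal structure all match exactly, so the intersection keeps the most reliable fragments for silver-data pre-training.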