2020 Volume 27 Issue 4 Pages 889-931
Although discourse parsing is fundamental to natural language processing, limited research has been conducted on corpus-based discourse parsing in Japanese. Herein, we construct a Japanese corpus annotated with discourse units, discourse connectives, and discourse relations. We propose four strategies for developing a corpus easily and rapidly: (1) selecting the first three sentences of web documents as the target documents, (2) automatically annotating discourse units and connectives, (3) designing a discourse relation tagset consisting of seven classes organized into a two-level hierarchy, and (4) annotating discourse relations using two types of annotators, namely experts and crowd workers. We report that the annotations produced by crowd workers leave significant room for improvement. Based on this corpus, we develop a Japanese discourse parser. Experimental results show that the proposed parser outperforms previously developed models. We also demonstrate that the automatic recognizer of discourse connectives can be used as a high-quality parser for explicit discourse relations. We implement a recognizer of discourse units and discourse connectives in KNP. We also make the corpus publicly available.