Universal Dependencies for the Corpus of Everyday Japanese Conversation: UD_Japanese-CEJC

大村 舞; 若狭 絢; 松田 寛; 浅原 正幸

doi:10.5715/jnlp.32.55

Abstract

This study reports the construction of UD_Japanese-CEJC, a Universal Dependencies (UD) treebank of spoken Japanese obtained by converting the Corpus of Everyday Japanese Conversation (CEJC) into the UD format. The CEJC is a large-scale spoken language corpus that covers a wide range of everyday Japanese conversations and is annotated with word boundaries and morphological information. For UD_Japanese-CEJC, we additionally annotated the CEJC with long-unit morphological information and bunsetsu (phrase) dependency information. The treebank was then constructed from the CEJC by applying manually refined conversion rules to the morphological information and the bunsetsu-based syntactic dependencies. We examined issues that arise when applying UD to the CEJC by comparing it with a written Japanese corpus and by evaluating UD parsing accuracy.

Licensed under CC BY 4.0
https://creativecommons.org/licenses/by/4.0/
