2025 Volume 32 Issue 1 Pages 55-90
In this study, we report the development and construction of the Universal Dependencies-based Japanese spoken language treebank (UD_Japanese-CEJC), a conversion of the Corpus of Everyday Japanese Conversation (CEJC) into the Universal Dependencies (UD) format. The CEJC is a large-scale spoken language corpus that covers a variety of everyday Japanese conversations and is annotated with word boundaries and morphological information. For UD_Japanese-CEJC, we additionally annotated the CEJC with long-unit word morphological information and phrase (bunsetsu) dependency information. The treebank was then constructed by applying manually refined conversion rules to the CEJC's morphological information and bunsetsu-based syntactic dependencies. We examined various issues related to UD constructions in the CEJC by comparing it with a written Japanese corpus and by evaluating UD parsing accuracy.