End-to-End Conversion from Speech to Written-Style Text for Japanese Parliamentary Meeting Records

三村 正人; 河原 達也

doi:10.5715/jnlp.30.88

Abstract

Because conventional automatic speech recognition (ASR) systems are designed to reproduce utterances faithfully, word by word, their outputs are not necessarily easy to read even when they contain few recognition errors. To address this issue, we propose a novel ASR approach that outputs readable, clean text directly from speech by removing fillers and disfluent regions, replacing colloquial expressions with formal ones, inserting punctuation, recovering omitted particles, and performing other appropriate corrections. We formalize this approach as end-to-end generation of written-style text from speech using a single neural network. We also propose a method to guide the training of this end-to-end model using automatically generated faithful transcripts, as well as a novel speech segmentation strategy based on online punctuation detection. An evaluation using 700 hours of Japanese Parliamentary speech data demonstrates that the proposed direct approach generates clean transcripts suitable for human consumption more accurately and at a faster decoding speed than the conventional cascade approach. We also provide an in-depth analysis of the types of edits performed by professional human editors to create the official written records of Japanese Parliamentary meetings, and evaluate how well the proposed system performs on each edit type.

Licensed under CC BY 4.0
https://creativecommons.org/licenses/by/4.0/
