2024 Volume 31 Issue 3 Pages 935-957
Speech-to-text translation (ST) translates speech in a source language into text in a target language. Because ST deals with different forms of language, it faces a style gap between spoken and written language. The gap lies not only between the input speech and the output text but also between the input speech and the bilingual parallel corpora that are often used in ST. These gaps hinder improvements in ST performance. Spoken-to-written style conversion has been shown to improve cascaded Japanese-English ST by reducing such gaps. Integrating this conversion into end-to-end ST is desirable because end-to-end ST is easier to deploy, more efficient, and less prone to error propagation than cascaded ST. In this study, we construct a large-scale Japanese-English ST dataset in the lecture domain. We also propose a joint task of speech-to-text spoken-to-written style conversion and end-to-end ST, together with an interactive-attention-based multi-decoder model for the joint task, to improve end-to-end ST. Experiments on the constructed dataset show that our model outperforms a strong baseline.