Journal of Natural Language Processing
Online ISSN : 2185-8314
Print ISSN : 1340-7619
ISSN-L : 1340-7619
General Paper (Peer-Reviewed)
End-to-end Simultaneous Speech Translation with Style Tags using Human Simultaneous Interpretation Data
Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura

2025 Volume 32 Issue 2 Pages 404-437

Abstract

Simultaneous speech translation (SimulST) translates speech incrementally and requires a monotonic input-output correspondence to reduce latency. This is particularly challenging for distant language pairs such as English and Japanese, because most SimulST models are trained on offline speech translation (ST) data, in which the entire speech input is available during translation. In simultaneous interpretation (SI), an interpreter translates source-language speech into target-language speech without waiting for the speaker to finish speaking; SI data therefore allows a SimulST model to learn SI-style translations. However, owing to the limited availability of SI data, fine-tuning an offline ST model on SI data alone may result in overfitting. To address this problem, we propose an efficient training method for a speech-to-text SimulST model that combines a small SI corpus with relatively large offline ST data. We trained a single model on the mixed data, incorporating style tags that instruct the model to generate either SI- or offline-style outputs. This approach, called mixed fine-tuning with style tags, can be extended further through multistage self-training, in which the trained model is used to generate pseudo-SI data. Our experimental results on several test sets demonstrated that models trained with mixed fine-tuning and multistage self-training outperformed baselines across various latency ranges.
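As a rough illustration of the data-mixing step behind mixed fine-tuning with style tags, the Python sketch below prepends a style tag to each target sentence before combining the small SI corpus with the larger offline ST corpus. The tag strings (<si>, <off>), record layout, and function names are illustrative assumptions, not the paper's actual preprocessing code.

    import random

    SI_TAG = "<si>"    # hypothetical tag marking SI-style (interpretation) targets
    OFF_TAG = "<off>"  # hypothetical tag marking offline-style (translation) targets

    def tag_targets(pairs, tag):
        """Prepend a style tag to every target sentence.

        pairs: list of (source_audio_path, target_text) tuples.
        """
        return [(src, f"{tag} {tgt}") for src, tgt in pairs]

    def build_mixed_corpus(si_pairs, offline_pairs, seed=0):
        """Combine a small SI corpus with larger offline ST data for
        mixed fine-tuning; shuffling interleaves the two styles so a
        single model sees both during training."""
        mixed = tag_targets(si_pairs, SI_TAG) + tag_targets(offline_pairs, OFF_TAG)
        random.Random(seed).shuffle(mixed)
        return mixed

At inference time, forcing the decoder to begin its output with SI_TAG would request SI-style (more monotonic, lower-latency) translations, while OFF_TAG would request offline-style output; the same tagging scheme could label pseudo-SI data generated by the trained model during the multistage self-training stage.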

© 2025 The Association for Natural Language Processing