2022 Volume 29 Issue 2 Pages 611-637
End-to-end speech translation (ST) is the task of directly translating source-language speech into target-language text. It has the potential to generate better translations than those obtained by simply cascading automatic speech recognition (ASR) with machine translation (MT). We propose cross-lingual transfer learning for end-to-end ST, in which model parameters are transferred from an ST pretraining stage for one language pair to an ST fine-tuning stage for another language pair. Experiments on the CoVoST 2 and multilingual TEDx datasets in many-to-one settings show that our model outperforms a model that uses English ASR pretraining by up to 2.3 BLEU points. An ablation study investigating which layers of the sequence-to-sequence architecture carry information important to transfer demonstrates that the lower layers of the encoder contain language-independent information suitable for cross-lingual transfer. Extensive studies were conducted on (1) the ASR pretraining language, (2) the ST pretraining language pair, (3) multilingual methods, and (4) model sizes. The results demonstrate that (1) using the same language for ASR pretraining as the ST fine-tuning source language yields good performance; (2) a high-resource language pair is a good choice for the ST pretraining language pair; (3) the proposed method works well in conjunction with multilingual methods; and (4) the proposed method operates well across different model sizes.
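The transfer scheme described above, together with the ablation finding that the lower encoder layers carry the language-independent information, can be sketched as a selective parameter copy. This is a minimal illustration, not the paper's actual implementation: the parameter-key format (`encoder.layers.<idx>.…`) and the function and argument names are assumptions for the sake of the example.

```python
def transfer_parameters(pretrained, target, num_encoder_layers=4):
    """Copy the lower encoder layers of a pretrained ST model's parameters
    into a target model's parameters, leaving the upper encoder layers and
    the decoder at the target model's own initialization.

    `pretrained` and `target` are flat dicts mapping parameter names
    (assumed to look like "encoder.layers.3.self_attn.weight") to values.
    """
    transferred = dict(target)  # start from the target's initialization
    for key, value in pretrained.items():
        if key.startswith("encoder.layers."):
            layer_idx = int(key.split(".")[2])
            if layer_idx < num_encoder_layers:
                # Lower encoder layers: language-independent, so transfer them.
                transferred[key] = value
    return transferred


# Toy usage: only encoder layers below the cutoff are taken from `pre`.
pre = {"encoder.layers.0.w": 1.0, "encoder.layers.5.w": 2.0, "decoder.layers.0.w": 3.0}
tgt = {"encoder.layers.0.w": 0.0, "encoder.layers.5.w": 0.0, "decoder.layers.0.w": 0.0}
out = transfer_parameters(pre, tgt, num_encoder_layers=4)
```

Fine-tuning for the new language pair would then continue from `transferred`; the choice of `num_encoder_layers` corresponds to the cut point examined in the ablation study.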