Cross-Lingual Transfer Learning for End-to-End Speech Translation

End-to-end speech translation (ST) is the task of directly translating source-language speech into target-language text. It has the potential to generate better translations than those obtained by simply combining automatic speech recognition (ASR) with machine translation (MT). We propose cross-lingual transfer learning for end-to-end ST, where the model parameters are transferred from an ST pretraining stage for one language pair to an ST fine-tuning stage for another language pair. Experiments on the CoVoST 2 and multilingual TEDx datasets in many-to-one settings show that our model outperforms the model that uses English ASR pretraining by up to 2.3 BLEU points. Through an ablation study investigating which layers of the sequence-to-sequence architecture contain important information to transfer, it was demonstrated that the lower layers of the encoder contain language-independent information for cross-lingual transfer. Extensive studies were conducted on (1) the ASR pretraining language, (2) the ST pretraining language pair, (3) multilingual methods, and (4) model sizes. It was demonstrated that (1) using the same language as the ASR pretraining language and the ST fine-tuning source language results in good performance, (2) a high-resource language pair is a good choice for the ST pretraining language pair, (3) the proposed method works well in conjunction with multilingual methods, and (4) the proposed method can operate with different model sizes.


Introduction
Speech translation (ST) is the task of translating speech in one language into text in another language. There are two main approaches to this task: the cascaded approach (Stentiford and Steer 1988), in which automatic speech recognition (ASR) and machine translation (MT) are chained together, and the end-to-end approach (Duong et al. 2016; Berard et al. 2016), in which a single sequence-to-sequence model directly translates audio signals into the target text.
The cascaded approach has the problem of error propagation: errors produced by ASR are carried forward to MT without correction. There is also a loss of prosody information during ASR, because text cannot retain that kind of information. In contrast, the end-to-end approach does not have an error propagation problem because the target text is produced directly, and prosody information can be used during translation. Therefore, the end-to-end approach has recently gained popularity (Sperber and Paulik 2020; Weiss et al. 2017).
Despite these advantages, the end-to-end approach still suffers from the problem of data scarcity, which can lead to low translation performance. ST corpora contain at most several hundred hours of speech or hundreds of thousands of parallel sentences (Di Gangi et al. 2019), whereas ASR corpora contain a thousand hours of speech (Panayotov et al. 2015) and MT corpora contain billions of parallel sentences (Schwenk et al. 2021).
One method of addressing data scarcity in end-to-end ST is transfer learning from ASR (Bansal et al. 2019; Stoian et al. 2020) or MT (Jia et al. 2019). With this method, the pretrained ASR encoder or MT decoder parameters are used to initialize the ST encoder or decoder, which are then fine-tuned on the ST task. In transfer learning, transferring parameters between similar tasks is better than transferring them between dissimilar tasks (Rosenstein et al. 2005). The difference between the ASR and ST tasks (or the MT and ST tasks) thus leaves room for further improvement.
To address the above problem, we propose cross-lingual transfer learning for end-to-end ST by transferring the parameters of ST models from one language pair to another. That is, an ST model is trained for one language pair at the pretraining stage, and the model is subsequently trained on another language pair in the fine-tuning stage (see stages 2 and 3 in Figure 1). Transfer learning is performed between two ST tasks, which is possibly better than transferring from an ASR or an MT task.
Experiments were conducted on the CoVoST 2 dataset (Wang et al. 2021) and multilingual TEDx dataset (Salesky et al. 2021) in many-to-one settings, where English was the target language. Our model outperformed the model that uses English ASR pretraining by up to 2.3 BLEU points.
An ablation study was conducted to determine which layers of the sequence-to-sequence architecture contain important information for transfer: some of the layers were frozen during the ST fine-tuning stage. It was determined that the lower layers of the encoder contain language-independent information in many-to-one settings.
Extensive studies were conducted on (1) the ASR pretraining language, (2) the ST pretraining language pair, (3) multilingual methods, and (4) model sizes. It was demonstrated that: (1) Using the same language for ASR pretraining and as the ST fine-tuning source language results in good performance.
(2) A high-resource language pair is a good choice for the ST pretraining language pair.
(3) The proposed method works well when used with multilingual methods.
(4) The proposed method can work with different model sizes.

End-to-End ST
End-to-end ST is an active research area because of its potential to reduce error propagation and exploit prosody information (Sperber and Paulik 2020). Direct modeling of speech-to-text alignment was first proposed by Duong et al. (2016) and then applied to end-to-end ST (Berard et al. 2016). Although the end-to-end approach has good potential, cascaded systems often outperformed end-to-end systems until recently (e.g., Salesky and Black (2020)). Some recent work shows the superiority of end-to-end systems over cascaded ones (Li et al. 2021), and the performance gap between the two is now closing (Bentivogli et al. 2021).
Transfer learning is one of the major ways to address the data scarcity issue in end-to-end ST.
Attempts have been made to use transfer learning from ASR (Bansal et al. 2019; Bahar et al. 2019a). Multilingual ST corpora have also emerged, such as a one-to-many corpus (Di Gangi et al. 2019), many-to-one and one-to-many corpora (Wang et al. 2021), and a non-English-centric corpus (Salesky et al. 2021), along with pretrained models (Li et al. 2021). Although implicit knowledge transfer might happen in multilingual ST, a further step is required to conduct cross-lingual transfer explicitly.

Transfer Learning for Multilingual MT
Transfer learning is a learning framework that aims to improve the learning of the target predictive function using the knowledge in the source domain and tasks (Pan and Yang 2010).
This has been shown to be effective in the field of multilingual MT, especially when the knowledge is transferred from a high-resource language pair to a low-resource language pair (Dabre et al. 2020). Previous work has focused on transferring lexical knowledge (e.g., Zoph et al. (2016)) and syntactic knowledge (e.g., Murthy et al. (2019)). In addition, language relatedness has been studied in transfer learning for multilingual MT (e.g., Dabre et al. (2017)). To the best of our knowledge, our work is the first to apply this kind of cross-lingual transfer, studied in multilingual MT, to speech translation.

Preliminaries
We formulate both end-to-end ST and ASR as speech-to-text tasks because they both convert speech input into text output. Figure 2 presents an overview of the speech-to-text task during inference.
In a speech-to-text task, there is a set of speech-text pairs, denoted as $S = \{(s^{(k)}, t^{(k)})\}_{k=1}^{K}$, comprising a source speech $s$ and target text $t$, where $K$ is the number of training utterances in the ASR or ST corpus. Hereafter, the superscript $k$ is dropped for conciseness. $s$ and $t$ are from the same language if the task is ASR, but from different languages if the task is ST. Here, $s = (s_1, \ldots, s_M)$ is the sequence of acoustic signals in the utterance, and $t = (t_1, \ldots, t_N)$ is the sequence of tokens in the utterance, where $M$ is the number of audio samples and $N$ is the number of tokens in the utterance. Each utterance usually corresponds to one sentence but can correspond to two or more sentences.
The sequence of audio signals $s$ is processed into a sequence of speech features $s'$ (i.e., a filterbank representation (Purwins et al. 2019)). First, a sequence of frames with a constant length and stride is obtained from the original signal. The discrete Fourier transform, Mel-scale conversion, and log conversion are then applied sequentially to the frames, yielding a sequence of $d$-dimensional vectors, called a $d$-dimensional log-Mel filterbank (Fayek 2016). These vectors are fed into a one-dimensional convolutional subsampler (i.e., a CNN) before being passed to the sequence-to-sequence architecture.
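To make the pipeline concrete, the following is a minimal sketch of the feature extraction and subsampling steps, assuming an 80-dimensional log-Mel filterbank computed with torchaudio and a two-layer stride-2 convolutional subsampler; the file name, channel sizes, and kernel widths are illustrative, not the paper's exact configuration.

```python
# A minimal sketch of the feature pipeline described above.
import torch
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")  # hypothetical file

# 25 ms frames with a 10 ms stride -> 80-dimensional log-Mel filterbank.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=80, frame_length=25.0, frame_shift=10.0
)  # shape: (num_frames, 80)

# One-dimensional convolutional subsampler: each stride-2 layer halves
# the time resolution before the Transformer encoder.
subsampler = torch.nn.Sequential(
    torch.nn.Conv1d(80, 256, kernel_size=5, stride=2, padding=2),
    torch.nn.GELU(),
    torch.nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2),
    torch.nn.GELU(),
)
features = subsampler(fbank.T.unsqueeze(0))  # (1, 256, ~num_frames / 4)
```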
The objective of training a sequence-to-sequence model is to maximize the log-likelihood

$$\sum_{(s', t) \in S'} \log p(t \mid s'),$$

where $S' = \{(s', t)\}$. A target utterance is predicted one token at a time:

$$p(t \mid s') = \prod_{i=1}^{N} p(t_i \mid t_{<i}, s'),$$

where $t_i$ is the $i$-th predicted token and $t_{<i}$ are the previously predicted tokens.
The Transformer (Vaswani et al. 2017) is one of the most popular sequence-to-sequence architectures for speech-to-text tasks, where encoder layers compute the self-attention of a source sequence and decoder layers compute the cross-attention between a source and a target sequence.
The $N_e$ encoder layers and $N_d$ decoder layers are stacked to constitute an encoder and a decoder, respectively. Positional encoding is applied to the embedding vectors before they are fed into the encoder and the decoder, so that the model can capture the order of the sequence. An encoder layer comprises a self-attention layer, a feed-forward layer, and a normalization layer. A decoder layer comprises a self-attention layer, a cross-attention layer, a feed-forward layer, and a normalization layer. The output of the final decoder layer leads to a linear layer and a softmax layer to produce the output probability distribution.
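As a minimal sketch, the backbone described above can be assembled from PyTorch's built-in Transformer modules. The layer counts and dimensions below follow the CoVoST 2 configuration reported in the experiments (12 encoder layers, 6 decoder layers, hidden size 256, 4 attention heads); the feed-forward and vocabulary sizes are assumptions.

```python
# A minimal sketch of the sequence-to-sequence backbone.
import torch.nn as nn

st_model = nn.Transformer(
    d_model=256,
    nhead=4,
    num_encoder_layers=12,   # N_e
    num_decoder_layers=6,    # N_d
    dim_feedforward=1024,    # assumption: not stated in the paper
    batch_first=True,
)
# The final decoder states go through a linear layer and a softmax to
# produce the output distribution over the vocabulary.
output_projection = nn.Linear(256, 1000)  # vocabulary size is illustrative
```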

Cross-Lingual Transfer Learning
An overview of the proposed method is shown in Figure 1. Three speech-text sets were used: $S_1 = \{(s_1, t_1)\}$ is for ASR, which means $s_1$ and $t_1$ are from the same language; $S_2 = \{(s_2, t_2)\}$ and $S_3 = \{(s_3, t_3)\}$ are for ST, which means $s_2$ and $t_2$, and $s_3$ and $t_3$, are from different languages. We transfer parameters from one language pair, $S_2$, to another, $S_3$. The focus is on many-to-one translation, which means that $s_2$ and $s_3$ are from different languages, whereas $t_2$ and $t_3$ are from the same language.

Stage 1: ASR Pretraining
ASR is first trained using $S_1$ (e.g., En-En in Figure 1) to initialize the ST encoder, because pretraining on high-resource ASR can improve low-resource ST (Bansal et al. 2019). The pretrained encoder accounts for most of the improvement from this method, and using only the encoder is a widely accepted approach to improving ST performance (Wang et al. 2021; Salesky et al. 2021); therefore, only the encoder parameters are transferred from stage 1 to stage 2. Training is stopped when the loss on an ASR development set of $S_1$ does not improve for a predefined number of epochs.
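A minimal sketch of this encoder-only transfer is shown below, assuming st_model is a Transformer like the one sketched earlier and that checkpoint keys are prefixed with subsampler. and encoder. (a typical PyTorch layout; the actual key names depend on the implementation).

```python
# A minimal sketch of the stage 1 -> stage 2 parameter transfer: only the
# CNN subsampler and encoder weights from the ASR checkpoint are copied
# into the ST model; the decoder keeps its random initialization.
import torch

asr_state = torch.load("asr_stage1_best.pt")["model"]  # hypothetical path
st_state = st_model.state_dict()
for name, weight in asr_state.items():
    if name.startswith(("subsampler.", "encoder.")) and name in st_state:
        st_state[name] = weight
st_model.load_state_dict(st_state)
```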

Stage 2: ST Pretraining
The pretrained ASR encoder from stage 1 is loaded, and the ST model is trained using $S_2$ (e.g., Fr-En in Figure 1). In other words, the parameters of the CNN, self-attention, feed-forward, and normalization layers in the stage-1 ASR encoder are passed to the ST encoder at this stage for initialization. Subsequently, ST training is performed using $S_2$. Training is stopped when the loss on an ST development set of $S_2$ does not improve for a predefined number of epochs.

Stage 3: ST Fine-tuning
Finally, the pretrained ST encoder and decoder from stage 2 are loaded, and the ST model is trained on another language pair $S_3$ (e.g., Es-En in Figure 1). Note that the language pair used in stage 2 is not reused in stage 3. In the initialization of the ST encoder, the CNN, self-attention, feed-forward, and normalization layers are shared, as in the transition from stage 1 to stage 2. For the decoder, in addition to the self-attention, cross-attention, feed-forward, and normalization layers, the vocabulary must also be shared with stage 2 so that training can continue with the same structure. The same vocabulary as in stage 2 is used in stage 3 because the target language is always the same in our experiments. The CNN layers before the encoder, and the linear and softmax layers after the decoder, are also initialized from stage 2. After initialization, ST fine-tuning is performed using $S_3$. Training is stopped when the loss on an ST development set of $S_3$ does not improve for a predefined number of epochs.
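The stage-3 initialization amounts to loading the full stage-2 model and re-creating the optimizer so that its state starts fresh, as in the following sketch; the file name and optimizer hyperparameters are illustrative.

```python
# A minimal sketch of stage 3: all stage-2 parameters (CNN, encoder,
# decoder, output projection) are loaded, and a fresh optimizer is
# created so that its state (and the scheduler) is reset.
import torch

checkpoint = torch.load("st_stage2_best.pt")  # hypothetical path
st_model.load_state_dict(checkpoint["model"])
optimizer = torch.optim.Adam(st_model.parameters(), betas=(0.9, 0.98))
# Fine-tune on S3, stopping when the S3 development loss stops improving.
```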

CoVoST 2
In the experiments, we used CoVoST 2 (Wang et al. 2021), a large-scale multilingual ST dataset that covers translations from various languages to English (many-to-one) and from English to various languages (one-to-many). The many-to-one portion was used, and 13 languages in addition to English were chosen, following previous work (Wang et al. 2020). We conducted experiments in many-to-one settings mainly to facilitate the case study.
The number of utterances for each language pair is given in Table 1. We report results for the high-resource language pairs listed in Table 1 and present the results for low-resource language pairs in Appendix B. The English ASR data included in the dataset were also used; their statistics are shown in Appendix A (Table 12).

Multilingual TEDx
The multilingual TEDx (mTEDx) dataset (Salesky et al. 2021), which includes ASR data in 8 languages and ST data in 13 language pairs, was also used in this study. Although the CoVoST 2 dataset includes many language pairs, most of them contain only one or two hours of speech, and the performance of the proposed method is unclear on such pairs. Therefore, experiments were also conducted on this dataset, whose language pairs contain at least 11 hours of speech. Although the mTEDx dataset contains various source and target languages other than English, we focus on translation tasks where the target language is English. We also focus on the high-resource language pairs shown in Table 1 and report the results for low-resource language pairs in Appendix B. The number of utterances for each language pair used is given in Table 1. The statistics of the ASR data are shown in Appendix A (Table 14).

ST Experiments
The baseline for our method is to use English as the ASR source language and then conduct ST fine-tuning on a language pair (Wang et al. 2020). We compare this with the proposed method, in which we first train English ASR, then train ST on one language pair, and finally fine-tune the model on another language pair. These experiments were conducted on both the CoVoST 2 and mTEDx datasets.

Settings
Different settings were applied for the CoVoST 2 and the mTEDx datasets, following the specifications in (Wang et al. 2021) for the CoVoST 2 experiments and in (Salesky et al. 2021) for the mTEDx experiments.
The common settings for both datasets are as follows. SacreBLEU (Post 2018) was used to calculate BLEU (the higher the better) (Papineni et al. 2002), chrF2 (character n-gram F-score; the higher the better) (Popović 2015), and TER (translation edit rate; the lower the better) (Snover et al. 2006) scores. Statistical significance tests (Koehn 2004) were conducted using SacreBLEU by randomly resampling outputs 1,000 times, with the number of samples set equal to the number of test utterances in each dataset. One system was regarded as significantly superior to the other when the p-value was smaller than 0.05.
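As an illustration of this evaluation protocol, the sketch below computes the three metrics with SacreBLEU's Python API and runs a simple paired bootstrap resampling test. The paper uses SacreBLEU's own significance testing, so this hand-rolled bootstrap is only indicative.

```python
# A minimal sketch of the evaluation, assuming hyps_a/hyps_b/refs are
# equal-length lists of sentence strings.
import random
from sacrebleu.metrics import BLEU, CHRF, TER

bleu, chrf, ter = BLEU(), CHRF(), TER()  # CHRF defaults to chrF2 (beta=2)

def scores(hyps, refs):
    return (bleu.corpus_score(hyps, [refs]).score,  # higher is better
            chrf.corpus_score(hyps, [refs]).score,  # higher is better
            ter.corpus_score(hyps, [refs]).score)   # lower is better

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000):
    """Fraction of resamples in which system A beats system B in BLEU."""
    wins, n = 0, len(refs)
    for _ in range(n_samples):
        idx = random.choices(range(n), k=n)  # resample with replacement
        a, b = [hyps_a[i] for i in idx], [hyps_b[i] for i in idx]
        r = [refs[i] for i in idx]
        if bleu.corpus_score(a, [r]).score > bleu.corpus_score(b, [r]).score:
            wins += 1
    return wins / n_samples
```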
CoVoST 2 In the CoVoST 2 experiments, the parameters were set according to (Wang et al. 2021). For text preprocessing, SentencePiece (Kudo and Richardson 2018) was used to construct a character vocabulary for each language. For training, we used a Transformer with 12 encoder layers and 6 decoder layers, where the hidden dimension size was 256 and the number of attention heads was 4. Cross-entropy loss with label smoothing was used as the loss function.
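A minimal sketch of the vocabulary construction with the SentencePiece trainer follows; the input file names are placeholders, and hard_vocab_limit=False lets the character model's final vocabulary fall below the requested size when the character inventory is smaller.

```python
import sentencepiece as spm

# CoVoST 2: character-level vocabulary per language.
spm.SentencePieceTrainer.train(
    input="train.de.txt", model_prefix="spm_char_de",
    model_type="char", vocab_size=1000, hard_vocab_limit=False,
)

# mTEDx (see below): unigram vocabulary of size 1,000 per language.
spm.SentencePieceTrainer.train(
    input="train.es.txt", model_prefix="spm_uni_es",
    model_type="unigram", vocab_size=1000,
)
```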
For ASR training (stage 1), the Adam optimizer with an inverse-square-root learning rate scheduler was adopted. The learning rate increased linearly from 0 to 0.001 over 10,000 updates and then decayed in proportion to the inverse square root of the number of updates. Training was stopped after 100,000 updates, and the parameters were averaged over the last 10 epochs. For ST pretraining (stage 2), we also used the Adam optimizer with an inverse-square-root learning rate scheduler, except that the learning rate increased to 0.002 before decaying. Training was stopped after no improvement in the development set loss was observed for 10 epochs, and the best checkpoint was used for evaluation (for the baseline) or for the following stage (for the proposed method). For ST fine-tuning (stage 3), the same optimizer and learning rate scheduler as in stage 2 were used, and the optimizer state (the learning rate scheduler, meters, and DataLoader) was reset when starting this stage. Training was stopped after no improvement in the development set loss was observed for 10 epochs, and the best checkpoint was used for evaluation.
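The inverse-square-root schedule used in all three stages can be written compactly as follows; this is a sketch, and library implementations may differ in bookkeeping details.

```python
def inverse_sqrt_lr(update, peak_lr=0.001, warmup_updates=10_000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(update)."""
    if update < warmup_updates:
        return peak_lr * update / warmup_updates
    return peak_lr * (warmup_updates / update) ** 0.5

# peak_lr is 0.001 for ASR pretraining (stage 1) and 0.002 for the ST
# stages (stages 2 and 3).
```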
For the CoVoST 2 experiments, the language pair at stage 2 was Fr-En, the highest-resource language pair in the corpus. We conduct an extensive study on which language pair to choose at stage 2 using the mTEDx corpus.
mTEDx The settings in (Salesky et al. 2021) were adopted for the mTEDx experiments. Because there are no English ASR data in the mTEDx dataset, the CoVoST 2 English data were used for stage 1. For text preprocessing, a unigram vocabulary with a size of 1,000 was used for each language. For training, we used a Transformer with 6 encoder layers and 3 decoder layers, where the hidden dimension size was 256 and the number of attention heads was 4.
Cross-entropy loss with label smoothing (smoothing parameter 0.1) was used as the loss function.
For all training in stages 1, 2, and 3, the Adam optimizer was used with an inverse-square-root learning rate scheduler, where the learning rate increased to 0.002 over 10,000 updates and then decayed. Dropout was applied with a rate of 0.3, and gradients were clipped at a threshold of 10.0. Training was stopped when no improvement in the development set loss was observed for 10 epochs, when the number of training steps reached 60,000, or when the number of epochs reached 200. The checkpoints were averaged over the last 10 epochs and used for evaluation or for the following stage.
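As a small sketch of the criterion and gradient clipping described above (pad_id is a placeholder for the padding token index):

```python
import torch
import torch.nn as nn

pad_id = 0  # assumption: index of the padding token in the vocabulary
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=pad_id)

# In the training loop, after loss.backward():
torch.nn.utils.clip_grad_norm_(st_model.parameters(), max_norm=10.0)
```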
Four high-resource language pairs were selected as candidates for stage 2: Es-En, Pt-En, Fr-En, and It-En. The low-resource language pairs (Ru-En and El-En) were not used because there would be little knowledge to transfer from a low-resource language pair to a high-resource language pair. We report the results using Es-En as the main results in the following section (5.2.2) and report the other results in section 6.3.

Results
Table 2 shows the results on the CoVoST 2 dataset. Comparing the baseline (without $S_2$) with the proposed method (with $S_2$), the proposed method outperformed the baseline for the two high-resource language pairs; for Es-En in particular, the BLEU score improved by 1.6 points. On the mTEDx dataset (Table 3), the results of the proposed method were significantly superior to the baseline for Pt-En and It-En; for It-En in particular, the BLEU score improved by 2.3 points. For Fr-En, the performance was comparable to that of the baseline.
It might be that the similarity between the source languages of stages 2 and 3 influences ST performance, given that the transfers from Fr-En to Es-En (CoVoST 2) and from Es-En to It-En (mTEDx) showed the best performance, and that French, Spanish, and Italian are all Romance languages. However, no improvement was observed on the mTEDx Fr-En data with the proposed method, which does not support this assumption. Language similarity is difficult to define, and further exploration is required.

Manual Evaluation
We manually checked the translation quality on the CoVoST 2 Es-En and mTEDx Pt-En corpora. We sampled 50 sentences from the test set of each corpus and checked the translations of the baseline and proposed methods, evaluating them in terms of adequacy and fluency on a scale of 0 to 5. For adequacy, we checked whether the predicted utterances convey the meaning of the reference utterances; for fluency, we checked the naturalness of the translations. The mean and standard deviation of the scores were then calculated.
The results are presented in Table 4. The translations generated by the proposed method were more accurate and natural than the baseline translations, as indicated by the scores.
(Table 5: Translation examples from the CoVoST 2 Es-En and the mTEDx Pt-En corpora. The transcriptions and references are included in the corpora; the other outputs are predicted by the models. The same color corresponds to the same meaning.)
Table 5 shows examples from the CoVoST 2 Es-En and the mTEDx Pt-En corpora. In the first example, although both the baseline and the proposed method correctly predicted "a couple of beers", only our model correctly predicted "sure" and "let me", and the translation generated by the proposed method also reads more naturally. In the second example, the proposed method correctly predicted "only" and also correctly translated "queria", the past tense of "querer", as "wanted".

Discussion
We discuss which layers are important in cross-lingual transfer in section 6.1, the ASR pretraining language in section 6.2, the ST pretraining language pair in section 6.3, relation to the multilingual models in section 6.4, and model sizes in section 6.5.

Ablation: Freezing Layers
To investigate which layer of the Transformer contains important information to transfer, we froze some layers of the ST encoders and decoders in stage 3. That is, for some of the layers at stage 3, the layers from the loaded checkpoint of stage 2 were used without updating the parameters.
CoVoST 2 was used for these experiments, and performance was investigated with French as the source language at stage 2. We froze 6 layers of the encoder (layers 1-6 or 7-12), all 12 encoder layers, or all 6 decoder layers, and chose the best epoch with respect to the development set loss for evaluation.
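A minimal sketch of the freezing procedure follows, assuming the encoder and decoder expose their layers as a ModuleList; indexing is 0-based, so range(0, 6) corresponds to layers 1-6 in the paper's numbering.

```python
def freeze_layers(layers, layer_ids):
    """Exclude the given layers from gradient updates during fine-tuning."""
    for i in layer_ids:
        for param in layers[i].parameters():
            param.requires_grad = False

freeze_layers(st_model.encoder.layers, range(0, 6))    # encoder layers 1-6
# freeze_layers(st_model.encoder.layers, range(6, 12)) # encoder layers 7-12
# freeze_layers(st_model.decoder.layers, range(0, 6))  # all decoder layers
```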
We show the results when $S_3$ is De-En or Es-En, the high-resource language pairs, in Table 6. The first row shows the results of the proposed method (also shown in Table 2); the remaining rows show the results of this experiment. Comparing the results of freezing layers 1-6 or 7-12 of the encoder with those of freezing layers 1-12, the BLEU scores decrease as the number of frozen layers increases. Comparing the results of freezing layers 1-6 of the encoder with those of freezing layers 7-12, the scores decreased significantly when layers 7-12 were frozen. That is, the parameters in the higher layers appear to be updated more than those in the lower layers during ST fine-tuning. This suggests that the higher layers of the ST encoder are more language-specific, whereas the lower layers are likely to contain more language-independent information. Regarding the comparison between the encoder and decoder, performance was mostly better when freezing the 6 decoder layers than when freezing encoder layers. This indicates that the encoder plays a more important role than the decoder when transferring parameters between the ST pretraining and fine-tuning stages. Note that these statements hold only for many-to-one settings.

Using the ST Source Language for ASR Pretraining
In the experiments so far, English was used for ASR pretraining, which differs from the ST source language. In an end-to-end ST task, only a pair of source speech and target text is required; however, a triplet of source speech, source transcription, and target text is usually available. Using different languages for the ASR and ST tasks is useful when the ASR language is a high-resource language and transcriptions of the ST source language are not available (Bansal et al. 2019). Here, we examine instead using the ST source language for ASR pretraining.
The mTEDx dataset was used for the experiments. The settings were the same as those described in section 5.2.1, except that the ASR language was set to be the same as the ST source language in the ST fine-tuning stage. We used Es-En as the language pair for ST pretraining because it is the highest-resource language pair. The results are shown in Table 7. Significance testing showed that the proposed method underperformed the baseline on Pt-En; there were no significant differences on Fr-En and It-En. We believe this is because of the problem known as catastrophic forgetting, where neural networks "forget" the information of the first task after training on a second task. That is, the information about the source language of $S_3$ learned at stage 1 was likely lost during the training of stage 2. We conclude that the ASR pretraining method is better than our method when a sufficient amount of source-language ASR data is available.
(Table 7: BLEU, chrF2, and TER scores of the experiments using the ST source language as the ASR language on the mTEDx test set. Rows marked "B" are the baseline results; rows marked "P" are the results of the proposed method. †: significantly superior to the other score (p < 0.05).)

What Language Pair to Choose at the ST Pretraining Stage
We reported the results using Es-En as the ST pretraining language pair as the main results in section 5.2.2. Here, we conduct an extensive study of the effect of using different language pairs at stage 2.
The mTEDx dataset was used for the experiments. English was used as the ASR pretraining language. Then, stage 2 was trained with four high-resource language pairs, Es-En, Pt-En, Fr-En, and It-En. Finally, the models were fine-tuned on the three language pairs, which were not used in stage 2. For all the experiments, the settings were the same as those described in section 5.2.1.
The results are presented in Table 8. When $S_3$ is Pt-En, Fr-En, or It-En, the best performance was obtained with Es-En as $S_2$, presumably because Es-En is the highest-resource language pair in the mTEDx dataset.
When $S_3$ is Es-En, using Fr-En as $S_2$ yielded the best performance. Considering that the data sizes of Pt-En and Fr-En do not differ significantly (Table 1), factors other than data size appear to be involved; this will be investigated in future work.

Multilingual Models and Application of the Proposed Method
We have proposed using an additional ST corpus at stage 2. However, this may not be a fair comparison because an additional corpus is used. Therefore, we conducted experiments within the multilingual training framework (Inaguma et al. 2019): after stage 1, we jointly trained the model on the language pairs used at stages 2 and 3.
First, a multilingual vocabulary of size 8,000 was built from all language pairs in the mTEDx dataset, following (Salesky et al. 2021). Then, using the En ASR model described in section 5.2.1 to initialize the encoder, we trained multilingual models on two language pairs. Language tags such as <es> or <fr> were prepended to the target utterances so that they become the first tokens the decoder generates (see the sketch below). We used the validation sets of the two language pairs and monitored the losses each epoch; training was stopped when either validation set met the condition described in section 5.2.1. During decoding, the first token (the language tag) was skipped. Other settings were the same as those described in section 5.2.1.
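A minimal sketch of the tagging step is given below; the tag strings follow the paper, while the tokenized list representation is an assumption.

```python
def add_language_tag(target_tokens, lang):
    """Prepend a language tag so it is the first token the decoder generates."""
    return [f"<{lang}>"] + target_tokens

print(add_language_tag(["it", "works"], "es"))  # ['<es>', 'it', 'works']
```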
As an extension of our cross-lingual transfer learning framework to this multilingual method, we also conducted experiments regarding the multilingual training as stage 2. That is, we further fine-tuned the multilingual model using a single language pair at stage 3. We used the same settings as the ones described in the previous paragraph.
The results are shown in Table 9. The proposed method outperformed the multilingual baseline in almost all evaluations. Comparing these results with those shown in Table 3, the multilingual baseline outperformed the proposed method without multilingual ST pretraining. We conclude that, given the same data, the multilingual method outperforms the proposed method without multilingual ST pretraining, but it can also be combined with our method to further improve performance.

Using the ST Source Language for ASR Pretraining with Multilingual Models
Based on the discussion in section 6.2, fine-tuning multilingual models using the same $s_1$ as $s_3$ should produce the best scores. Under this assumption, using the pretrained ASR models described in section 6.2, we trained multilingual models on the translation target language pair and another language pair (e.g., Es-En and Pt-En), and then fine-tuned them on one language pair (e.g., Es-En). We compare the performance of this method with that of the model without fine-tuning.
The results presented in Table 10 demonstrate that the proposed method performs well in this setting too. Comparing Tables 9 and 10, using the same $s_1$ as $s_3$ is better than using En as $s_1$, despite the difference in corpus size. Significance tests were also performed comparing the corresponding "Proposed" rows of Tables 9 and 10. The scores in Table 10 are superior to those in Table 9 (p < 0.05), except for the BLEU and TER scores of It-En when Es-En was also used for the multilingual training at stage 2.
(Table 9: BLEU, chrF2, and TER scores of the multilingual experiments on the mTEDx test set when En is used as $s_1$. Rows marked "B" show the multilingual baseline without fine-tuning; rows marked "P" show the results with fine-tuning. †: significantly superior to the other score (p < 0.05).)
(Table 10: BLEU, chrF2, and TER scores of the multilingual experiments on the mTEDx test set, using the same $s_1$ as $s_3$. Rows marked "B" show the multilingual baseline without fine-tuning; rows marked "P" show the results with fine-tuning. †: significantly superior to the other score (p < 0.05).)

Changing Model Size
In the mTEDx experiments thus far, we have used models with 6 encoder layers and 3 decoder layers following (Salesky et al. 2021). To investigate how changing the model size affects the performance of our method, we conducted experiments using models with 12 encoder layers and 6 decoder layers, which is the same setting as the one used for the CoVoST 2 dataset.
The results are presented in Table 11. Increasing the model size significantly improved the ST performance when high-resource language pairs were used. Our method outperformed the baseline also in this setting, which means that the method can work with different model sizes.

Conclusion
We proposed a method for end-to-end ST with cross-lingual transfer learning, and showed that our method is effective in many-to-one settings when English is used as the ASR pretraining language. Through an ablation study investigating which Transformer layers contain important information to transfer, it was demonstrated that the lower layers of the encoder are likely to contain language-independent information for cross-lingual transfer. Extensive studies were conducted on (1) ASR pretraining language, (2) ST pretraining language pair, (3) multilingual methods, and (4) model sizes. It was demonstrated that (1) Using the same language as the ASR pretraining language and the ST fine-tuning source language results in good performance.
(2) A high-resource language pair is a good choice for the ST pretraining language pair. (3) The proposed method works well when used in conjunction with multilingual methods. (4) The proposed method can operate with different model sizes. In the future, the effectiveness of cross-lingual transfer learning will be explored in other paradigms for multilingual ST, namely one-to-many and many-to-many ST.

A ASR Statistics and Results
Here, the statistics of the ASR datasets used for ASR pretraining and the results of the ASR experiments are reported. Table 12 lists the statistics of the CoVoST 2 English ASR data. Table 13 lists the results of the ASR experiments on the CoVoST 2 dataset in terms of word error rate (WER). Table 14 lists the statistics of the mTEDx ASR data used in the experiments. Table 15 lists the results of the ASR experiments on the mTEDx dataset in WER.

B Data Statistics and Experimental Results of Low-resource Language Pairs
In the main text, the statistics and results of high-resource language pairs were presented.
Here, the statistics of low-resource language pairs are presented. Table 16 shows the dataset statistics of the low-resource language pairs in the CoVoST 2 and mTEDx datasets.