2022 Volume 30 Pages 388-396
Video-guided machine translation (VMT) is a type of multimodal machine translation that uses information from videos to guide translation. However, in the VMT 2020 challenge, adding videos improved the performance of VMT models only marginally over their text-only baselines. In this study, we systematically analyze why videos did not guide translation. Specifically, we evaluate the models in input degradation and visual sensitivity experiments and compare the results with a human evaluation on VATEX, the dataset used in the VMT 2020 challenge. The results indicate that the short and straightforward video descriptions in VATEX are sufficient to perform the translations, which renders the videos redundant in the process. Based on our findings, we provide suggestions for the design of future VMT datasets. Our code and human-evaluated data are publicly available for future research.