Why Videos Do Not Guide Translations in Video-guided Machine Translation? An Empirical Evaluation of Video-guided Machine Translation Dataset

Zhishen Yang; Tosho Hirasawa; Mamoru Komachi; Naoaki Okazaki

doi:10.2197/ipsjjip.30.388

Keywords: natural language processing, multimodal machine translation, video-guided machine translation, machine translation

JOURNAL FREE ACCESS

2022 Volume 30 Pages 388-396

Abstract

Video-guided machine translation (VMT) is a type of multimodal machine translation that uses information from videos to guide translation. However, in the VMT 2020 challenge, adding videos only marginally improved the performance of VMT models compared to their text-only baselines. In this study, we systematically analyze why videos did not guide translation. Specifically, we evaluate the models in input degradation and visual sensitivity experiments and compare the results with a human evaluation using VATEX, which is the dataset used in the VMT 2020 challenge. The results indicate that short and straightforward video descriptions in VATEX are sufficient to perform the translations, which renders the videos redundant in the process. Based on our findings, we provide suggestions on the design of future VMT datasets. Code and human-evaluated data are publicly available for future research.
