IEICE Transactions on Information and Systems
Online ISSN : 1745-1361
Print ISSN : 0916-8532
Regular Section
An Efficient Multimodal Aggregation Network for Video-Text Retrieval
Zhi LIU, Fangyuan ZHAO, Mengmeng ZHANG

2022, Volume E105.D, Issue 10, Pages 1825-1828

Abstract

In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, the proposed network introduces NetVLAD, which has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Furthermore, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with recent work.

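To illustrate the aggregation step described in the abstract, the snippet below is a minimal NetVLAD layer in PyTorch. It is an illustrative sketch rather than the authors' implementation; the feature dimension (512), number of clusters (8), and tensor shapes are assumptions chosen for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Aggregates per-frame features into one clip-level descriptor."""
    def __init__(self, feature_dim=512, num_clusters=8):
        super().__init__()
        # Soft-assignment projection and learnable cluster centers
        self.assign = nn.Linear(feature_dim, num_clusters)
        self.centers = nn.Parameter(torch.randn(num_clusters, feature_dim) * 0.01)

    def forward(self, x):
        # x: (batch, num_frames, feature_dim) frame-level features
        soft_assign = F.softmax(self.assign(x), dim=-1)             # (B, T, K)
        residuals = x.unsqueeze(2) - self.centers                   # (B, T, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)   # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                            # intra-normalization
        vlad = F.normalize(vlad.flatten(1), dim=-1)                 # final L2 norm, (B, K*D)
        return vlad

# Usage: aggregate 30 frame features of dimension 512 into one video descriptor
frames = torch.randn(2, 30, 512)          # hypothetical batch of frame features
video_vec = NetVLAD()(frames)             # shape (2, 4096)
```

In a retrieval pipeline of this kind, the aggregated video descriptor and the CLIP text embedding would typically be projected into a shared space and compared with cosine similarity; the exact similarity calculation used by the paper is not reproduced here.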
© 2022 The Institute of Electronics, Information and Communication Engineers