Article ID: 2024EAP1080
Large-scale image pre-training models have recently demonstrated strong representation capabilities for spatial information. Prior works apply these models to video action recognition through full fine-tuning, which is expensive and resource-intensive. To reduce computational costs, some studies have shifted their focus to parameter-efficient fine-tuning methods. However, existing parameter-efficient fine-tuning methods leave the multi-scale information in videos largely unexplored. In this work, the Multi-scale Spatio-temporal Adapter (MST-Adapter) is proposed for parameter-efficient image-to-video transfer learning. By freezing the pretrained model and adding lightweight adapters, we only need to update a small number of parameters, which is highly efficient. In addition, extensive experiments on two video action recognition benchmarks show that our method learns high-quality spatio-temporal video representations and achieves competitive or even better performance than prior works.
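The core idea of the abstract (freeze the pretrained backbone and train only lightweight adapters) can be sketched as follows. This is a minimal, hypothetical bottleneck-adapter example in NumPy; the actual MST-Adapter is a multi-scale spatio-temporal design whose details are not given in this abstract, and all sizes and names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch: a residual bottleneck adapter applied to a frozen
# backbone feature. Only W_down and W_up would be trained; the backbone
# weights that produced x stay frozen.

rng = np.random.default_rng(0)
d_model, d_bottleneck = 8, 2              # illustrative dimensions

W_down = rng.standard_normal((d_model, d_bottleneck)) * 0.02  # trainable
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as identity

def adapter(x):
    """Down-project -> ReLU -> up-project, with a residual connection."""
    h = np.maximum(x @ W_down, 0.0)       # down-projection + nonlinearity
    return x + h @ W_up                   # up-projection + residual add

x = rng.standard_normal((1, d_model))     # a frozen backbone feature
y = adapter(x)

# Because W_up is zero-initialized, the adapter is an identity map before
# training, so the pretrained model's behavior is preserved at the start.
print(np.allclose(y, x))  # True
```

Zero-initializing the up-projection is a common choice in adapter-style transfer learning, since it lets training begin exactly from the frozen pretrained model's outputs.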