2025, Vol. 91, No. 1, pp. 81-88
Multimodal models that integrate modalities such as images and language have advanced rapidly, with large-scale general-purpose models such as OFA, Kosmos-2, and Unified-IO attracting particular attention in the vision domain. By combining image and language features, these models outperform image-only models across a wide range of vision tasks. Their performance gains, however, are driven by ever-larger training datasets and model sizes, which inflate training costs; moreover, the non-disclosure of training procedures and datasets hinders fine-tuning, reproducibility, and fair evaluation. To address these concerns, this study proposes a lightweight large-scale Vision & Language multimodal model that keeps the weights of pretrained encoders frozen. We introduce a multitask training approach that is efficient in resource-limited settings and relies solely on publicly available datasets, enabling credible evaluation. Fine-tuning the model on the Human-Object Interaction task achieved performance comparable to existing large models while substantially reducing training time owing to the model's lightweight design. The contributions of this paper are a lightweight large-scale Vision & Language model that can be fine-tuned on standard GPUs, an effective multitask training method for resource-constrained environments, and a model whose evaluation remains valid because it uses only public datasets.
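The core idea of freezing pretrained encoder weights and training only a small fusion component can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of ResNet-50 and BERT encoders, the concatenation-based fusion head, and all hyperparameters below are illustrative assumptions, shown only to make the frozen-encoder pattern concrete.

```python
# Minimal sketch (assumed encoders and fusion head, not the paper's architecture):
# pretrained vision and language encoders are frozen; only a lightweight fusion
# module receives gradients, which keeps fine-tuning feasible on a standard GPU.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import AutoModel


class FrozenEncoderVLModel(nn.Module):
    def __init__(self, text_model_name: str = "bert-base-uncased", hidden_dim: int = 512):
        super().__init__()
        # Pretrained vision encoder with frozen weights.
        self.vision_encoder = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.vision_encoder.fc = nn.Identity()  # expose 2048-d pooled features
        for p in self.vision_encoder.parameters():
            p.requires_grad = False

        # Pretrained language encoder with frozen weights.
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # Only this lightweight fusion head is trainable.
        self.fusion = nn.Sequential(
            nn.Linear(2048 + self.text_encoder.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, images, input_ids, attention_mask):
        with torch.no_grad():  # frozen encoders run without gradient tracking
            img_feat = self.vision_encoder(images)                      # (B, 2048)
            txt_feat = self.text_encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state[:, 0]                                   # (B, H) [CLS] token
        return self.fusion(torch.cat([img_feat, txt_feat], dim=-1))     # (B, hidden_dim)


# Only the trainable (fusion) parameters are handed to the optimizer.
model = FrozenEncoderVLModel()
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Because the encoder parameters never receive gradients, optimizer state and backward memory scale only with the small fusion module, which is the mechanism behind the reduced training cost and GPU footprint described in the abstract.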