2025, Vol. 91, No. 1, pp. 81-88
Multimodal models that integrate modalities such as images and language have advanced rapidly, with large-scale general-purpose models such as OFA, Kosmos-2, and Unified-IO attracting particular attention in the vision domain. By combining image and language features, these models outperform image-only models across a wide range of vision tasks. Their performance gains, however, are driven by ever-larger training datasets and model sizes, which inflate training costs; moreover, the non-disclosure of training procedures and datasets hinders fine-tuning, reproducibility, and fair evaluation. To address these concerns, this study proposes a lightweight large-scale Vision & Language multimodal model that keeps the weights of pretrained encoders frozen. We introduce a multitask training approach that is efficient in resource-limited settings and relies solely on publicly available datasets, enabling credible evaluation. Fine-tuning the model on the Human-Object Interaction task achieved performance comparable to existing large models while substantially reducing training time owing to the model's lightweight design. The contributions of this paper are a lightweight large-scale Vision & Language model that can be fine-tuned on standard GPUs, an effective multitask training method for resource-constrained environments, and a model whose evaluation remains valid because it uses only public datasets.
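The core idea of freezing pretrained encoder weights and training only a small fusion component can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of ResNet-50 and BERT encoders, the concatenation-based fusion head, and all hyperparameters below are illustrative assumptions, shown only to make the frozen-encoder pattern concrete.

```python
# Minimal sketch (assumed encoders and fusion head, not the paper's architecture):
# pretrained vision and language encoders are frozen; only a lightweight fusion
# module receives gradients, which keeps fine-tuning feasible on a standard GPU.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights
from transformers import AutoModel


class FrozenEncoderVLModel(nn.Module):
    def __init__(self, text_model_name: str = "bert-base-uncased", hidden_dim: int = 512):
        super().__init__()
        # Pretrained vision encoder with frozen weights.
        self.vision_encoder = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.vision_encoder.fc = nn.Identity()  # expose 2048-d pooled features
        for p in self.vision_encoder.parameters():
            p.requires_grad = False

        # Pretrained language encoder with frozen weights.
        self.text_encoder = AutoModel.from_pretrained(text_model_name)
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # Only this lightweight fusion head is trainable.
        self.fusion = nn.Sequential(
            nn.Linear(2048 + self.text_encoder.config.hidden_size, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, images, input_ids, attention_mask):
        with torch.no_grad():  # frozen encoders run without gradient tracking
            img_feat = self.vision_encoder(images)                      # (B, 2048)
            txt_feat = self.text_encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state[:, 0]                                   # (B, H) [CLS] token
        return self.fusion(torch.cat([img_feat, txt_feat], dim=-1))     # (B, hidden_dim)


# Only the trainable (fusion) parameters are handed to the optimizer.
model = FrozenEncoderVLModel()
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Because the encoder parameters never receive gradients, optimizer state and backward memory scale only with the small fusion module, which is the mechanism behind the reduced training cost and GPU footprint described in the abstract.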