Journal of the Japan Society for Precision Engineering (精密工学会誌)
Online ISSN : 1882-675X
Print ISSN : 0912-0289
ISSN-L : 0912-0289
Paper
Efficient Training of a Lightweight Multimodal Model and Its Application to Downstream Tasks (軽量マルチモーダルモデルの学習効率化と下流タスクへの適用)
梁瀬 和哉, 軸屋 敬介, 表 英輝, 土田 裕登, 加藤 邦人
Journal Free Access

2025, Volume 91, Issue 1, Pages 81-88

Abstract

Multimodal models that integrate modalities such as images and language have advanced rapidly in recent years, with large-scale general-purpose models such as OFA, Kosmos-2, and Unified-IO attracting particular attention in the vision field. By fusing image and language features, these models have achieved remarkable results across diverse vision tasks, surpassing image-only models. Their performance gains, however, are driven by ever-larger training datasets and model scales, so model size and training cost continue to grow. Moreover, because training methodologies and datasets are often not disclosed, fine-tuning and reproduction are difficult, which undermines fair evaluation. To address these concerns, this study proposes a lightweight large-scale Vision & Language multimodal model that keeps the weights of its pretrained encoders frozen. We introduce a multitask training approach that is efficient in resource-limited settings and relies solely on publicly available datasets so that evaluations remain credible. Fine-tuning the model on the Human-Object Interaction task achieved performance comparable to existing large models while substantially reducing training time thanks to the model's lightweight design. This paper thus contributes a lightweight large-scale Vision & Language model that can be fine-tuned on standard GPUs, an effective multitask training method for constrained environments, and an evaluation that remains valid because it uses only public datasets.
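To illustrate the core idea of the abstract, the following is a minimal PyTorch-style sketch of a Vision & Language model whose pretrained encoders are frozen so that only a small fusion head is trained. This is not the authors' released implementation; all class names, dimensions, and the placeholder encoders are hypothetical stand-ins.

```python
# Minimal sketch (not the authors' code): freeze pretrained encoder
# weights and train only a lightweight fusion head on top.
import torch
import torch.nn as nn

class FrozenEncoderVLModel(nn.Module):
    """Vision & Language model whose encoders stay frozen during training."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 img_dim: int, txt_dim: int,
                 hidden_dim: int = 512, num_labels: int = 10):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Freeze every pretrained encoder parameter; only the fusion
        # head below receives gradients, which keeps fine-tuning cheap.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        self.fusion_head = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # frozen encoders need no gradient graph
            img_feat = self.image_encoder(image)
            txt_feat = self.text_encoder(text)
        return self.fusion_head(torch.cat([img_feat, txt_feat], dim=-1))

# Hypothetical stand-ins for pretrained encoders; in practice these would
# be loaded from published checkpoints (e.g. a ViT and a BERT-style model).
image_enc = nn.Linear(2048, 768)   # placeholder image encoder
text_enc = nn.Linear(300, 768)     # placeholder text encoder
model = FrozenEncoderVLModel(image_enc, text_enc, img_dim=768, txt_dim=768)

# Only the fusion head's parameters are handed to the optimizer, so
# fine-tuning on a standard GPU touches a small fraction of the weights.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

Because the frozen encoders run under torch.no_grad(), no activations are stored for them during backpropagation, which is what makes fine-tuning on a single standard GPU feasible in the resource-limited setting the abstract describes.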

© 2025 The Japan Society for Precision Engineering