Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
39th (2025)
Session ID: 3P1-OS-46a-03

Chain-of-Thought based Object-level Multimodal Instruction Tuning Data Generation
*Julio VIZCARRA, Yanan WANG, Zhi LI, Hao NIU, Mori KUROKAWA
Conference proceedings and abstracts, free access

Abstract

Generating multimodal instruction tuning data with pre-trained LLMs (e.g., Llama 3, GPT-4) has become a standard approach for facilitating multimodal model training (e.g., the LLaVA series). However, recent works tend to generate compositional instruction text (e.g., “What did the boy do after he walked towards the woman with a present?”), which makes it difficult for the model to align with the visual context at various levels, from low-level features such as object identification and location to high-level concepts such as spatial relations and action recognition. Inspired by chain-of-thought (CoT), that is, step-by-step reasoning to solve complex tasks, we propose a methodology called Chain-Of-Tasks (CoTask) to construct a new multimodal instruction tuning dataset. We generate question-answer pairs for a video instruction tuning task (e.g., NExTQA). Our approach can expand the size of the existing dataset by more than 30 times. Furthermore, the generated subtasks cover a diverse range of fine-grained, object-level spatiotemporal reasoning questions, shedding light on ways to improve multimodal model training.

© 2025 The Japanese Society for Artificial Intelligence