Chain-of-Thought based Object-level Multimodal Instruction Tuning Data Generation

Julio VIZCARRA; Yanan WANG; Zhi LI; Hao NIU; Mori KUROKAWA

doi:10.11517/pjsai.JSAI2025.0_3P1OS46a03

39th (2025)

Session ID: 3P1-OS-46a-03


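Conference information

Organizer: The Japanese Society for Artificial Intelligence

Conference: The 39th Annual Conference of the Japanese Society for Artificial Intelligence, 2025

Edition: 39

Venue: Osaka International Convention Center + online

Dates: 2025/05/27 - 2025/05/30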


*Julio VIZCARRA, Yanan WANG, Zhi LI, Hao NIU, Mori KUROKAWA


Keywords: video question answering, video scene graph, instruction data augmentation, MLLM, knowledge graph

Conference proceedings and abstracts (free access)


Abstract

Generating multimodal instruction tuning data with pre-trained LLMs (e.g., Llama 3, GPT-4) has become a standard approach for facilitating multimodal model training (e.g., the LLaVA series). However, recent works tend to generate compositional instruction text (e.g., “What did the boy do after he walked towards the woman with a present?”), which makes it hard for the model to align with the visual context at its various levels: low-level features such as object identification and location, and high-level concepts such as spatial relations and action recognition. Inspired by chain-of-thought (CoT), i.e., step-by-step reasoning to solve complex tasks, we propose a methodology called Chain-Of-Tasks (CoTask) to construct a new multimodal instruction tuning dataset. We generate question-answer pairs for a video instruction tuning task (e.g., NExTQA). Our approach can expand the size of the existing dataset by a factor of more than 30. Furthermore, the generated subtasks cover a diverse range of fine-grained, object-level spatiotemporal reasoning questions, shedding light on ways to improve multimodal model training.
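
The abstract does not describe the CoTask pipeline in detail, so the following is only a minimal sketch of the general idea of turning object-level annotations into simple subtask questions. The annotation schema (`Relation`, the `objects` dictionary), the question templates, and the function `generate_subtasks` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One scene-graph triple observed in a frame (hypothetical schema)."""
    frame: int
    subject: str
    predicate: str
    obj: str
    kind: str  # "spatial" or "action"


def generate_subtasks(objects, relations):
    """Turn object-level annotations into simple question-answer pairs."""
    qa_pairs = []
    # Low-level subtasks: object identification and location.
    for (frame, name), box in sorted(objects.items()):
        qa_pairs.append((f"Is there a {name} in frame {frame}?", "yes"))
        qa_pairs.append((f"Where is the {name} in frame {frame}?", str(box)))
    # Higher-level subtasks: spatial relations and actions.
    for rel in relations:
        if rel.kind == "spatial":
            question = (f"In frame {rel.frame}, where is the {rel.subject} "
                        f"relative to the {rel.obj}?")
        else:
            question = (f"In frame {rel.frame}, what is the {rel.subject} "
                        f"doing with the {rel.obj}?")
        qa_pairs.append((question, rel.predicate))
    return qa_pairs


# Toy annotations (invented for illustration): bounding boxes keyed by
# (frame index, object name), plus two scene-graph relations.
objects = {
    (0, "boy"): (10, 20, 60, 120),
    (0, "woman"): (100, 15, 150, 125),
    (1, "boy"): (40, 20, 90, 120),
    (1, "present"): (85, 60, 100, 80),
}
relations = [
    Relation(0, "boy", "left of", "woman", "spatial"),
    Relation(1, "boy", "holding", "present", "action"),
]

subtasks = generate_subtasks(objects, relations)
original_questions = 1  # the single compositional question for this clip
expansion_factor = len(subtasks) / original_questions
print(f"{len(subtasks)} subtask pairs, expansion factor {expansion_factor:.0f}x")
```

In this toy setting one compositional question yields several object-level pairs, which illustrates how per-object and per-relation templates can multiply dataset size; the factor obtained here says nothing about the figure reported in the abstract.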

* Corresponding author
