A Preliminary Study on Behavioral Analysis Using Vision and Language Foundation Model for Automobile Assembly Work Videos

清田 航暉; 久保 莞太; 久冨 あすか; 伊藤 浩隆; 東園 雄太; 小野 智司

doi:10.11517/pjsai.JSAI2025.0_3N6GS702

39th (2025)

Session ID : 3N6-GS-7-02

Conference information

Host: The Japanese Society for Artificial Intelligence

Name : The 39th Annual Conference of the Japanese Society for Artificial Intelligence

Number : 39

Date : May 27, 2025 - May 30, 2025

A Preliminary Study on Behavioral Analysis Using Vision and Language Foundation Model for Automobile Assembly Work Videos

*Koki KIYOTA, Kanta KUBO, Asuka HISATOMI, Hirotaka ITO, Yuta HIGASHIZONO, Satoshi ONO

Keywords: Multimodal Foundation Model, Behavioral Analysis, Temporal Action Segmentation, Natural Language Processing, Video image processing

Abstract

In automobile manufacturing, there is growing demand for worker behavior analysis that automates two tasks: monitoring compliance with work procedures and measuring the duration of each task. Existing behavior-analysis methods based on deep neural networks are trained by supervised learning and require frame-by-frame video labels, so the scarcity of labeled data is a significant challenge. Meanwhile, Vision and Language Models (VLMs), which acquire shared embeddings between images and text through large-scale pretraining, have recently attracted attention as a type of foundation model. VLMs make it possible to build models more efficiently, even in domains that traditionally required large amounts of labeled training data. This study therefore proposes a method that exploits the language modality by applying CLIP (Contrastive Language-Image Pre-training), a representative VLM, to behavior analysis in automobile assembly videos. In particular, the study verifies whether leveraging the language modality enables a model to be built from a small amount of labeled training data.
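As an illustration of how a shared image-text embedding space can support this kind of analysis, the following is a minimal sketch and not the authors' method. It assumes that each video frame and each task description has already been encoded into a common space (for example by CLIP's image and text encoders), assigns every frame the task whose text embedding has the highest cosine similarity, and merges consecutive frames into timed segments. The task names, the toy 3-dimensional vectors, and the helper names `label_frames` and `to_segments` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def label_frames(frame_embs, text_embs):
    """Assign each frame the task whose text embedding is most similar."""
    labels = []
    for frame in frame_embs:
        best = max(text_embs, key=lambda name: cosine(frame, text_embs[name]))
        labels.append(best)
    return labels


def to_segments(labels, fps):
    """Merge runs of identical labels into (label, start_sec, end_sec)."""
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the video or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            segments.append((labels[start], start / fps, i / fps))
            start = i
    return segments


# Toy embeddings standing in for real encoder outputs (assumption).
text_embs = {
    "tighten bolt": [1.0, 0.0, 0.0],
    "attach part": [0.0, 1.0, 0.0],
}
frame_embs = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.0],
    [0.2, 0.7, 0.1],
]

labels = label_frames(frame_embs, text_embs)
segments = to_segments(labels, fps=2.0)
print(segments)
```

Each segment's end time minus its start time gives a per-task duration, which corresponds to the duration measurement described in the abstract. In practice the per-frame scores would likely need temporal smoothing, and the study itself uses some labeled training data rather than pure nearest-text matching.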

Corresponding author
