Proceedings of the Annual Conference of JSAI
Online ISSN : 2758-7347
39th (2025)
Session ID : 3N6-GS-7-02

A Preliminary Study on Behavioral Analysis Using Vision and Language Foundation Model for Automobile Assembly Work Videos
*Koki KIYOTA, Kanta KUBO, Asuka HISATOMI, Hirotaka ITO, Yuta HIGASHIZONO, Satoshi ONO
CONFERENCE PROCEEDINGS FREE ACCESS

Abstract

There is growing demand for behavior analysis of workers in automobile manufacturing, both to automate the monitoring of compliance with work procedures and to measure the duration of each task. Previous behavior-analysis methods based on deep neural networks are trained by supervised learning and therefore require frame-by-frame labels for the videos, which makes the shortage of labeled data a significant challenge. In recent years, Vision and Language Models (VLMs), which acquire shared embeddings between images and text through large-scale pretraining, have attracted attention as a type of foundation model. Leveraging VLMs is making it possible to build models more efficiently, even in domains that traditionally required large amounts of labeled training data. This study therefore proposes a method that exploits the language modality by applying CLIP (Contrastive Language-Image Pre-training), a representative VLM, to behavior analysis in automobile assembly videos. In particular, it examines whether leveraging the language modality enables a model to be built from a small amount of labeled training data.
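
To make the idea of shared image-text embeddings concrete, the following is a minimal sketch of generic CLIP-style scoring. A frame embedding is compared with the embeddings of text prompts that describe candidate actions, and the cosine similarities are turned into label probabilities. This is not the authors' method, which the abstract does not specify. The prompts, the three-dimensional toy vectors, the function names, and the temperature value are all hypothetical and chosen for illustration; in practice the embeddings would come from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify_frame(frame_emb, text_embs, temperature=0.01):
    """Score one frame embedding against text-prompt embeddings.

    text_embs maps a prompt (action description) to its embedding.
    temperature is an assumed scaling constant, not a value from the paper.
    """
    labels = list(text_embs)
    sims = [cosine(frame_emb, text_embs[label]) for label in labels]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Toy stand-ins for encoder outputs (hypothetical values).
text_embs = {
    "a worker tightening a bolt": [0.9, 0.1, 0.0],
    "a worker attaching a door panel": [0.1, 0.9, 0.1],
    "a worker walking between stations": [0.0, 0.2, 0.9],
}
frame_emb = [0.8, 0.2, 0.1]

label, probs = classify_frame(frame_emb, text_embs)
print(label)
```

Because the action labels enter only as text prompts, scoring of this kind needs no frame-level labels, which is one way the language modality can reduce the amount of labeled training data required.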

© 2025 The Japanese Society for Artificial Intelligence