Host: The Japanese Society for Artificial Intelligence
Name: The 39th Annual Conference of the Japanese Society for Artificial Intelligence
Number: 39
Location: [in Japanese]
Date: May 27, 2025 - May 30, 2025
There is a growing demand for behavior analysis of workers in automobile manufacturing to automate the monitoring of compliance with work procedures and the measurement of each task's duration. Previous deep-neural-network methods for behavior analysis require frame-by-frame video labels for supervised training, so the shortage of labeled data has become a significant challenge. Meanwhile, Vision and Language Models (VLMs), which learn shared embeddings of images and text through large-scale pretraining, have recently attracted attention as a type of foundation model. By leveraging VLMs, models can be built more efficiently even in domains that traditionally required large amounts of labeled training data. This study therefore proposes a method that exploits the language modality by applying CLIP (Contrastive Language-Image Pre-training), a representative VLM, to behavior analysis in automobile assembly videos. In particular, this study verifies whether leveraging the language modality enables the construction of a model with only a small amount of labeled training data.
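As a rough illustration of how the language modality can be leveraged, the sketch below scores a single video frame against natural-language descriptions of work steps using CLIP's shared image-text embedding space. This is not the authors' implementation; the model checkpoint, prompt texts, and frame path are assumptions chosen for illustration, using the publicly available Hugging Face transformers CLIP API.

# Minimal sketch (assumed setup, not the paper's method): zero-shot scoring of a
# video frame against candidate work-step descriptions with a pretrained CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical natural-language descriptions of assembly work steps.
action_prompts = [
    "a worker tightening bolts on a car door",
    "a worker installing a windshield",
    "a worker inspecting the engine compartment",
]

frame = Image.open("frame_0001.jpg")  # hypothetical frame extracted from an assembly video

inputs = processor(text=action_prompts, images=frame,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity between the frame embedding and each
# text embedding in the shared space; softmax converts it to per-action scores.
probs = outputs.logits_per_image.softmax(dim=-1)
predicted_action = action_prompts[probs.argmax().item()]
print(predicted_action, probs.squeeze().tolist())

Because the action classes are expressed as text prompts rather than learned classifier weights, such a setup can in principle be adapted to new work steps with few or no additional frame-level labels, which is the property the study investigates.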