Surveillance Video Anomaly Detection via Automatic Caption Generation Using Multimodal Large Language Models

橋本 慧志; 西村 仁志; 黒川 茂莉

doi:10.2493/jjspe.91.1136

Abstract

Recently, state-of-the-art performance in video anomaly detection has been achieved by fine-tuning multimodal large language models (MLLMs). However, this approach requires extensive caption annotations in the training data, which imposes significant practical constraints. To overcome this limitation, we propose a novel MLLM-based video anomaly detection method that does not require manual caption annotation. The proposed method consists of an anomaly detection model that identifies and selects key video samples, and an MLLM that autonomously generates and enhances captions explaining anomalous events. Extensive experiments demonstrate that our method achieves high detection accuracy and effectively generates task-specific explanatory descriptions.
