Recently, state-of-the-art performance in video anomaly detection has been achieved by fine-tuning multimodal large language models (MLLMs). However, the need for extensive caption annotations in the training data imposes significant practical constraints. To overcome this limitation, we propose a novel MLLM-based video anomaly detection method that requires no manual caption annotation. The proposed method consists of an anomaly detection model that identifies and selects key video samples, and an MLLM that autonomously generates and enhances captions explaining the anomalous events. Extensive experiments demonstrate that our method achieves high detection accuracy and effectively generates task-specific explanatory descriptions.
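As a rough illustration of the two-stage pipeline described above, the sketch below shows how an anomaly detection model could score and select key clips, after which an MLLM generates and then enhances an explanatory caption for each selected clip, removing the need for manual annotation. All names and interfaces here (`AnomalyScorer`, `MLLMCaptioner`, `pseudo_label_dataset`, the score threshold) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the pipeline described in the abstract.
# Component names and interfaces are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Clip:
    frames: list                  # decoded video frames
    anomaly_score: float = 0.0    # filled in by the detector
    caption: str = ""             # filled in by the MLLM


class AnomalyScorer:
    """Stage 1: score clips so key (likely anomalous) samples can be selected."""
    def score(self, clip: Clip) -> float:
        raise NotImplementedError  # a real detector would run a video model here


class MLLMCaptioner:
    """Stage 2: generate and then enhance an explanatory caption for a clip."""
    def generate(self, clip: Clip) -> str:
        raise NotImplementedError
    def enhance(self, clip: Clip, draft: str) -> str:
        raise NotImplementedError


def pseudo_label_dataset(clips: List[Clip],
                         scorer: AnomalyScorer,
                         captioner: MLLMCaptioner,
                         threshold: float = 0.5) -> List[Clip]:
    """Select key clips by anomaly score and attach auto-generated captions."""
    for clip in clips:
        clip.anomaly_score = scorer.score(clip)
    key_clips = [c for c in clips if c.anomaly_score >= threshold]
    for clip in key_clips:
        draft = captioner.generate(clip)            # initial explanation
        clip.caption = captioner.enhance(clip, draft)  # self-enhancement step
    return key_clips
```

The resulting caption-annotated key clips would then serve as training data for fine-tuning the MLLM, in place of manually written captions.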