2025 Volume 29 Issue 3 Pages 659-667
Video anomaly detection is crucial for intelligent surveillance, yet the scarcity and diversity of abnormal events pose significant challenges for supervised methods. This paper presents an unsupervised framework that integrates graph attention networks (GATs) and Transformer architectures, combining masked autoencoders (MAEs) with self-distillation training. GATs model spatial and inter-frame relationships, while Transformers capture long-range temporal dependencies, overcoming limitations of conventional MAE and self-distillation approaches. The model is trained in two stages: first, a lightweight MAE combined with a GAT-Transformer fusion constructs a knowledge-distillation module; second, the student autoencoder is optimized by integrating a graph convolutional autoencoder and a classification head that identifies synthetic anomalies. We evaluate the proposed method on three representative datasets (ShanghaiTech Campus, UBnormal, and UCSD Ped2) and achieve promising results.
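To make the graph-attention component concrete, below is a minimal NumPy sketch of a single-head GAT update over node features (e.g., frame or patch embeddings). This is illustrative only: the paper's actual GAT configuration, number of heads, feature dimensions, and graph construction are not specified in the abstract, so all shapes and names here are assumptions.

```python
import numpy as np

def gat_layer(X, A, W, a, leaky=0.2):
    """Single-head graph attention update (illustrative sketch).

    X : (N, F)  node features, e.g. per-frame or per-patch embeddings
    A : (N, N)  adjacency matrix with self-loops (1 = edge, 0 = no edge)
    W : (F, Fo) learned linear projection
    a : (2*Fo,) learned attention vector, split over [h_i || h_j]
    """
    H = X @ W                                    # project node features
    Fo = H.shape[1]
    # attention logits e_ij = LeakyReLU(a^T [h_i || h_j]), computed
    # as a sum of per-source and per-target contributions
    e = (H @ a[:Fo])[:, None] + (H @ a[Fo:])[None, :]
    e = np.where(e > 0, e, leaky * e)            # LeakyReLU
    e = np.where(A > 0, e, -1e9)                 # mask non-neighbors
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over neighbors
    return alpha @ H                             # attention-weighted aggregation
```

A node connected only to itself simply recovers its own projected feature, since its softmax collapses onto the self-loop; nodes with neighbors mix their projected features according to the learned attention weights.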