VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Researchers introduced VigilFormer, a video anomaly detection framework that combines deformable spatio-temporal attention with causal temporal modeling. The authors report AUC scores of 87.83% on UCF-Crime, 97.21% on ShanghaiTech, and 89.74% on CUHK Avenue at 41.5 FPS on a single GPU. A Deformable Spatio-Temporal Encoder attends to a sparse set of locations instead of computing dense attention, and an Adaptive Confidence Scheduler skips low-information frames at inference time. According to the paper, the model outperforms recent weakly-supervised methods in both accuracy and speed.

arXiv:2606.14724v1. Abstract: Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
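
For readers unfamiliar with the dilated causal convolutions mentioned in the abstract, the following is a minimal illustrative sketch and not the authors' implementation. It assumes one scalar feature per snippet (real snippet features would be vectors), uses arbitrary example weights, and the function names `dilated_causal_conv1d` and `receptive_field` are hypothetical. The point it shows is that each output at step t depends only on the current and earlier snippets, and that stacking layers with growing dilation widens the temporal context quickly.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], weights: Sequence[float], dilation: int = 1
) -> List[float]:
    """1-D dilated causal convolution over a sequence of snippet features.

    The output at step t depends only on x[t], x[t - d], x[t - 2d], ...
    Positions before the start of the sequence are treated as zeros
    (left padding), so the output has the same length as the input.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # weights[0] applies to the current step
            if idx >= 0:            # never look into the future or before t=0
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of time steps (current plus past) seen by a stack of layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    # Toy sequence of per-snippet values with a single spike at index 2.
    scores = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    h = dilated_causal_conv1d(scores, [0.5, 0.5], dilation=1)
    h = dilated_causal_conv1d(h, [0.5, 0.5], dilation=2)
    print(h)
    print(receptive_field(2, [1, 2, 4]))
```

In this toy run the spike influences only outputs at or after its own position, which is the property that makes the operation usable on streaming video. The kernel size and dilation schedule here are assumptions chosen for illustration; the abstract does not state the values used in VigilFormer.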