
JACIII Vol.29 No.3, pp. 659-667 (2025)
doi: 10.20965/jaciii.2025.p0659

Research Paper:

GCN-Transformer Autoencoder with Knowledge Distillation for Unsupervised Video Anomaly Detection

Mingchao Yan*1,*2,*3, Yonghua Xiong*1,*2,*3,†, and Jinhua She*4

*1School of Automation, China University of Geosciences
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*2Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*3Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

*4School of Engineering, Tokyo University of Technology
1404-1 Katakura, Hachioji, Tokyo 192-0982, Japan

†Corresponding author

Received: January 20, 2025
Accepted: March 6, 2025
Published: May 20, 2025

Keywords: video anomaly detection, self-distillation, GAT-Transformer fusion, knowledge distillation, graph convolutional networks
Abstract

Video anomaly detection is crucial in intelligent surveillance, yet the scarcity and diversity of abnormal events pose significant challenges for supervised methods. This paper presents an unsupervised framework that integrates graph attention networks (GATs) and Transformer architectures, combining masked autoencoders (MAEs) with self-distillation training. GATs are utilized to model spatial and inter-frame relationships, while Transformers capture long-range temporal dependencies, overcoming the limitations of traditional MAE and self-distillation approaches. The model employs a two-stage training process: first, a lightweight MAE combined with a GAT-Transformer fusion constructs a knowledge distillation module; second, the student autoencoder is optimized by integrating a graph convolutional autoencoder and a classification head to identify synthetic anomalies. We evaluate the proposed method on three representative datasets—ShanghaiTech Campus, UBnormal, and UCSD Ped2—and achieve promising results.
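The graph convolutional building block mentioned in the abstract follows the standard propagation rule of Kipf and Welling [8]: H' = ReLU(D^-1/2 (A + I) D^-1/2 · H · W). The following is a minimal pure-Python sketch of one such layer for illustration only; the matrix sizes, feature values, and function names are hypothetical and do not reproduce the paper's actual architecture.

```python
# Sketch of one GCN layer: H' = ReLU( D^-1/2 (A + I) D^-1/2 . H . W )
# Plain Python lists stand in for tensors; sizes are illustrative.
import math

def matmul(a, b):
    """Naive dense matrix multiply for small matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, feats, weight):
    """One graph convolution: self-loops, symmetric normalization, linear map, ReLU."""
    n = len(adj)
    # Add self-loops: A_hat = A + I
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Node degrees under A_hat
    deg = [sum(row) for row in a_hat]
    # Symmetrically normalized adjacency: D^-1/2 A_hat D^-1/2
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features, then apply the learned linear transform
    out = matmul(matmul(norm, feats), weight)
    # ReLU non-linearity
    return [[max(0.0, v) for v in row] for row in out]

# Toy example: a 3-node path graph, 2-d features, identity weight matrix.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0, 0.0],
         [0.0, 1.0],
         [1.0, 1.0]]
weight = [[1.0, 0.0],
          [0.0, 1.0]]
h = gcn_layer(adj, feats, weight)
```

In the paper's framework such layers would operate on graphs built from video frames or regions; here the identity weight simply passes the normalized neighborhood aggregation through, so each output row mixes a node's features with those of its neighbors.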

Cite this article as:
M. Yan, Y. Xiong, and J. She, “GCN-Transformer Autoencoder with Knowledge Distillation for Unsupervised Video Anomaly Detection,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.3, pp. 659-667, 2025.
References
  [1] W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 6479-6488, 2018. https://doi.org/10.1109/CVPR.2018.00678
  [2] C. Zhang et al., “Weakly supervised anomaly detection in videos considering the openness of events,” IEEE Trans. on Intelligent Transportation Systems, Vol.23, No.11, pp. 21687-21699, 2022. https://doi.org/10.1109/TITS.2022.3174088
  [3] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, “Learning temporal regularity in video sequences,” 2016 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 733-742, 2016. https://doi.org/10.1109/CVPR.2016.86
  [4] A. Acsintoae et al., “UBnormal: New benchmark for supervised open-set video anomaly detection,” 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 20111-20121, 2022. https://doi.org/10.1109/CVPR52688.2022.01951
  [5] L. Zhu, L. Wang, A. Raj, T. Gedeon, and C. Chen, “Advancing video anomaly detection: A concise review and a new dataset,” arXiv:2402.04857, 2024. https://doi.org/10.48550/arXiv.2402.04857
  [6] K. He et al., “Masked autoencoders are scalable vision learners,” 2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 15979-15988, 2022. https://doi.org/10.1109/CVPR52688.2022.01553
  [7] N.-C. Ristea et al., “Self-distilled masked auto-encoders are efficient video anomaly detectors,” 2024 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 15984-15995, 2024. https://doi.org/10.1109/CVPR52733.2024.01513
  [8] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv:1609.02907, 2016. https://doi.org/10.48550/arXiv.1609.02907
  [9] A. Vaswani et al., “Attention is all you need,” Proc. of the 31st Int. Conf. on Neural Information Processing Systems, pp. 6000-6010, 2017.
  [10] Y. Liu et al., “Generalized video anomaly event detection: Systematic taxonomy and comparison of deep models,” ACM Computing Surveys, Vol.56, No.7, Article No.189, 2024. https://doi.org/10.1145/3645101
  [11] W. Luo, W. Liu, and S. Gao, “A revisit of sparse coding based anomaly detection in stacked RNN framework,” 2017 IEEE Int. Conf. on Computer Vision, pp. 341-349, 2017. https://doi.org/10.1109/ICCV.2017.45
  [12] W. Li, V. Mahadevan, and N. Vasconcelos, “Anomaly detection and localization in crowded scenes,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.36, No.1, pp. 18-32, 2014. https://doi.org/10.1109/TPAMI.2013.111
  [13] S. Sun and X. Gong, “Hierarchical semantic contrast for scene-aware video anomaly detection,” 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 22846-22856, 2023. https://doi.org/10.1109/CVPR52729.2023.02188
  [14] P. Malhotra et al., “LSTM-based encoder-decoder for multi-sensor anomaly detection,” arXiv:1607.00148, 2016. https://doi.org/10.48550/arXiv.1607.00148
  [15] W. Luo, W. Liu, and S. Gao, “Remembering history with convolutional LSTM for anomaly detection,” 2017 IEEE Int. Conf. on Multimedia and Expo, pp. 439-444, 2017. https://doi.org/10.1109/ICME.2017.8019325
  [16] R. Morais et al., “Learning regularity in skeleton trajectories for anomaly detection in videos,” 2019 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 11988-11996, 2019. https://doi.org/10.1109/CVPR.2019.01227
  [17] A. Markovitz, G. Sharir, I. Friedman, L. Zelnik-Manor, and S. Avidan, “Graph embedded pose clustering for anomaly detection,” 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 10536-10544, 2020. https://doi.org/10.1109/CVPR42600.2020.01055
  [18] R. Wang et al., “Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning,” 2023 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 6312-6322, 2023. https://doi.org/10.1109/CVPR52729.2023.00611
  [19] M. Caron et al., “Emerging properties in self-supervised vision transformers,” 2021 IEEE/CVF Int. Conf. on Computer Vision, pp. 9630-9640, 2021. https://doi.org/10.1109/ICCV48922.2021.00951
  [20] J.-B. Grill et al., “Bootstrap your own latent: A new approach to self-supervised learning,” Proc. of the 34th Int. Conf. on Neural Information Processing Systems, pp. 21271-21284, 2020.
  [21] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv:1503.02531, 2015. https://doi.org/10.48550/arXiv.1503.02531
  [22] A. Romero et al., “FitNets: Hints for thin deep nets,” arXiv:1412.6550, 2014. https://doi.org/10.48550/arXiv.1412.6550
  [23] Y. Chen, N. Wang, and Z. Zhang, “DarkRank: Accelerating deep metric learning via cross sample similarities transfer,” Proc. of the AAAI Conf. on Artificial Intelligence, Vol.32, No.1, pp. 2852-2859, 2018. https://doi.org/10.1609/aaai.v32i1.11783
  [24] P. Bergmann, M. Fauser, D. Sattlegger, and C. Steger, “Uninformed students: Student-teacher anomaly detection with discriminative latent embeddings,” 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 4182-4191, 2020. https://doi.org/10.1109/CVPR42600.2020.00424
  [25] L. Liu, H. Wang, J. Lin, R. Socher, and C. Xiong, “MKD: A multi-task knowledge distillation approach for pretrained language models,” arXiv:1911.03588, 2019. https://doi.org/10.48550/arXiv.1911.03588
  [26] X. Ji, B. Li, and Y. Zhu, “TAM-Net: Temporal enhanced appearance-to-motion generative network for video anomaly detection,” 2020 Int. Joint Conf. on Neural Networks, 2020. https://doi.org/10.1109/IJCNN48605.2020.9207231
  [27] M. Astrid, M. Z. Zaheer, and S.-I. Lee, “Synthetic temporal anomaly guided end-to-end video anomaly detection,” 2021 IEEE/CVF Int. Conf. on Computer Vision Workshops, pp. 207-214, 2021. https://doi.org/10.1109/ICCVW54120.2021.00028
  [28] A. Del Giorno, J. A. Bagnell, and M. Hebert, “A discriminative framework for anomaly detection in large videos,” Proc. of the 14th European Conf. on Computer Vision, Part 5, pp. 334-349, 2016. https://doi.org/10.1007/978-3-319-46454-1_21
  [29] D. Gong et al., “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” 2019 IEEE/CVF Int. Conf. on Computer Vision, pp. 1705-1714, 2019. https://doi.org/10.1109/ICCV.2019.00179
  [30] Y. Tang et al., “Integrating prediction and reconstruction for anomaly detection,” Pattern Recognition Letters, Vol.129, pp. 123-130, 2020. https://doi.org/10.1016/j.patrec.2019.11.024
  [31] R. Rodrigues, N. Bhargava, R. Velmurugan, and S. Chaudhuri, “Multi-timescale trajectory prediction for abnormal human activity detection,” 2020 IEEE Winter Conf. on Applications of Computer Vision, pp. 2615-2623, 2020. https://doi.org/10.1109/WACV45572.2020.9093633
  [32] G. Yu et al., “Cloze test helps: Effective video anomaly detection via learning to complete video events,” Proc. of the 28th ACM Int. Conf. on Multimedia, pp. 583-591, 2020. https://doi.org/10.1145/3394171.3413973
  [33] M.-I. Georgescu et al., “Anomaly detection in video via self-supervised and multi-task learning,” 2021 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 12737-12747, 2021. https://doi.org/10.1109/CVPR46437.2021.01255
  [34] A. Barbalau et al., “SSMTL++: Revisiting self-supervised multi-task learning for video anomaly detection,” Computer Vision and Image Understanding, Vol.229, Article No.103656, 2023. https://doi.org/10.1016/j.cviu.2023.103656
  [35] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” Proc. of the 38th Int. Conf. on Machine Learning, pp. 813-824, 2021.
