Research Paper:
GCN-Transformer Autoencoder with Knowledge Distillation for Unsupervised Video Anomaly Detection
Mingchao Yan*1,*2,*3, Yonghua Xiong*1,*2,*3, and Jinhua She*4
*1School of Automation, China University of Geosciences
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China
*2Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China
*3Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
No.388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China
*4School of Engineering, Tokyo University of Technology
1404-1 Katakura, Hachioji, Tokyo 192-0982, Japan
Corresponding author
Video anomaly detection is crucial in intelligent surveillance, yet the scarcity and diversity of abnormal events pose significant challenges for supervised methods. This paper presents an unsupervised framework that integrates graph attention networks (GATs) and Transformer architectures, combining masked autoencoders (MAEs) with self-distillation training. GATs model spatial and inter-frame relationships, while Transformers capture long-range temporal dependencies, overcoming the limitations of conventional MAE and self-distillation approaches. The model is trained in two stages: first, a lightweight MAE is combined with a GAT-Transformer fusion encoder to construct the knowledge distillation (teacher) module; second, the student autoencoder is optimized by integrating a graph convolutional autoencoder with a classification head that identifies synthetic anomalies. We evaluate the proposed method on three representative datasets (ShanghaiTech Campus, UBnormal, and UCSD Ped2) and achieve promising results.
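The two-stage design described above amounts to a teacher-student pair: a lightweight masked autoencoder whose encoder fuses graph and Transformer layers (teacher), and a graph-convolutional autoencoder with a classification head for synthetic anomalies (student). The PyTorch sketch below illustrates that structure only; every module name, dimension, masking ratio, and loss weight (`TeacherMAE`, `StudentAE`, `student_loss`, etc.) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the two-stage teacher-student idea from the abstract.
# All shapes and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleGraphConv(nn.Module):
    """One graph convolution step: row-normalized adjacency times node features."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                 # x: (B, N, in_dim), adj: (B, N, N)
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        return self.lin(torch.bmm(adj / deg, x))


class TeacherMAE(nn.Module):
    """Lightweight masked autoencoder with a graph + Transformer fusion encoder."""
    def __init__(self, dim=128, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.gconv = SimpleGraphConv(dim, dim)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, tokens, adj):            # tokens: (B, N, dim)
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        x = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out masked tokens
        feats = self.temporal(self.gconv(x, adj))         # spatial, then temporal
        return self.decoder(feats), feats, mask


class StudentAE(nn.Module):
    """Graph-convolutional autoencoder plus a head scoring synthetic anomalies."""
    def __init__(self, dim=128):
        super().__init__()
        self.enc = SimpleGraphConv(dim, dim)
        self.dec = SimpleGraphConv(dim, dim)
        self.cls_head = nn.Linear(dim, 1)

    def forward(self, tokens, adj):
        z = F.relu(self.enc(tokens, adj))
        return self.dec(z, adj), self.cls_head(z.mean(dim=1)), z


def student_loss(student, teacher, tokens, adj, anomaly_labels):
    """Stage-two objective: reconstruction + feature distillation + anomaly BCE."""
    with torch.no_grad():                      # teacher is frozen in stage two
        _, t_feats, _ = teacher(tokens, adj)
    recon, logit, z = student(tokens, adj)
    return (F.mse_loss(recon, tokens)
            + F.mse_loss(z, t_feats)
            + F.binary_cross_entropy_with_logits(logit.squeeze(-1), anomaly_labels))
```

Under this reading, stage one trains the teacher with its masked-reconstruction loss alone; in stage two the teacher is frozen (the `torch.no_grad()` block) and the student learns jointly from the input, the teacher's features, and labels for the synthesized anomalies.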
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.