
JACIII Vol.29 No.5 pp. 1056-1061
doi: 10.20965/jaciii.2025.p1056
(2025)

Research Paper:

Research on Action Recognition Algorithm Based on SlowFast Network

Yinhao Xu and Yuanyao Lu

School of Information Science and Technology, North China University of Technology
No.5 Jinyuanzhuang Road, Shijingshan District, Beijing 100144, China

Corresponding author

Received:
October 31, 2024
Accepted:
May 8, 2025
Published:
September 20, 2025
Keywords:
human action recognition, 3D-EfficientNet, lightweight
Abstract

As a major branch of video understanding, human action recognition has become a popular research topic in computer vision and has a wide range of applications. To address the high parameter consumption and weak spatiotemporal modeling capabilities of existing action recognition methods, this study proposes a lightweight dual-branch convolutional network called SlowFast-Light Net. Inspired by the two-branch SlowFast network of Feichtenhofer et al., the proposed design improves on the original SlowFast architecture by introducing a lightweight feature extraction network, which substantially reduces parameter consumption and accelerates model convergence. Experiments on the UCF101 and HMDB51 datasets yield action recognition accuracies of 93.80% and 80.00%, respectively, on the two test sets. The results show that the proposed model achieves recognition accuracy comparable to that of the original model with considerably fewer parameters.
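The full architectural details are given in the paper itself; purely as a rough illustration of the dual-branch idea described above, the following is a minimal PyTorch sketch of a SlowFast-style two-pathway network built on a lightweight (depthwise-separable) 3D backbone. The block design, channel widths, temporal sampling ratio alpha, and the names Lightweight3DBlock and SlowFastLightSketch are illustrative assumptions, not the paper's actual SlowFast-Light Net configuration.

    # Minimal sketch of a SlowFast-style dual-pathway model with a lightweight
    # 3D backbone (PyTorch). All hyperparameters below are illustrative
    # assumptions, not the configuration reported in the paper.
    import torch
    import torch.nn as nn


    class Lightweight3DBlock(nn.Module):
        """Depthwise-separable 3D convolution block (a common lightweight design)."""
        def __init__(self, in_ch, out_ch, stride=(1, 2, 2)):
            super().__init__()
            self.depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, stride=stride,
                                       padding=1, groups=in_ch, bias=False)
            self.pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1, bias=False)
            self.bn = nn.BatchNorm3d(out_ch)
            self.act = nn.SiLU()

        def forward(self, x):
            return self.act(self.bn(self.pointwise(self.depthwise(x))))


    class SlowFastLightSketch(nn.Module):
        """Two pathways over the same clip: the slow path sees few frames at
        high channel capacity, the fast path sees many frames at low capacity."""
        def __init__(self, num_classes=101, alpha=4):
            super().__init__()
            self.alpha = alpha  # temporal sampling ratio between fast and slow paths
            self.slow = nn.Sequential(Lightweight3DBlock(3, 64),
                                      Lightweight3DBlock(64, 128))
            self.fast = nn.Sequential(Lightweight3DBlock(3, 8),
                                      Lightweight3DBlock(8, 16))
            self.pool = nn.AdaptiveAvgPool3d(1)
            self.fc = nn.Linear(128 + 16, num_classes)

        def forward(self, clip):                    # clip: (B, 3, T, H, W)
            slow_in = clip[:, :, ::self.alpha]      # subsample frames for the slow path
            s = self.pool(self.slow(slow_in)).flatten(1)
            f = self.pool(self.fast(clip)).flatten(1)
            return self.fc(torch.cat([s, f], dim=1))  # late fusion of both pathways


    if __name__ == "__main__":
        model = SlowFastLightSketch()
        out = model(torch.randn(2, 3, 16, 112, 112))  # two dummy 16-frame clips
        print(out.shape)                              # -> torch.Size([2, 101])

In this sketch, the reduction in parameters comes from the depthwise-separable 3D convolutions and the narrow fast pathway; the original SlowFast model instead uses full 3D ResNet pathways.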

Fig. Model structure diagram of the SlowFast-network-based action recognition algorithm.

Cite this article as:
Y. Xu and Y. Lu, “Research on Action Recognition Algorithm Based on SlowFast Network,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.5, pp. 1056-1061, 2025.
References
[1] S. Hoshino and K. Niimura, “Robot vision system for human detection and action recognition,” J. Adv. Comput. Intell. Intell. Inform., Vol.24, No.3, pp. 346-356, 2020. https://doi.org/10.20965/jaciii.2020.p0346
[2] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “SlowFast networks for video recognition,” 2019 IEEE/CVF Int. Conf. on Computer Vision, pp. 6201-6210, 2019. https://doi.org/10.1109/ICCV.2019.00630
[3] X. Zhang, X. Zhou, M. Lin, and J. Sun, “ShuffleNet: An extremely efficient convolutional neural network for mobile devices,” 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 6848-6856, 2018. https://doi.org/10.1109/CVPR.2018.00716
[4] A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861, 2017. https://doi.org/10.48550/arXiv.1704.04861
[5] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” Proc. of the 36th Int. Conf. on Machine Learning, pp. 6105-6114, 2019.
[6] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Proc. of the 28th Int. Conf. on Neural Information Processing Systems, Vol.1, pp. 568-576, 2014.
[7] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” 2016 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1933-1941, 2016. https://doi.org/10.1109/CVPR.2016.213
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” 2015 IEEE Int. Conf. on Computer Vision, pp. 4489-4497, 2015. https://doi.org/10.1109/ICCV.2015.510
[9] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” 2017 IEEE Int. Conf. on Computer Vision, pp. 5534-5542, 2017. https://doi.org/10.1109/ICCV.2017.590
[10] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” 2017 IEEE Conf. on Computer Vision and Pattern Recognition, pp. 4724-4733, 2017. https://doi.org/10.1109/CVPR.2017.502
[11] F. N. Iandola et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv:1602.07360, 2016. https://doi.org/10.48550/arXiv.1602.07360
[12] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” Proc. of the 15th European Conf. on Computer Vision, Part 14, pp. 122-138, 2018. https://doi.org/10.1007/978-3-030-01264-9_8
[13] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018. https://doi.org/10.1109/CVPR.2018.00474
[14] P. Wang, L. Li, F. Pan, and L. Wang, “Lightweight bilateral network for real-time semantic segmentation,” J. Adv. Comput. Intell. Intell. Inform., Vol.27, No.4, pp. 673-682, 2023. https://doi.org/10.20965/jaciii.2023.p0673
[15] Z. Du, Y. Jin, H. Ma, and P. Liu, “A lightweight and accurate method for detecting traffic flow in real time,” J. Adv. Comput. Intell. Intell. Inform., Vol.27, No.6, pp. 1086-1095, 2023. https://doi.org/10.20965/jaciii.2023.p1086
[16] Y. Li et al., “TEA: Temporal excitation and aggregation for action recognition,” 2020 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 906-915, 2020. https://doi.org/10.1109/CVPR42600.2020.00099
[17] C. Gu et al., “AVA: A video dataset of spatio-temporally localized atomic visual actions,” 2018 IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 6047-6056, 2018. https://doi.org/10.1109/CVPR.2018.00633
[18] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv:1212.0402, 2012. https://doi.org/10.48550/arXiv.1212.0402
[19] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” Proc. of the 32nd Int. Conf. on Machine Learning, pp. 448-456, 2015.
[20] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?,” arXiv:2102.05095, 2021. https://doi.org/10.48550/arXiv.2102.05095
[21] D. Chen et al., “LS-VIT: Vision Transformer for action recognition based on long and short-term temporal difference,” Frontiers in Neurorobotics, Vol.18, Article No.1457843, 2024. https://doi.org/10.3389/fnbot.2024.1457843
[22] O. Köpüklü, N. Kose, A. Gunduz, and G. Rigoll, “Resource efficient 3D convolutional neural networks,” 2019 IEEE/CVF Int. Conf. on Computer Vision Workshop, pp. 1910-1919, 2019. https://doi.org/10.1109/ICCVW.2019.00240
[23] D. Wei et al., “Efficient dual attention SlowFast networks for video action recognition,” Computer Vision and Image Understanding, Vol.222, Article No.103484, 2022. https://doi.org/10.1016/j.cviu.2022.103484
