
JACIII Vol.27 No.6, pp. 1096-1107 (2023)
doi: 10.20965/jaciii.2023.p1096

Research Paper:

Real-Time Dynamic Gesture Recognition Algorithm Based on Adaptive Information Fusion and Multi-Scale Optimization Transformer

Guangda Lu*,**, Wenhao Sun*,**,†, Zhuanping Qin*,**, and Tinghang Guo*,**

*School of Automation and Electrical Engineering, Tianjin University of Technology and Education
No.1310 Dagu South Road, Hexi District, Tianjin 300222, China

**Tianjin Key Laboratory of Information Sensing & Intelligent Control
No.1310 Dagu South Road, Hexi District, Tianjin 300222, China

†Corresponding author

Received: April 4, 2023
Accepted: July 9, 2023
Published: November 20, 2023
Keywords: dynamic gesture recognition, Transformer, optical flow, information fusion
Abstract

Gesture recognition is a popular technology in the field of computer vision and an important technical means of achieving human-computer interaction. To address the limited long-range feature extraction capability of existing dynamic gesture recognition networks built on convolutional operators, we propose a dynamic gesture recognition algorithm based on a spatial pyramid pooling Transformer and optical flow information fusion. We exploit the Transformer's large receptive field to reduce model computation and embed spatial pyramid pooling to improve the model's ability to extract features at different scales. We use an optical flow algorithm with a global motion aggregation module to obtain optical flow maps of hand motion, and extract key frames based on a similarity-minimization principle. We also design an adaptive feature fusion method to fuse the spatial and temporal features of the two channels. Finally, we demonstrate through ablation experiments how the model components contribute to recognition performance. We conduct training and validation on the SCUT-DHGA dynamic gesture dataset and on a dataset we collected, and we perform real-time dynamic gesture recognition tests using the trained model. The results show that our algorithm achieves high accuracy while keeping the parameter count balanced, and it achieves fast and accurate recognition of dynamic gestures in real-time tests.
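The key-frame selection and adaptive dual-channel fusion described in the abstract can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration, not the authors' released code: the function names, tensor shapes, the cosine-similarity criterion, and the sigmoid-gated fusion rule are assumptions made for clarity, and the paper's GMA optical-flow and spatial pyramid pooling Transformer modules are not reproduced here.

    # Hedged sketch (assumed shapes and names, not the paper's implementation).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def select_key_frames(frames: torch.Tensor, num_keep: int) -> torch.Tensor:
        """Pick the num_keep frames least similar to their predecessor.

        frames: (T, C, H, W) clip, e.g., optical-flow maps of hand motion.
        Returns sorted indices of the selected frames, always including frame 0.
        """
        flat = frames.flatten(1)                                # (T, C*H*W)
        sim = F.cosine_similarity(flat[1:], flat[:-1], dim=1)   # similarity to previous frame
        order = torch.argsort(sim)                              # ascending: low similarity = large motion change
        picked = order[: max(num_keep - 1, 0)] + 1              # shift: sim[i] compares frame i+1 to frame i
        return torch.sort(torch.cat([torch.tensor([0]), picked])).values

    class AdaptiveFusion(nn.Module):
        """Fuse spatial (RGB) and temporal (optical-flow) descriptors with a learned gate."""

        def __init__(self, dim: int):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
            # spatial, temporal: (B, dim) descriptors from the two streams.
            alpha = self.gate(torch.cat([spatial, temporal], dim=1))  # per-feature weight in (0, 1)
            return alpha * spatial + (1.0 - alpha) * temporal

    if __name__ == "__main__":
        clip = torch.randn(16, 2, 64, 64)            # 16 flow maps of a gesture clip (assumed size)
        print(select_key_frames(clip, num_keep=8))   # indices of the 8 retained frames
        fuse = AdaptiveFusion(dim=256)
        fused = fuse(torch.randn(4, 256), torch.randn(4, 256))
        print(fused.shape)                           # torch.Size([4, 256])

In this sketch, frames least similar to their predecessor are kept as key frames (large motion change between frames), and a learned sigmoid gate weights the spatial and temporal streams per feature dimension before classification.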

Cite this article as:
G. Lu, W. Sun, Z. Qin, and T. Guo, “Real-Time Dynamic Gesture Recognition Algorithm Based on Adaptive Information Fusion and Multi-Scale Optimization Transformer,” J. Adv. Comput. Intell. Intell. Inform., Vol.27 No.6, pp. 1096-1107, 2023.

