Research Paper:
Real-Time Dynamic Gesture Recognition Algorithm Based on Adaptive Information Fusion and Multi-Scale Optimization Transformer
Guangda Lu*,**, Wenhao Sun*,**, Zhuanping Qin*,**, and Tinghang Guo*,**
*School of Automation and Electrical Engineering, Tianjin University of Technology and Education
No.1310 Dagu South Road, Hexi District, Tianjin 300222, China
**Tianjin Key Laboratory of Information Sensing & Intelligent Control
No.1310 Dagu South Road, Hexi District, Tianjin 300222, China
Corresponding author
Gesture recognition is a popular technology in computer vision and an important means of achieving human-computer interaction. To address the limited long-range feature extraction capability of existing dynamic gesture recognition networks built on convolutional operators, we propose a dynamic gesture recognition algorithm based on a spatial pyramid pooling Transformer and optical flow information fusion. We exploit the Transformer's large receptive field to reduce model computation, and we embed spatial pyramid pooling to improve the model's ability to extract features at different scales. We use an optical flow algorithm with a global motion aggregation module to obtain optical flow maps of hand motion, and we extract key frames on the principle of similarity minimization. We also design an adaptive feature fusion method that fuses the spatial and temporal features of the two channels. We demonstrate through ablation experiments how each model component contributes to recognition accuracy. We train and validate the model on the SCUT-DHGA dynamic gesture dataset and on a dataset we collected, and we run real-time dynamic gesture recognition tests with the trained model. The results show that our algorithm achieves high accuracy while keeping the number of parameters balanced, and it recognizes dynamic gestures quickly and accurately in real-time tests.
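To make the key-frame step concrete, the sketch below shows one plausible reading of similarity-minimization selection over per-frame optical flow maps. The abstract does not specify the similarity measure, so the cosine similarity, the function name extract_key_frames, and the NumPy implementation are all illustrative assumptions rather than the paper's actual method.

```python
import numpy as np

def extract_key_frames(flow_maps: np.ndarray, num_keys: int) -> list:
    """Illustrative key-frame selection by similarity minimization.

    flow_maps: array of shape (T, H, W, 2) holding per-frame optical flow.
    Returns indices of the num_keys frames least similar to their
    predecessor (assumed criterion; the paper's measure may differ).
    """
    # Flatten each flow map to a vector for frame-to-frame comparison.
    flat = flow_maps.reshape(len(flow_maps), -1).astype(np.float64)
    sims = []
    for i in range(1, len(flat)):
        denom = np.linalg.norm(flat[i]) * np.linalg.norm(flat[i - 1]) + 1e-8
        sims.append(np.dot(flat[i], flat[i - 1]) / denom)  # cosine similarity
    # Low similarity to the previous frame means a large motion change,
    # so those frames are retained as key frames.
    order = np.argsort(sims)[:num_keys]
    return sorted(int(i) + 1 for i in order)
```

Under these assumptions, a clip of T frames is reduced to the num_keys frames whose motion differs most from their predecessors, which matches the abstract's stated goal of discarding redundant frames before recognition.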
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.