JACIII Vol.28 No.3 pp. 520-527
doi: 10.20965/jaciii.2024.p0520

Research Paper:

A Multimodal Fusion Behaviors Estimation Method for Public Dangerous Monitoring

Renkai Hou, Xiangyang Xu, Yaping Dai, Shuai Shao, and Kaoru Hirota

Beijing Institute of Technology
No.5 Zhongguancun South Street, Haidian District, Beijing 100081, China

Corresponding author

Received: March 25, 2023
Accepted: December 6, 2023
Published: May 20, 2024

Keywords: group behavior recognition, speech emotion recognition, multimodal fusion, deep learning

At present, the identification of dangerous behaviors in public places relies largely on manual monitoring, which is subjective and inefficient. This paper proposes an automatic identification method for dangerous behaviors in public places that analyzes group behavior and speech emotion with deep learning networks and then fuses the multimodal information. Based on the fusion results, the emotional atmosphere of a crowd can be judged and early warnings and alarms issued for possible dangerous behaviors. Experiments show that the proposed algorithm identifies dangerous behaviors accurately and has substantial application value.
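The fusion step described above can be illustrated with a minimal sketch. The snippet below assumes decision-level fusion by weighted averaging of the class-probability outputs of the two modality-specific networks; the paper's exact fusion rule, class labels, and weight are not given here, so all names and values are illustrative.

```python
import numpy as np

def fuse_predictions(p_behavior, p_emotion, w=0.5):
    """Decision-level fusion: weighted average of two class-probability
    vectors (one common multimodal fusion scheme, used here only as an
    illustration of the idea)."""
    p_behavior = np.asarray(p_behavior, dtype=float)
    p_emotion = np.asarray(p_emotion, dtype=float)
    fused = w * p_behavior + (1.0 - w) * p_emotion
    return fused / fused.sum()  # renormalize to a probability vector

# Hypothetical outputs over three illustrative crowd states
# {calm, agitated, dangerous}:
p_video = [0.2, 0.3, 0.5]   # from a group-behavior recognition network
p_audio = [0.1, 0.3, 0.6]   # from a speech-emotion recognition network

fused = fuse_predictions(p_video, p_audio, w=0.6)
alarm = fused[2] > 0.5      # warn if the "dangerous" class dominates
```

With these hypothetical inputs the fused vector favors the "dangerous" class, so an early warning would be raised; in practice the weight `w` would be tuned on validation data.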

Cite this article as:
R. Hou, X. Xu, Y. Dai, S. Shao, and K. Hirota, “A Multimodal Fusion Behaviors Estimation Method for Public Dangerous Monitoring,” J. Adv. Comput. Intell. Intell. Inform., Vol.28 No.3, pp. 520-527, 2024.
