JACIII Vol.28 No.3 pp. 520-527
doi: 10.20965/jaciii.2024.p0520

Research Paper:

A Multimodal Fusion Behaviors Estimation Method for Public Dangerous Monitoring

Renkai Hou, Xiangyang Xu, Yaping Dai, Shuai Shao, and Kaoru Hirota

Beijing Institute of Technology
No.5 Zhongguancun South Street, Haidian District, Beijing 100081, China

Corresponding author

Received: March 25, 2023
Accepted: December 6, 2023
Published: May 20, 2024

Keywords: group behavior recognition, speech emotion recognition, multimodal fusion, deep learning

At present, the identification of dangerous behaviors in public places relies largely on manual monitoring, which is subjective and inefficient. This paper proposes an automatic identification method for dangerous behaviors in public places that analyzes group behavior and speech emotion with deep learning networks and then fuses the multimodal information. Based on the fusion results, the emotional atmosphere of a crowd can be judged and early warnings and alarms issued for possible dangerous behaviors. Experiments show that the proposed algorithm identifies dangerous behaviors accurately and has substantial application value.
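The fusion step described above can be illustrated with a minimal sketch. The snippet below assumes decision-level fusion by weighted averaging of the class-probability outputs of the two modality-specific networks; the paper's exact fusion rule, class labels, and weight are not given here, so all names and values are illustrative.

```python
import numpy as np

def fuse_predictions(p_behavior, p_emotion, w=0.5):
    """Decision-level fusion: weighted average of two class-probability
    vectors (one common multimodal fusion scheme, used here only as an
    illustration of the idea)."""
    p_behavior = np.asarray(p_behavior, dtype=float)
    p_emotion = np.asarray(p_emotion, dtype=float)
    fused = w * p_behavior + (1.0 - w) * p_emotion
    return fused / fused.sum()  # renormalize to a probability vector

# Hypothetical outputs over three illustrative crowd states
# {calm, agitated, dangerous}:
p_video = [0.2, 0.3, 0.5]   # from a group-behavior recognition network
p_audio = [0.1, 0.3, 0.6]   # from a speech-emotion recognition network

fused = fuse_predictions(p_video, p_audio, w=0.6)
alarm = fused[2] > 0.5      # warn if the "dangerous" class dominates
```

With these hypothetical inputs the fused vector favors the "dangerous" class, so an early warning would be raised; in practice the weight `w` would be tuned on validation data.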

Cite this article as:
R. Hou, X. Xu, Y. Dai, S. Shao, and K. Hirota, “A Multimodal Fusion Behaviors Estimation Method for Public Dangerous Monitoring,” J. Adv. Comput. Intell. Intell. Inform., Vol.28 No.3, pp. 520-527, 2024.
