Multimodal Facial Emotion Recognition Using Improved Convolution Neural Networks Model
Chinonso Paschal Udeh*,**,***, Luefeng Chen*,**,***, Sheng Du*,**,***, Min Li*,**,***, and Min Wu*,**,***
*School of Automation, China University of Geosciences
No.388 Lumo Road, Hongshan District, Wuhan 430074, China
**Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
No.388 Lumo Road, Hongshan District, Wuhan 430074, China
***Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
No.388 Lumo Road, Hongshan District, Wuhan 430074, China
In the quest for human-robot interaction (HRI), which has led to the development of emotion recognition, learning, and analysis capabilities, robotics plays a significant role in human perception, attention, decision-making, and social communication. However, accurate emotion recognition in HRI remains a challenge, because multiple sources of information coexist when multimodal facial expressions and head poses are processed by multiple convolutional neural networks (CNNs) combined with deep learning. This research analyzes and improves the robustness of emotion recognition, and proposes a novel approach for optimizing traditional deep neural networks, which tend to fall into poor local optima when their weights are optimized using standard methods. The proposed approach, a hybrid genetic algorithm with stochastic gradient descent (HGASGD), adaptively finds better weights for the network. This hybrid algorithm combines the inherent, implicit parallelism of the genetic algorithm with the better global optimization of stochastic gradient descent (SGD). Experiments show that combining multimodal data, CNNs, and HGASGD provides effective and complete emotion recognition, indicating that the approach is a powerful tool for achieving interaction between humans and robots. To validate the proposed approach, the performance and reliability of our approach and of two variants of HGASGD-based facial emotion recognition (FER) are compared using a large dataset of facial images. Our approach integrates multimodal information from facial expressions and head poses, enabling the system to recognize emotions better. The results show that CNN-HGASGD outperforms CNNs-SGD and other existing state-of-the-art methods in terms of FER.
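The idea of pairing a genetic algorithm with SGD can be illustrated with a minimal sketch. It is not the authors' HGASGD: it is a generic hybrid under stated assumptions, in which each individual in a population of weight vectors is refined by a few mini-batch SGD steps, after which elitist selection, uniform crossover, and Gaussian mutation produce the next generation. The toy linear-regression task and all names (make_data, sgd_steps, hybrid_ga_sgd, and the hyperparameter values) are hypothetical and stand in for the CNN weights and facial-image data described in the abstract.

```python
import random

def make_data(n, rng):
    # Hypothetical toy task: y = 2*x1 - 3*x2 + 0.5 (third input is a constant bias term).
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append(((x1, x2, 1.0), 2.0 * x1 - 3.0 * x2 + 0.5))
    return data

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, data):
    # Mean squared error over the whole dataset; used as the GA fitness (lower is better).
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def sgd_steps(w, data, rng, lr=0.1, steps=20, batch=8):
    # Local refinement: a few mini-batch SGD steps on the squared error.
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in rng.sample(data, batch):
            err = predict(w, x) - y
            for i, xi in enumerate(x):
                grad[i] += 2.0 * err * xi / batch
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

def crossover(a, b, rng):
    # Uniform crossover: each weight is copied from one of the two parents.
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]

def mutate(w, rng, rate=0.2, scale=0.1):
    # Gaussian mutation applied independently to each weight with probability `rate`.
    return [wi + rng.gauss(0.0, scale) if rng.random() < rate else wi for wi in w]

def hybrid_ga_sgd(data, pop_size=10, generations=15, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        # 1) SGD refines every individual in the population.
        pop = [sgd_steps(w, data, rng) for w in pop]
        # 2) Elitist selection keeps the better half.
        pop.sort(key=lambda w: loss(w, data))
        elite = pop[: pop_size // 2]
        # 3) Crossover and mutation refill the population from the elite.
        children = [
            mutate(crossover(rng.choice(elite), rng.choice(elite), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        pop = elite + children
    return min(pop, key=lambda w: loss(w, data))

if __name__ == "__main__":
    data = make_data(200, random.Random(1))
    best = hybrid_ga_sgd(data)
    print("best weights:", best, "loss:", loss(best, data))
```

In this sketch the population supplies diverse starting points and SGD supplies the local refinement; a CNN setting would replace the weight vector with the network parameters and the squared error with the classification loss.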
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.