Recognition of Emotions on the Basis of Different Levels of Speech Segments

Klára Vicsi; Dávid Sztahó

doi:10.20965/jaciii.2012.p0335

JACIII Vol.16 No.2 pp. 335-340

(2012)

Paper:

Klára Vicsi and Dávid Sztahó

Department of Telecommunication and Media Informatics, Budapest University of Technology and Economics, 2 Magyar tudósok körútja, Budapest 1117, Hungary

Received:

September 15, 2011

Accepted:

November 15, 2011

Published:

March 20, 2012

Keywords:

emotion recognition, intonational phrase, support vector machines

Abstract

Emotions play a very important role in human-human and human-machine communication. They can be expressed by voice, bodily gestures, and facial movements. People’s acceptance of any kind of intelligent device depends, to a large extent, on how the device reflects emotions, which is why automatic emotion recognition has recently become a research topic. In this paper we deal with automatic emotion recognition from the human voice. Numerous papers in this field deal with database creation and with the examination of acoustic features appropriate for such recognition, but only a few attempts have been made to compare the different segmentation units needed to recognize emotions properly in spontaneous speech. In the Laboratory of Speech Acoustics, experiments were run to examine the effect of different speech segment lengths on recognition performance. An emotional database was prepared on the basis of three different segmentation levels: word, intonational phrase, and sentence. Automatic recognition tests were conducted using support vector machines with four basic emotions: neutral, anger, sadness, and joy. The analysis of the results clearly shows that intonational-phrase-sized speech units give the best performance for emotion recognition in continuous speech.

Cite this article as:

K. Vicsi and D. Sztahó, “Recognition of Emotions on the Basis of Different Levels of Speech Segments,” J. Adv. Comput. Intell. Intell. Inform., Vol.16 No.2, pp. 335-340, 2012.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
