Pitch-Cluster-Map Based Daily Sound Recognition for Mobile Robot Audition

Yoko Sasaki; Masahito Kaneyoshi; Satoshi Kagami; Hiroshi Mizoguchi; Tadashi Enomoto

doi:10.20965/jrm.2010.p0402


JRM Vol.22 No.3 pp. 402-410


(2010)

Paper:



Yoko Sasaki*, Masahito Kaneyoshi*, Satoshi Kagami*,
Hiroshi Mizoguchi*,**, and Tadashi Enomoto***

*Digital Human Research Center, National Institute of Advanced Science and Technology, 2-3-26 Aomi, Kouto-ku, Tokyo 135-0064, Japan.

**Dept. of Mechanical Engineering, Tokyo University of Science, 2641 Yamazaki, Noda-shi, Chiba 278-8510, Japan.

***The Kansai Electric Power Co. Inc., 3-11-20 Nakoji, Amagasaki, Hyogo 661-0974, Japan.

Received: September 30, 2009

Accepted: April 12, 2010

Published: June 20, 2010

Keywords: sound identification, microphone array, mobile robot

Abstract

This paper presents a sound identification method for a mobile robot in home and office environments. We propose a short-term sound recognition method that uses a sound database (DB) of Pitch-Cluster-Maps (PCMs) based on a Vector Quantization approach. A binarized frequency spectrum is used to generate the PCM codebook, which describes a variety of sound sources, not only voice, from short-term sound input. PCM sound identification requires several tens of milliseconds of sound input, which makes it suitable for mobile robot applications in which conditions change continuously and dynamically. We implemented the method in a mobile robot audition system using a 32-channel microphone array. Robot noise reduction and sound source tracking using our proposal are applied to the robot audition system, and we evaluate daily sound recognition performance for separated sound sources from a moving robot.
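
As a rough illustration of the idea described in the abstract (a binarized frequency spectrum matched against a vector-quantization codebook), the following Python sketch may help. It is not the authors' implementation. The naive DFT, the peak-ratio binarization rule, the Hamming-distance matching, the synthetic test tones, and every function and label name below are assumptions made only for this example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for a short frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def binarize(spectrum, ratio=0.5):
    """Set a bin to 1 if its magnitude reaches `ratio` of the peak.

    The threshold rule is a hypothetical choice, not taken from the paper.
    """
    peak = max(spectrum)
    if peak == 0:
        return tuple(0 for _ in spectrum)
    return tuple(1 if m >= ratio * peak else 0 for m in spectrum)


def hamming(a, b):
    """Number of positions where two binary patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def identify(frame, codebook):
    """Return the label whose nearest codeword best matches the frame."""
    pattern = binarize(magnitude_spectrum(frame))
    return min(
        codebook,
        key=lambda label: min(hamming(pattern, c) for c in codebook[label]),
    )


def tone(freq_bin, n=64):
    """Synthetic sine frame whose energy falls in a single DFT bin."""
    return [math.sin(2 * math.pi * freq_bin * t / n) for t in range(n)]


# Toy codebook: one binary codeword per (made-up) sound class.
codebook = {
    "low_tone": [binarize(magnitude_spectrum(tone(4)))],
    "high_tone": [binarize(magnitude_spectrum(tone(20)))],
}

# A low tone with a weaker interfering component is still matched to "low_tone".
mixed = [a + 0.3 * b for a, b in zip(tone(4), tone(9))]
print(identify(mixed, codebook))
```

In this toy setup a single short frame is enough to pick a class, which loosely mirrors the short-term matching the abstract describes. A real system would build the codebook from recorded sounds and would operate on separated sources from the microphone array.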

Cite this article as:

Y. Sasaki, M. Kaneyoshi, S. Kagami, H. Mizoguchi, and T. Enomoto, “Pitch-Cluster-Map Based Daily Sound Recognition for Mobile Robot Audition,” J. Robot. Mechatron., Vol.22 No.3, pp. 402-410, 2010.



This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.

