Simultaneous Execution of Dereverberation, Denoising, and Speaker Separation Using a Neural Beamformer for Adapting Robots to Real Environments

Daichi Nagano; Kazuo Nakazawa

doi:10.20965/jrm.2022.p1399

single-rb.php

« previous

JRM Vol.34 No.6 pp. 1399-1410

doi: 10.20965/jrm.2022.p1399

(2022)

Paper:

Views over last 60 days: 434

Simultaneous Execution of Dereverberation, Denoising, and Speaker Separation Using a Neural Beamformer for Adapting Robots to Real Environments

Daichi Nagano and Kazuo Nakazawa

Faculty of Science and Technology, Keio University
3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan

Received:

March 18, 2022

Accepted:

July 5, 2022

Published:

December 20, 2022

Keywords:

communication robot, neural beamformer, dereverberation, denoising, speech extraction

Abstract

It remains challenging for robots to accurately perform sound source localization and speech recognition in a real environment with reverberation, noise, and the voices of multiple speakers. Accordingly, we propose “U-TasNet-Beam,” a speech extraction method for extracting only the target speaker’s voice from all ambient sounds in a real environment. U-TasNet-Beam is a neural beamformer comprising three elements: a neural network for removing reverberation and noise, a second neural network for separating the voices of multiple speakers, and a minimum variance distortionless response (MVDR) beamformer. Experiments with simulated data and recorded data show that the proposed U-TasNet-Beam can improve the accuracy of sound source localization and speech recognition in robots compared to the conventional methods in a noisy, reverberant, and multi-speaker environment. In addition, we propose the spatial correlation matrix loss (SCM loss) as a loss function for the neural network learning the spatial information of the sound. By using the SCM loss, we can improve the speech extraction performance of the neural beamformer.

Schematic view of U-TasNet-Beam

Cite this article as:

D. Nagano and K. Nakazawa, “Simultaneous Execution of Dereverberation, Denoising, and Speaker Separation Using a Neural Beamformer for Adapting Robots to Real Environments,” J. Robot. Mechatron., Vol.34 No.6, pp. 1399-1410, 2022.

Data files:

References

[1] O. Sugiyama, S. Uemura, A. Nagamine, R. Kojima, K. Nakamura, and K. Nakadai, “Outdoor acoustic event identification with DNN using a quadrotor-embedded microphone array,” J. Robot. Mechatron., Vol.29, No.1, pp. 188-197, 2017.
[2] K. Sekiguchi, Y. Bando, K. Itoyama, and K. Yoshii, “Layout optimization of cooperative distributed microphone arrays based on estimation of sound source performance,” J. Robot. Mechatron., Vol.29, No.1, pp. 83-93, 2017.
[3] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, Vol.401, pp. 788-791, 1999.
[4] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing Voice Separation with Deep U-Net Convolutional Networks,” The 18th Int. Society for Music Information Retrieval Conf., 2017.
[5] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Trans. on Audio, Speech, and Language processing, Vol.27, No.8, pp. 1256-1266, 2019.
[6] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2021.
[7] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 5554-5558, 2018.
[8] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voice-Filter: Targeted voice separation by speaker-conditioned spectrogram masking,” Proc. of Interspeech, pp. 2728-2732, 2019.
[9] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain multichannel linear filtering for noise reduction,” IEEE Trans. on Audio, Speech, Language Process., Vol.18, No.2, pp. 260-276, 2010.
[10] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Trans. on Audio, Speech, Language Process, Vol.15, No.5, pp. 1529-1539, 2007.
[11] P. Comon, “Independent component analysis, a new concept?,” Signal Processing, Vol.36, No.3, pp. 287-314, 1994.
[12] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. on Audio, Speech and Language Processing, Vol.24, No.9, pp. 1626-1641, 2016.
[13] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 196-200, 2016.
[14] W. Jiang, F. Wen, and P. Liu, “Robust beamforming for speech recognition using DNN-based time-frequency masks estimation,” IEEE Access, Vol.6, pp. 52385-52392, 2018.
[15] T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, “Beam-Tasnet: Time-domain audio separation network meets frequency-domain beamformer,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 6384-6388, 2020.
[16] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Trans. on Antennas and Propagation, Vol.34, No.3, pp. 276-280, 1986.
[17] H. S. Choi, J. H. Kim, J. Huh, A. Kim, J. W. Ha, and K. Lee, “Phase-aware speech enhancement with deep complex u-net,” Int. Conf. on Learning, Representations, 2018.
[18] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 241-245, 2017.
[19] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 4879-4888, 2018.
[20] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, A. R. Saurous, J. Skoglund, and F. R. Lyon, “Exploring tradeoffs in models for low-latency speech enhancement,” The 16th Int. Workshop on Acoustic Signal Enhancement, 2018.
[21] D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2018.
[22] J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “On end-to-end multi-channel time domain speech separation in reverberant environments,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2020.
[23] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks,” Proc. of Interspeech, pp. 352-356, 2016.
[24] R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulations and array processing algorithms,” IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2018.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 Internationa License.

[1] [1] O. Sugiyama, S. Uemura, A. Nagamine, R. Kojima, K. Nakamura, and K. Nakadai, “Outdoor acoustic event identification with DNN using a quadrotor-embedded microphone array,” J. Robot. Mechatron., Vol.29, No.1, pp. 188-197, 2017.

[2] [2] K. Sekiguchi, Y. Bando, K. Itoyama, and K. Yoshii, “Layout optimization of cooperative distributed microphone arrays based on estimation of sound source performance,” J. Robot. Mechatron., Vol.29, No.1, pp. 83-93, 2017.

[3] [3] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, Vol.401, pp. 788-791, 1999.

[4] [4] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, “Singing Voice Separation with Deep U-Net Convolutional Networks,” The 18th Int. Society for Music Information Retrieval Conf., 2017.

[5] [5] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Trans. on Audio, Speech, and Language processing, Vol.27, No.8, pp. 1256-1266, 2019.

[6] [6] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2021.

[7] [7] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 5554-5558, 2018.

[8] [8] Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voice-Filter: Targeted voice separation by speaker-conditioned spectrogram masking,” Proc. of Interspeech, pp. 2728-2732, 2019.

[9] [9] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain multichannel linear filtering for noise reduction,” IEEE Trans. on Audio, Speech, Language Process., Vol.18, No.2, pp. 260-276, 2010.

[10] [10] E. Warsitz and R. Haeb-Umbach, “Blind acoustic beamforming based on generalized eigenvalue decomposition,” IEEE Trans. on Audio, Speech, Language Process, Vol.15, No.5, pp. 1529-1539, 2007.

[11] [11] P. Comon, “Independent component analysis, a new concept?,” Signal Processing, Vol.36, No.3, pp. 287-314, 1994.

[12] [12] D. Kitamura, N. Ono, H. Sawada, H. Kameoka, and H. Saruwatari, “Determined blind source separation unifying independent vector analysis and nonnegative matrix factorization,” IEEE/ACM Trans. on Audio, Speech and Language Processing, Vol.24, No.9, pp. 1626-1641, 2016.

[13] [13] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 196-200, 2016.

[14] [14] W. Jiang, F. Wen, and P. Liu, “Robust beamforming for speech recognition using DNN-based time-frequency masks estimation,” IEEE Access, Vol.6, pp. 52385-52392, 2018.

[15] [15] T. Ochiai, M. Delcroix, R. Ikeshita, K. Kinoshita, T. Nakatani, and S. Araki, “Beam-Tasnet: Time-domain audio separation network meets frequency-domain beamformer,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 6384-6388, 2020.

[16] [16] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Trans. on Antennas and Propagation, Vol.34, No.3, pp. 276-280, 1986.

[17] [17] H. S. Choi, J. H. Kim, J. Huh, A. Kim, J. W. Ha, and K. Lee, “Phase-aware speech enhancement with deep complex u-net,” Int. Conf. on Learning, Representations, 2018.

[18] [18] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 241-245, 2017.

[19] [19] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 4879-4888, 2018.

[20] [20] K. Wilson, M. Chinen, J. Thorpe, B. Patton, J. Hershey, A. R. Saurous, J. Skoglund, and F. R. Lyon, “Exploring tradeoffs in models for low-latency speech enhancement,” The 16th Int. Workshop on Acoustic Signal Enhancement, 2018.

[21] [21] D. Rethage, J. Pons, and X. Serra, “A wavenet for speech denoising,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2018.

[22] [22] J. Zhang, C. Zorila, R. Doddipatla, and J. Barker, “On end-to-end multi-channel time domain speech separation in reverberant environments,” IEEE Int. Conf. on Acoustics, Speech and Signal Processing, 2020.

[23] [23] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, “Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks,” Proc. of Interspeech, pp. 352-356, 2016.

[24] [24] R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A Python package for audio room simulations and array processing algorithms,” IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2018.