Paper:
An Improved Fully Convolutional Network Based on Post-Processing with Global Variance Equalization and Noise-Aware Training for Speech Enhancement
Wenlong Li, Kaoru Hirota, Yaping Dai, and Zhiyang Jia
School of Automation, Beijing Institute of Technology
No.5 Zhongguancun South Street, Haidian District, Beijing 100081, China
Corresponding author
An improved fully convolutional network based on post-processing with global variance (GV) equalization and noise-aware training (PN-FCN) for speech enhancement is proposed. It aims to reduce the complexity of the speech enhancement system while addressing the over-smoothed spectrogram problem and the poor generalization capability of regression-based methods. The PN-FCN is fed with noisy speech samples augmented with an estimate of the noise, so the network can exploit additional online noise information to better predict the clean speech. In addition, the PN-FCN applies GV equalization as a post-processing step, a technique that has been shown to improve subjective scores in voice conversion. Finally, because the proposed framework adopts an FCN, its number of parameters is about one-seventh that of a deep neural network (DNN). Experiments on the Valentini-Botinhao dataset demonstrate that the proposed framework achieves improvements in both denoising performance and model training speed.
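To make the two key ingredients of the abstract concrete, the following is a minimal NumPy sketch of (a) noise-aware input construction and (b) per-bin GV equalization. It is illustrative only: the function names, the assumption that the first few frames of each utterance are speech-free (a common noise-aware-training heuristic), and the use of a reference GV measured on clean training data are assumptions, not details taken from the paper.

```python
import numpy as np

def noise_aware_input(noisy_logspec, n_noise_frames=6):
    """Stack a per-utterance noise estimate onto the noisy features.

    Assumes (illustratively) that the first n_noise_frames of the
    utterance contain no speech, so their mean log-power spectrum
    approximates the noise.
    noisy_logspec: (frames, bins) log-power spectrogram.
    Returns: (frames, 2 * bins) noise-aware feature matrix.
    """
    noise_est = noisy_logspec[:n_noise_frames].mean(axis=0)        # (bins,)
    noise_tiled = np.tile(noise_est, (noisy_logspec.shape[0], 1))  # (frames, bins)
    return np.concatenate([noisy_logspec, noise_tiled], axis=1)

def gv_equalize(enhanced_logspec, gv_ref):
    """Post-process: rescale each frequency bin so its global variance
    matches a reference GV (here assumed to be measured on clean
    training data), counteracting the over-smoothing typical of
    regression-based enhancement.
    enhanced_logspec: (frames, bins); gv_ref: (bins,).
    """
    mu = enhanced_logspec.mean(axis=0)              # per-bin mean
    gv_enh = enhanced_logspec.var(axis=0) + 1e-12   # per-bin GV, guarded against 0
    scale = np.sqrt(gv_ref / gv_enh)                # per-bin equalization factor
    return mu + scale * (enhanced_logspec - mu)
```

In a pipeline following this recipe, `noise_aware_input` would prepare the features fed to the FCN, and `gv_equalize` would be applied to the network's enhanced log-power spectrogram before waveform reconstruction.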
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.