JACIII Vol.25 No.3 pp. 375-382
doi: 10.20965/jaciii.2021.p0375


Speaker Localization Based on Audio-Visual Bimodal Fusion

Ying-Xin Zhu*,**,*** and Hao-Ran Jin*,**,***,†

*School of Automation, China University of Geosciences
388 Lumo Road, Hongshan District, Wuhan, Hubei 430074, China

**Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems
Wuhan, Hubei 430074, China

***Engineering Research Center of Intelligent Technology for Geo-Exploration, Ministry of Education
Wuhan, Hubei 430074, China

†Corresponding author

Received: March 20, 2021
Accepted: March 31, 2021
Published: May 20, 2021
Keywords: speaker positioning, microphone array, image, information fusion

The demand for fluent human–computer interaction is increasing worldwide; thus, active localization of the speaker by the machine has become a problem worth exploring. Because single-modality localization methods offer limited stability and accuracy, whereas multimodal methods can exploit information redundancy to improve accuracy and interference resistance, a speaker localization method based on the multimodal fusion of voice and image is proposed. First, a voice localization method based on the time difference of arrival (TDOA) in a microphone array and a face detection method based on the AdaBoost algorithm are presented. Second, a multimodal method that fuses speech and image in both space and time is proposed; it uses a coordinate system converter and a frame rate tracker. The proposed method was tested with the speaker standing at 15 different points, with 50 trials at each point. The experimental results demonstrate high localization accuracy when the speaker stands in front of the positioning system within a certain range.
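The TDOA stage summarized above can be sketched, for a single microphone pair, with a generalized cross-correlation with phase transform (GCC-PHAT) delay estimate — a standard technique for this task, shown here only as an illustration. The signals, sampling rate, and 0.2 m microphone spacing are assumptions for the sketch, not the paper's experimental setup:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the arrival delay of `sig` relative to `ref` (in seconds)
    using the generalized cross-correlation with PHAT weighting."""
    n = sig.size + ref.size
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-15            # PHAT: discard magnitude, keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

# Synthetic two-microphone check: the same pulse reaches the second
# microphone 5 samples later at fs = 16 kHz.
fs = 16000
ref = np.zeros(1024); ref[100] = 1.0
sig = np.zeros(1024); sig[105] = 1.0
tau = gcc_phat(sig, ref, fs)          # ≈ 5 / 16000 s

# Map the delay to an azimuth angle for the assumed 0.2 m spacing.
c, d = 343.0, 0.2                     # speed of sound (m/s), mic spacing (m)
theta = np.degrees(np.arcsin(np.clip(c * tau / d, -1.0, 1.0)))
```

With more microphone pairs, several such delay estimates can be intersected to obtain a source position rather than a single bearing, which is the role the microphone array plays in the proposed system.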

Cite this article as:
Y. Zhu and H. Jin, “Speaker Localization Based on Audio-Visual Bimodal Fusion,” J. Adv. Comput. Intell. Intell. Inform., Vol.25 No.3, pp. 375-382, 2021.
References:
  [1] G. Deak, K. Curran, and J. Condell, “A survey of active and passive indoor localisation systems,” Computer Communications, Vol.35, pp. 1939-1954, doi: 10.1016/j.comcom.2012.06.004, 2012.
  [2] H. Lim, I. C. Yoo, Y. Cho, and D. Yook, “Speaker localization in noisy environments using steered response voice power,” IEEE Trans. on Consumer Electronics, Vol.61, No.1, pp. 112-118, doi: 10.1109/TCE.2015.7064118, 2015.
  [3] A. Sepas-Moghaddam, F. M. Pereira, and P. L. Correia, “Face recognition: a novel multi-level taxonomy based survey,” IET Biometrics, Vol.9, No.2, pp. 58-67, doi: 10.1049/iet-bmt.2019.0001, 2020.
  [4] J. Qu, H. Shi, N. Qiao, C. Wu, C. Su, and A. Razi, “New three-dimensional positioning algorithm through integrating TDOA and Newton’s method,” EURASIP J. on Wireless Communications and Networking, Article No.77, doi: 10.1186/s13638-020-01684-7, 2020.
  [5] A. Pourmohammad and S. M. Ahadi, “Real Time High Accuracy 3-D PHAT-Based Sound Source Localization Using a Simple 4-Microphone Arrangement,” IEEE Systems J., Vol.6, No.3, pp. 455-468, doi: 10.1109/JSYST.2011.2176766, 2012.
  [6] M. Gutiérrez-Muñoz, A. González-Salazar, and M. Coto-Jiménez, “Evaluation of Mixed Deep Neural Networks for Reverberant Speech Enhancement,” Biomimetics, Vol.5, doi: 10.3390/biomimetics5010001, 2019.
  [7] D. Blacodon and J. Bulté, “Reverberation cancellation in a closed test section of a wind tunnel using a multi-microphone cepstral method,” J. of Sound and Vibration, Vol.333, pp. 2669-2687, doi: 10.1016/j.jsv.2013.12.012, 2014.
  [8] M. Unoki and M. Akagi, “A method of signal extraction from noisy signal based on auditory scene analysis,” Speech Communication, Vol.27, pp. 261-279, doi: 10.1016/S0167-6393(98)00077-6, 1999.
  [9] S. U. Pillai, Y. Bar-Ness, and F. Haber, “A new approach to array geometry for improved spatial spectrum estimation,” Proc. of the IEEE, Vol.73, No.10, pp. 1522-1524, doi: 10.1109/PROC.1985.13324, 1985.
  [10] N. Roman, D. L. Wang, and G. J. Brown, “Speech segregation based on sound localization,” The J. of the Acoustical Society of America, Vol.114, pp. 2236-2252, doi: 10.1121/1.1610463, 2003.
  [11] D. Salvati, C. Drioli, and G. L. Foresti, “A Low-Complexity Robust Beamforming Using Diagonal Unloading for Acoustic Source Localization,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, Vol.26, No.3, pp. 609-622, doi: 10.1109/TASLP.2017.2789321, 2018.
  [12] Y. Guo, H. Zhu, and X. Dang, “Tracking multiple acoustic sources by adaptive fusion of TDOAs across microphone pairs,” Digital Signal Processing, Vol.106, doi: 10.1016/j.dsp.2020.102853, 2020.
  [13] Z. Taha, J. Y. Chew, and H. J. Yap, “Omnidirectional Vision for Mobile Robot Navigation,” J. Adv. Comput. Intell. Intell. Inform., Vol.14, No.1, pp. 55-62, doi: 10.20965/jaciii.2010.p0055, 2010.
  [14] Y. Maeda and W. Shimizuhira, “Multilayered Fuzzy Behavior Control for an Autonomous Mobile Robot with Multiple Omnidirectional Vision System: MOVIS,” J. Adv. Comput. Intell. Intell. Inform., Vol.11, No.1, pp. 21-27, doi: 10.20965/jaciii.2007.p0021, 2007.
  [15] H. A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.20, pp. 23-28, doi: 10.1109/34.655647, 1998.
  [16] H. A. Rowley, S. Baluja, and T. Kanade, “Rotation invariant neural network-based face detection,” Proc. 1998 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 38-44, doi: 10.1109/CVPR.1998.698585, 1998.
  [17] P. Viola and M. Jones, “Robust real-time object detection,” Int. J. of Computer Vision, pp. 34-47, doi: 10.1117/12.228952, 2001.
  [18] M. Melek, A. Khattab, and M. F. Abu-Elyazeed, “Fast matching pursuit for sparse representation-based face recognition,” IET Image Processing, Vol.12, No.10, pp. 1807-1814, doi: 10.1049/iet-ipr.2017.1263, 2018.
  [19] Z. T. Liu, S. H. Li, W. H. Cao, D. Y. Li, and M. Hao, “Combining 2D gabor and local binary pattern for facial expression recognition using extreme learning machine,” J. Adv. Comput. Intell. Intell. Inform., Vol.23, No.3, pp. 444-455, doi: 10.20965/jaciii.2019.p0444, 2019.
  [20] K. M. Kudiri, A. M. Said, and M. Y. Nayan, “Human emotion detection through speech and facial expressions,” 2016 3rd Int. Conf. on Computer and Information Sciences (ICCOINS), pp. 351-356, doi: 10.1109/ICCOINS.2016.7783240, 2016.
  [21] J. He, C. Zhang, X. Li et al., “Survey of research on multimodal fusion technology for deep learning,” Computer Engineering, Vol.46, pp. 1-11, doi: 10.19678/j.issn.1000-3428.0057370, 2020 (in Chinese).
  [22] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, Vol.521, pp. 436-444, doi: 10.1038/nature14539, 2015.
  [23] R. R. Murphy, “Computer vision and machine learning in science fiction,” Science Robotics, Vol.4, pp. 7221-7235, doi: 10.1126/scirobotics.aax7421, 2019.
  [24] Z. T. Liu, M. Wu, D. Y. Li, L. F. Chen, F. Dong, Y. Yamazaki, and K. Hirota, “Concept of Fuzzy Atmosfield for Representing Communication Atmosphere and its Application to Humans-Robots Interaction,” J. Adv. Comput. Intell. Intell. Inform., Vol.17, No.1, pp. 3-17, doi: 10.20965/jaciii.2013.p0003, 2013.
  [25] X. Zhang and X. Wang, “Novel Survey on the Color-Image Graying Algorithm,” 2016 IEEE Int. Conf. on Computer and Information Technology (CIT), pp. 750-753, doi: 10.1109/CIT.2016.32, 2016.
  [26] S. Patel and M. Goswami, “Comparative analysis of Histogram Equalization techniques,” 2014 Int. Conf. on Contemporary Computing and Informatics (IC3I), pp. 167-168, doi: 10.1109/IC3I.2014.7019808, 2014.
  [27] E. D’Arca, N. M. Robertson, and J. Hopgood, “Person tracking via audio and video fusion,” 9th IET Data Fusion & Target Tracking Conf. (DF&TT 2012): Algorithms and Applications, pp. 1-6, doi: 10.1049/cp.2012.0410, 2012.
  [28] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol.22, No.11, pp. 1330-1334, doi: 10.1109/34.888718, 2000.
