Optimizing HMM Speech Synthesis for Low-Resource Devices

Bálint Tóth; Géza Németh

doi:10.20965/jaciii.2012.p0327

JACIII Vol.16 No.2 pp. 327-334 (2012)

Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Magyar tudósok krt. 2, Budapest 1117, Hungary

Received: September 15, 2011

Accepted: November 15, 2011

Published: March 20, 2012

Keywords: text-to-speech, speech synthesis, hidden Markov model, mobile devices, low-resource devices

Abstract

Speech synthesis can be an important modality in Cognitive Infocommunications (CogInfoCom). Speech output is beneficial when the visual output of a system is blocked or difficult to reach. Extra information can be added to the output by applying different voice characteristics and emotional speech. CogInfoCom systems often run on low-resource devices. This paper describes the application of Hidden Markov Model (HMM) based speech synthesis to such systems. Several optimization steps, e.g., changing HMM parameters and applying performance-specific programming methods, are analyzed on three different smartphones in terms of speed, footprint size, and subjective speech quality. The goal is to approach real-time functionality while keeping the speech quality as high as possible. Successful optimization steps and resource-dependent optimal settings are introduced.
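The speed criterion in the abstract is commonly expressed as a real-time factor (RTF): processing time divided by the duration of the generated audio, where a value below 1 means synthesis runs faster than playback. The abstract does not state how speed was measured in the paper, so the following Python sketch is only an illustration under that assumption. The names `real_time_factor` and `dummy_synthesizer` are hypothetical, and the 16 kHz sample rate is an assumed value.

```python
import time
from typing import Callable, List, Sequence


def real_time_factor(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate_hz: int,
) -> float:
    """Return processing time divided by the duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed_s = time.perf_counter() - start

    audio_s = len(samples) / sample_rate_hz
    if audio_s == 0:
        raise ValueError("synthesizer produced no audio")
    return elapsed_s / audio_s


def dummy_synthesizer(text: str) -> List[float]:
    """Hypothetical stand-in for a TTS engine: 0.1 s of silence per character,
    assuming a 16 kHz sample rate (1600 samples per character)."""
    return [0.0] * (1600 * len(text))


if __name__ == "__main__":
    rtf = real_time_factor(dummy_synthesizer, "hello", 16000)
    print(f"real-time factor: {rtf:.6f}")
```

On a real device, `dummy_synthesizer` would be replaced by a call into the synthesis engine, and the measurement would typically be repeated over several sentences and averaged, because a single short utterance gives a noisy estimate.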

Cite this article as:

B. Tóth and G. Németh, “Optimizing HMM Speech Synthesis for Low-Resource Devices,” J. Adv. Comput. Intell. Intell. Inform., Vol.16 No.2, pp. 327-334, 2012.

This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.
