
JACIII Vol.29 No.5 pp. 1137-1144 (2025)
doi: 10.20965/jaciii.2025.p1137

Research Paper:

A Parallel CNN-Transformer Framework for Speech Age Recognition

Zheyan Zhang, Renwei Li, and Kewei Chen

Faculty of Mechanical Engineering and Mechanics, Ningbo University
No.818 Fenghua Road, Ningbo, Zhejiang 315211, China

Corresponding author

Received: January 9, 2025
Accepted: May 20, 2025
Published: September 20, 2025
Keywords: speech age recognition, CNN, attention mechanism, spectrogram, frame shift strategy
Abstract

In human–robot interaction, speech age recognition enables personalized services for different age groups, thereby enhancing the intelligence of service robots. However, owing to the diversity of human pronunciation and the similarity of voice features across age groups, accurate speech-based age recognition is difficult to achieve with traditional machine learning techniques. This research therefore proposes a parallel CNN-Transformer framework for speech age recognition that applies deep learning techniques from image processing. Operating on speech spectrograms, parallel CNN and Transformer branches extract the local and global characteristics of the speech signal, respectively. To address data imbalance across age–gender categories, a spectrogram frame-shift strategy is additionally adopted, expanding the training set and enhancing robustness. The impact of gender on speech age recognition is also discussed, and a single system that recognizes both age and gender is implemented. Testing on the English subset of the Common Voice dataset confirms the efficacy of the proposed model, which achieves an average accuracy of 84.9%.
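The abstract does not detail the frame-shift strategy, but a minimal sketch of one plausible reading is given below: extra training spectrograms are generated by sliding the analysis window along the time axis. The offset size, number of copies, and wrap-around handling are all illustrative assumptions, not the authors' published procedure.

import numpy as np

def frame_shift_augment(spectrogram, shift_frames, num_copies):
    """Generate extra training spectrograms by shifting the analysis
    window along the time axis (a hypothetical reading of the paper's
    frame-shift strategy; actual offsets are not given in the abstract).

    spectrogram : 2-D array of shape (freq_bins, time_frames)
    shift_frames: how many time frames to advance per copy
    num_copies  : how many shifted variants to produce
    """
    _, time_frames = spectrogram.shape
    copies = []
    for k in range(1, num_copies + 1):
        start = k * shift_frames
        if start >= time_frames:
            break
        # Shift the window and wrap the tail to the end, so every copy
        # keeps the same frame count and network input shape as the original.
        copies.append(np.roll(spectrogram, -start, axis=1))
    return copies

# Toy usage: a fake 128-bin, 200-frame spectrogram yields 4 shifted variants.
spec = np.random.rand(128, 200)
augmented = frame_shift_augment(spec, shift_frames=25, num_copies=4)
print(len(augmented), augmented[0].shape)  # -> 4 (128, 200)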

Figure: CNN-Transformer framework structure
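The figure above names the parallel structure; a minimal PyTorch sketch of such an architecture appears below. A CNN branch pools local time-frequency patterns while a Transformer encoder branch attends across whole-utterance frames, and the two feature vectors are concatenated for classification. All layer widths, head counts, and the eight-way output are assumptions for illustration, not the authors' published configuration.

import torch
import torch.nn as nn

class ParallelCNNTransformer(nn.Module):
    """Sketch of a parallel CNN + Transformer classifier over spectrograms.
    Hyperparameters here are illustrative guesses."""

    def __init__(self, n_mels=128, d_model=128, n_heads=4,
                 n_layers=2, n_classes=8):
        super().__init__()
        # CNN branch: captures local time-frequency patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        # Transformer branch: models global dependencies across frames.
        self.proj = nn.Linear(n_mels, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, n_layers)
        # Fused head over the concatenated branch features.
        self.classifier = nn.Linear(64 + d_model, n_classes)

    def forward(self, spec):                      # spec: (B, 1, n_mels, T)
        local = self.cnn(spec).flatten(1)         # (B, 64) local features
        seq = spec.squeeze(1).transpose(1, 2)     # (B, T, n_mels) frame sequence
        glob = self.transformer(self.proj(seq)).mean(dim=1)  # (B, d_model)
        return self.classifier(torch.cat([local, glob], dim=1))

# Toy usage: two fake 128-bin, 200-frame spectrograms.
model = ParallelCNNTransformer()
logits = model(torch.randn(2, 1, 128, 200))
print(logits.shape)  # torch.Size([2, 8])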

Cite this article as:
Z. Zhang, R. Li, and K. Chen, “A Parallel CNN-Transformer Framework for Speech Age Recognition,” J. Adv. Comput. Intell. Intell. Inform., Vol.29 No.5, pp. 1137-1144, 2025.
