A Parallel CNN-Transformer Framework for Speech Age Recognition

Zheyan Zhang; Renwei Li; Kewei Chen

doi:10.20965/jaciii.2025.p1137

Abstract

In human–robot interaction, speech age recognition allows service robots to provide personalized services for different age groups, thereby enhancing their intelligence. However, because human pronunciation is diverse and voice features are similar across age groups, accurate speech-based age recognition is difficult to achieve with traditional machine learning techniques. This research therefore proposes a parallel CNN-Transformer framework for speech age recognition that draws on deep learning techniques from image processing. Operating on speech spectrograms, the parallel CNN and Transformer branches extract local and global characteristics of the speech signal, respectively. To address data imbalance across age–gender categories, a spectrogram frame-shift strategy is adopted, which expands the training set and enhances robustness. The impact of gender on speech age recognition is also discussed, and a single system that recognizes both age and gender is implemented. Tests on the English subset of the Common Voice dataset yield an average accuracy of 84.9%, confirming the efficacy of the proposed model.
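
The abstract does not spell out how the spectrogram frame-shift strategy works. One plausible reading, offered here only as a minimal sketch, is that fixed-length segments are cut from each spectrogram with a sliding window, and that a smaller shift between windows yields more (overlapping) training samples for under-represented age–gender categories. The function name `frame_shift_segments`, the list-of-frames representation, and the window and shift values are all assumptions for illustration and are not taken from the paper.

```python
from typing import List, Sequence

# Hypothetical representation: a spectrogram is a sequence of frames,
# each frame being a sequence of frequency-bin magnitudes.
Frame = Sequence[float]


def frame_shift_segments(
    spectrogram: Sequence[Frame], window: int, shift: int
) -> List[List[Frame]]:
    """Cut fixed-length segments, advancing `shift` frames between segments.

    A shift smaller than the window produces overlapping segments, so a
    single utterance contributes more training samples.
    """
    if window <= 0 or shift <= 0:
        raise ValueError("window and shift must be positive")
    segments = []
    last_start = len(spectrogram) - window
    for start in range(0, last_start + 1, shift):
        segments.append(list(spectrogram[start:start + window]))
    return segments


# Toy spectrogram: 10 frames with 3 frequency bins each.
toy = [[float(t)] * 3 for t in range(10)]

# Assumed usage: a large shift for well-represented categories and a
# small shift for under-represented ones (values chosen arbitrarily).
majority = frame_shift_segments(toy, window=4, shift=4)
minority = frame_shift_segments(toy, window=4, shift=1)
```

With these toy values, the same 10-frame input contributes more segments when the shift is smaller, which is the sense in which such a scheme could expand the training set for rare categories. Whether the authors vary the shift per category is not stated in the abstract.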


This article is licensed under a Creative Commons [Attribution-NoDerivatives 4.0 International] license (https://creativecommons.org/licenses/by-nd/4.0/).
The journal is fully Open Access under Creative Commons licenses and all articles are free to access at JACIII official website.
https://www.fujipress.jp/jaciii/jc-about/#https://creativecommons.org/licenses/by-nd
