Speech Emotion Recognition Using Hybrid Deep Learning and Ensemble Approaches

International Journal of Electronics and Communication Engineering
© 2025 by SSRG - IJECE Journal
Volume 12, Issue 1
Year of Publication: 2025
Authors: I. Manolekshmi, M.A. Mukunthan
How to Cite?
I. Manolekshmi, M.A. Mukunthan, "Speech Emotion Recognition Using Hybrid Deep Learning and Ensemble Approaches," SSRG International Journal of Electronics and Communication Engineering, vol. 12, no. 1, pp. 216-235, 2025. Crossref, https://doi.org/10.14445/23488549/IJECE-V12I1P117
Abstract:
Speech Emotion Recognition (SER) is the technique of recognizing and classifying the emotions expressed in spoken language using audio features. Effective human-computer interaction requires machines that can accurately perceive and respond to human emotions, yet numerous challenges, such as capturing both the spatial and the temporal features of speech signals, limit the accuracy of emotion recognition models. Conventional systems depend heavily on manual feature extraction and classification, which require significant effort and often lead to detection errors. Advances in image processing and Artificial Intelligence (AI) have introduced hybrid Deep Learning (DL) approaches that improve SER. This study developed an efficient SER system that combines a hybrid DL model with an ensemble approach to classify emotions expressed through speech accurately. The models were evaluated on the CREMA-D dataset, which contains 7,442 audio samples across six emotions. After preprocessing and data augmentation, Mel Frequency Cepstral Coefficients (MFCC) were extracted as features from the speech data. The proposed CNN-LSTM and CNN-GRU models capture both spatial and temporal features, and their outputs were combined through an ensemble learning approach with a Support Vector Machine (SVM) classifier as the meta-learner. Experimental results indicate that the proposed model achieved an accuracy of 98.69%, a precision of 98.70%, a recall of 98.72%, and an F1 score of 98.70%. These results highlight the effectiveness of combining advanced neural networks for high-performance emotion detection from speech signals and offer valuable guidance for developing real-time emotion recognition systems and enhancing human-computer interaction.
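Since the abstract describes the pipeline only at a high level (MFCC features, CNN-LSTM and CNN-GRU base models, SVM meta-learner) and the paper publishes no code, the sketch below is a minimal illustration of that stacking pipeline, not the authors' implementation. It assumes librosa for feature extraction, Keras for the hybrid networks (layer widths, kernel sizes, and the 3-second clip length are placeholder choices), and scikit-learn's SVC for the ensemble stage.

```python
import numpy as np
import librosa
from sklearn.svm import SVC
from tensorflow.keras import layers, models

def extract_mfcc(path, n_mfcc=40, sr=22050, duration=3.0):
    """Load one clip and return a fixed-size (frames, n_mfcc) MFCC matrix."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    frames = 1 + int(duration * sr) // 512                  # librosa's default hop_length is 512
    if mfcc.shape[1] < frames:                              # zero-pad clips shorter than `duration`
        mfcc = np.pad(mfcc, ((0, 0), (0, frames - mfcc.shape[1])))
    return mfcc[:, :frames].T                               # time-major for the recurrent layer

def build_cnn_rnn(input_shape, n_classes=6, rnn="lstm"):
    """1D CNN front end for local spectral patterns, then an RNN for temporal context."""
    recurrent = layers.LSTM(128) if rnn == "lstm" else layers.GRU(128)
    model = models.Sequential([
        layers.Input(shape=input_shape),                    # (frames, n_mfcc)
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling1D(2),
        recurrent,
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

def meta_features(base_models, X):
    """Stack each base model's class probabilities into features for the meta-learner."""
    return np.hstack([m.predict(X, verbose=0) for m in base_models])

# Hypothetical usage: X_* are arrays of shape (N, frames, n_mfcc), y_* integer labels 0..5.
# cnn_lstm = build_cnn_rnn(X_train.shape[1:], rnn="lstm")
# cnn_gru  = build_cnn_rnn(X_train.shape[1:], rnn="gru")
# for m in (cnn_lstm, cnn_gru):
#     m.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
# svm = SVC(kernel="rbf")                                   # meta-learner over stacked probabilities
# svm.fit(meta_features([cnn_lstm, cnn_gru], X_train), y_train)
# y_pred = svm.predict(meta_features([cnn_lstm, cnn_gru], X_test))
```

Feeding the base models' softmax probabilities to the SVM is one common realization of stacking; the meta-features and training schedule used in the paper may differ.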
Keywords:
Speech emotion recognition, Support vector machine, MFCC, Convolutional neural network, CREMA dataset, Ensemble learning.
References:
[1] Abdul Malik Badshah et al., “Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network,” International Conference on Platform Technology and Service, Busan, Korea, pp. 1-5, 2017.
[2] Dias Issa, M. Fatih Demirci, and Adnan Yazici, “Speech Emotion Recognition with Deep Convolutional Neural Networks,” Biomedical Signal Processing and Control, vol. 59, 2020.
[3] Abdelaziz A. Abdelhamid et al., “Robust Speech Emotion Recognition Using CNN+LSTM Based on Stochastic Fractal Search Optimization Algorithm,” IEEE Access, vol. 10, pp. 49265-49284, 2022.
[4] Kishor Bhangale, and Mohanaprasad Kothandaraman, “Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network,” Electronics, vol. 12, no. 4, pp. 1-17, 2023.
[5] Samuel Kakuba, Alwin Poulose, and Dong Seog Han, “Attention-Based Multi-Learning Approach for Speech Emotion Recognition with Dilated Convolution,” IEEE Access, vol. 10, pp. 122302-122313, 2022.
[6] Arya Aftab et al., “Light-Sernet: A Lightweight Fully Convolutional Neural Network for Speech Emotion Recognition,” IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, pp. 6912-6916, 2022.
[7] Apeksha Aggarwal et al., “Two-Way Feature Extraction for Speech Emotion Recognition Using Deep Learning,” Sensors, vol. 22, no. 6, pp. 1-11, 2022.
[8] Lu-Qiao Li et al., “Emotion Recognition from Speech with StarGAN and Dense-DCNN,” IET Signal Processing, vol. 16, no. 1, pp. 62-79, 2022.
[9] Mustaqeem, and Soonil Kwon, “MLT-DNet: Speech Emotion Recognition Using 1D Dilated CNN Based on Multi-Learning Trick Approach,” Expert Systems with Applications, vol. 167, 2021.
[10] Ziping Zhao et al., “Combining a Parallel 2D CNN With a Self-Attention Dilated Residual Network for CTC-Based Discrete Speech Emotion Recognition,” Neural Networks, vol. 141, pp. 52-60, 2021.
[11] Ammar Amjad, Lal Khan, and Hsien-Tsung Chang, “Effect on Speech Emotion Classification of a Feature Selection Approach Using a Convolutional Neural Network,” PeerJ Computer Science, vol. 7, pp. 1-28, 2021.
[12] Orhan Atila, and Abdulkadir Şengür, “Attention Guided 3D CNN-LSTM Model for Accurate Speech-Based Emotion Recognition,” Applied Acoustics, vol. 182, 2021.
[13] Turker Tuncer, Sengul Dogan, and U. Rajendra Acharya, “Automated Accurate Speech Emotion Recognition System Using Twine Shuffle Pattern and Iterative Neighborhood Component Analysis Techniques,” Knowledge-Based Systems, vol. 211, 2021.
[14] Jianyou Wang et al., “Speech Emotion Recognition with Dual-Sequence LSTM Architecture,” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 6474-6478, 2020.
[15] Anish Nediyanchath, Periyasamy Paramasivam, and Promod Yenigalla, “Multi-Head Attention for Speech Emotion Recognition with Auxiliary Learning of Gender Recognition,” ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, pp. 7179-7183, 2020.
[16] Zengwei Yao et al., “Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-based Classifiers: HSF-DNN, MS-CNN and LLD-RNN,” Speech Communication, vol. 120, pp. 11-19, 2020.
[17] Mustaqeem, Muhammad Sajjad, and Soonil Kwon, “Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM,” IEEE Access, vol. 8, pp. 79861-79875, 2020.
[18] Misbah Farooq et al., “Impact of Feature Selection Algorithm on Speech Emotion Recognition Using Deep Convolutional Neural Network,” Sensors, vol. 20, no. 21, pp. 1-18, 2020.
[19] Crowd Sourced Emotional Multimodal Actors Dataset (CREMA-D), Kaggle. [Online]. Available: https://www.kaggle.com/datasets/ejlok1/cremad
[20] Kharibam Jilenkumari Devi, Ayekpam Alice Devi, and Khelchandra Thongam, “Automatic Speaker Recognition Using MFCC and Artificial Neural Network,” International Journal of Innovative Technology and Exploring Engineering, vol. 9, pp. 39-42, 2019.
[21] Md. Zahangir Alom et al., “A State-of-the-Art Survey on Deep Learning Theory and Architectures,” Electronics, vol. 8, no. 3, pp. 1-66, 2019.
[22] M. Kalpana Chowdary, J. Anitha, and D. Jude Hemanth, “Emotion Recognition from EEG Signals Using Recurrent Neural Networks,” Electronics, vol. 11, no. 15, pp. 1-20, 2022.
[23] Iram Bibi et al., “A Dynamic DL-Driven Architecture to Combat Sophisticated Android Malware,” IEEE Access, vol. 8, pp. 129600-129612, 2020.