
Research Article | Open Access
Volume 13 | Issue 4 | Year 2026 | Article Id. IJECE-V13I4P114 | DOI: https://doi.org/10.14445/23488549/IJECE-V13I4P114

MERMHS: A Multimodal Emotion Recognition Framework Using Probability-Based Late Fusion for Mental Health Monitoring


Yellamma Pachipala, Dhanush Vardhan Yalamati, Pavan Kumar Karubhuktha, Gayathri Jagarlamudi, Pavani Challa

Received: 11 Jan 2026 | Revised: 12 Feb 2026 | Accepted: 15 Mar 2026 | Published: 30 Apr 2026

Citation:

Yellamma Pachipala, Dhanush Vardhan Yalamati, Pavan Kumar Karubhuktha, Gayathri Jagarlamudi, Pavani Challa, "MERMHS: A Multimodal Emotion Recognition Framework Using Probability-Based Late Fusion for Mental Health Monitoring," International Journal of Electronics and Communication Engineering, vol. 13, no. 4, pp. 183-193, 2026. Crossref, https://doi.org/10.14445/23488549/IJECE-V13I4P114

Abstract

Mental health issues are more prevalent than ever, creating a need for robust, intelligent systems that can accurately identify human emotional states. Although artificial intelligence provides advanced methodologies, many existing systems struggle to achieve high accuracy in real-world emotion detection because of variations in facial expressions, background noise in speech signals, contextual ambiguity in textual inputs, and weak fusion techniques. The proposed Multimodal Emotion Recognition Mental Health System (MERMHS) aims to narrow this gap: a CNN performs video-based facial expression recognition, an LSTM performs audio emotion recognition from speech signals, and a Bi-LSTM performs text emotion recognition from textual inputs. Evaluated on the CMU-MOSEI dataset, the proposed MERMHS approach achieves an accuracy of 92.70%, a precision of 93.70%, a recall of 92.67%, and an F1-score of 93.10%. MERMHS outperforms existing approaches owing to its probability-based late fusion technique.
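The abstract does not give implementation details of the fusion step. As a minimal sketch, assuming each modality head (CNN, LSTM, Bi-LSTM) outputs a softmax probability vector over the same emotion classes, probability-based late fusion can be illustrated as a weighted average of those vectors followed by an argmax. The class list, probability values, and equal weights below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical emotion classes; the paper's actual label set may differ.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Illustrative per-modality softmax outputs for one sample.
p_video = np.array([0.05, 0.02, 0.03, 0.70, 0.10, 0.10])  # CNN on facial frames
p_audio = np.array([0.10, 0.05, 0.05, 0.55, 0.15, 0.10])  # LSTM on speech features
p_text  = np.array([0.08, 0.02, 0.05, 0.60, 0.15, 0.10])  # Bi-LSTM on transcript

def late_fuse(prob_vectors, weights=None):
    """Probability-based late fusion: weighted average of per-modality
    class-probability vectors, then argmax over the fused distribution."""
    probs = np.stack(prob_vectors)
    if weights is None:
        # Equal modality weights unless stated otherwise.
        weights = np.ones(len(prob_vectors)) / len(prob_vectors)
    fused = np.average(probs, axis=0, weights=weights)
    fused /= fused.sum()  # renormalize so the result is a valid distribution
    return fused, EMOTIONS[int(np.argmax(fused))]

fused, label = late_fuse([p_video, p_audio, p_text])
print(label)  # → happiness
```

Because each modality is trained and scored independently, this scheme degrades gracefully when one channel is noisy: a low-confidence (near-uniform) probability vector simply contributes little to the fused decision.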

Keywords

Facial Emotion Detection, Text Emotion Detection, Speech Emotion Detection, Multimodal Emotion Recognition, CNN, Bi-LSTM, LSTM.

References

  1. Qianhe Ouyang, “Speech Emotion Detection Based on MFCC and CNN-LSTM Architecture,” Applied and Computational Engineering, vol. 5, pp. 243-249, 2023.
  2. Shamin Bin Habib Avro, Taieba Taher, and Nursadul Mamun, “EmoTech: A Multi-Modal Speech Emotion Recognition Using Multi-Source Low-Level Information with Hybrid Recurrent Network,” 2024 IEEE International Conference on Signal Processing, Information, Communication and Systems, Khulna, Bangladesh, pp. 1-5, 2025.
  3. Guowei Zhong et al., “Calibrating Multimodal Consensus for Emotion Recognition,” arXiv preprint, vol. 14, no. 8, pp. 1-13, 2025.
  4. Shuo Zhang et al., “Multimodal Mixture of Low-Rank Experts for Sentiment Analysis and Emotion Recognition,” 2025 IEEE International Conference on Multimedia and Expo (ICME), Nantes, France, pp. 1-6, 2025.
  5. Chengyan Wu et al., “Multimodal Emotion Recognition in Conversations: A Survey of Methods, Trends, Challenges and Prospects,” arXiv preprint, pp. 1-18, 2025.
  6. Prashant Kumar Nag, Amit Bhagat, and R. Vishnu Priya, “Expanding AI’s Role in Healthcare Applications: A Systematic Review of Emotional and Cognitive Analysis Techniques,” IEEE Access, vol. 13, pp. 69129-69160, 2025.
  7. Puneet Kumar et al., “VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition,” IEEE Transactions on Affective Computing, vol. 16, no. 4, pp. 2881-2893, 2025.
  8. Yehun Song, and Sunyoung Cho, “Leveraging CLIP Encoder for Multimodal Emotion Recognition,” 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Tucson, AZ, USA, pp. 6115-6124, 2025.
  9. Konstantinos Mountzouris, Isidoros Perikos, and Ioannis Hatzilygeroudis, “Speech Emotion Recognition Using Convolutional Neural Networks with Attention Mechanism,” Electronics, vol. 12, no. 20, pp. 1-31, 2023.
  10. Utsav Poudel et al., “AI in Mental Health: A Review of Technological Advancements and Ethical Issues in Psychiatry,” Issues in Mental Health Nursing, vol. 46, no. 7, pp. 693-701, 2025.
  11. Feiyang Chen et al., “Complementary Fusion of Multi-Features and Multi-Modalities in Sentiment Analysis,” arXiv preprint, pp. 1-9, 2019.
  12. Sanmay Kotkar, “Real-Time Emotion Recognition with CNN and LSTM,” Preprints, pp. 1-8, 2025.
  13. Fazliddin Makhmudov, Alpamis Kultimuratov, and Young-Im Cho, “Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures,” Applied Sciences, vol. 14, no. 10, pp. 1-17, 2024.
  14. Dilnoza Mamieva et al., “Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features,” Sensors, vol. 23, no. 12, pp. 1-19, 2023.
  15. Trisha Mittal et al., “M3ER: Multiplicative Multimodal Emotion Recognition Using Facial, Textual, and Speech Cues,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 2, pp. 1359-1367, 2020.
  16. Zhijing Xu, and Yang Gao, “Research on Cross-Modal Emotion Recognition Based on Multi-Layer Semantic Fusion,” Mathematical Biosciences and Engineering, vol. 21, no. 2, pp. 2488-2514, 2024.
  17. Ngumimi Karen Iyortsuun et al., “A Review of Machine Learning and Deep Learning Approaches on Mental Health Diagnosis,” Healthcare, vol. 11, no. 3, pp. 1-27, 2023.
  18. Masab A. Mansoor, and Kashif H. Ansari, “Early Detection of Mental Health Crises through Artificial-Intelligence-Powered Social Media Analysis: A Prospective Observational Study,” Journal of Personalized Medicine, vol. 14, no. 9, pp. 1-11, 2024.
  19. Jinghui Qin et al., “Mental-Perceiver: Audio-Textual Multi-Modal Learning for Estimating Mental Disorders,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, pp. 1-9, 2025.
  20. Nafiseh Ghaffar Nia, Erkan Kaplanoglu, and Ahad Nasab, “Evaluation of Artificial Intelligence Techniques in Disease Diagnosis and Prediction,” Discover Artificial Intelligence, vol. 3, pp. 1-14, 2023.