
Machine Learning and Deep Learning Approaches for Speech Emotion Recognition: A Survey


International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 12 Issue: 10 | Oct 2025

p-ISSN: 2395-0072

www.irjet.net

Jash Shah1, Labdhi Shah2, Anish Deshpande3, Puransh Kawdia4, Prof. Pramila M Chawan5

1 B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

2 B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

3 B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

4 B. Tech Student, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

5 Associate Professor, Dept. of Computer Engineering and IT, VJTI College, Mumbai, Maharashtra, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - This paper presents a comprehensive survey of systems for recognizing human emotions from speech signals. The ability of machines to understand emotional context is a critical step toward more natural and empathetic human-computer interaction (HCI). This survey explores the complete pipeline of modern Speech Emotion Recognition (SER) systems, beginning with a discussion of benchmark datasets and feature extraction techniques, particularly the use of Mel-spectrograms. We delve into a detailed analysis of various machine learning and deep learning models, highlighting the evolution from traditional classifiers to advanced architectures like Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and state-of-the-art hybrid models. The paper also investigates advanced techniques such as data augmentation with Generative Adversarial Networks (GANs) and the fusion of multiple modalities. We propose a robust SER system based on a hybrid CNN-BiLSTM architecture designed to achieve high accuracy by effectively modeling both the spectral and temporal characteristics of emotional speech.

1.2 Core Challenges in SER

Despite its potential, SER faces significant challenges. Emotions are subjective and can be expressed differently across cultures, genders, and contexts. The acoustic features that correlate with emotions are often subtle and intertwined with the linguistic content of speech. Furthermore, the scarcity of large-scale, realistically labeled emotional speech datasets remains a major bottleneck, often leading to models that perform well in a lab setting but fail in real-world, noisy environments. Several key challenges are consistently identified in the literature:

Ambiguity of Emotional Labels: There is no universally accepted standard for labeling emotions. Emotions are often blended (e.g., happily surprised) and do not have distinct boundaries, making it difficult to assign a single, discrete label to a speech utterance. This leads to inconsistencies during the data annotation process.

Data Scarcity and Quality: Most available datasets are "acted" rather than "spontaneous." While acted datasets are clean and balanced, they may not accurately reflect the subtlety of natural, real-world emotions. Spontaneous datasets are more realistic, but they are much harder to collect and to label accurately, and they are often imbalanced.

Environmental and Speaker Variability: Models trained on one dataset often fail to generalize to another due to differences in recording conditions, background noise, languages, and speaker characteristics. This problem, known as cross-corpus generalization, is a major hurdle for creating universally applicable SER systems.

Feature Engineering and Selection: While deep learning can learn features automatically, the choice of input representation is still critical. Using very
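Two of the challenges listed above, class imbalance and cross-corpus generalization, are commonly handled with simple bookkeeping before any model is trained. The sketch below is illustrative only and is not taken from the surveyed systems: it computes inverse-frequency class weights (the same heuristic scikit-learn calls "balanced") and generates leave-one-corpus-out splits, so that a model is always tested on a corpus it has never seen. The function names, the corpus names, and the toy sample tuples are all hypothetical.

```python
from collections import Counter, defaultdict


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare emotions receive larger weights, so a weighted loss does not
    simply favor the majority class.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


def leave_one_corpus_out(samples):
    """Yield (held_out_corpus, train, test) once per corpus.

    `samples` is a list of (corpus_name, utterance_id, label) tuples.
    The held-out corpus is used only for testing, which approximates
    the cross-corpus setting described above.
    """
    by_corpus = defaultdict(list)
    for sample in samples:
        by_corpus[sample[0]].append(sample)
    for held_out in sorted(by_corpus):
        test = by_corpus[held_out]
        train = [s for name in sorted(by_corpus) if name != held_out
                 for s in by_corpus[name]]
        yield held_out, train, test


# Hypothetical toy metadata; real corpora would supply these tuples.
samples = [
    ("corpus_a", "a1", "neutral"),
    ("corpus_a", "a2", "neutral"),
    ("corpus_a", "a3", "angry"),
    ("corpus_b", "b1", "neutral"),
    ("corpus_b", "b2", "happy"),
    ("corpus_b", "b3", "neutral"),
]

weights = inverse_frequency_weights([label for _, _, label in samples])
splits = list(leave_one_corpus_out(samples))
```

In this toy example the majority class "neutral" is down-weighted relative to "angry" and "happy", and each split trains on one corpus while testing on the other. Splitting by corpus rather than by random utterance is a design choice: a random split would mix recording conditions and speakers across train and test, which tends to overstate how well a model generalizes.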

Key Words: Deep Learning, Speech Emotion Recognition (SER), CNN-BiLSTM, RAVDESS Dataset, Feature Extraction, Data Augmentation.

1. INTRODUCTION

1.1 Background and Importance

Human speech is a complex signal rich with emotional information that complements its linguistic content. Speech Emotion Recognition (SER) is a rapidly advancing field at the intersection of digital signal processing and artificial intelligence, aiming to automatically identify the emotional state of a speaker from their voice. The development of accurate SER systems is paramount for the next generation of HCI. Applications are wide-ranging and impactful, including:

Smart Assistants: Enabling virtual assistants to respond more empathetically to a user's tone.

Healthcare: Monitoring the emotional well-being of patients in remote therapy or detecting stress and depression.

Automotive Safety: Detecting driver states like anger, drowsiness, or stress to prevent accidents.

© 2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 146

