
International Research Journal of Engineering and Technology (IRJET) Volume: 12 Issue: 10 | Oct 2025

www.irjet.net

e-ISSN: 2395-0056 p-ISSN: 2395-0072

MoodBeats: Engineering Challenges and Solutions in Real-Time Multimodal Emotion Recognition for Music Recommendation

Manish Bhavar¹, Tanvi Dogmane², Vishal Taskar³, Dr. Vipin Borole⁴

¹,²,³ Student, MCA In Management, MET’s Institute of Management, BKC, Nashik, Maharashtra, India
⁴ Assistant Prof., Dept. of MCA, MET’s Institute of Management, BKC, Nashik, Maharashtra, India

Abstract - Multimodal emotion recognition for music recommendation has emerged as a well-established research area, attracting significant interest in recent years. This paper presents MoodBeats, an engineering implementation designed to address key challenges in deploying real-time multimodal emotion recognition within practical music streaming environments. Rather than introducing new model architectures, this work emphasizes engineering optimization to meet real-world constraints. The system integrates three established modalities: facial expression analysis using convolutional neural networks, voice emotion recognition through spectral feature extraction, and text sentiment analysis using transformer-based models. Their outputs are combined within an attention-based fusion framework. The proposed approach targets sub-150 millisecond end-to-end latency, robustness under varying environmental conditions, and privacy-preserving local data processing. Additional engineering efforts include efficient API integration and quota management for streaming services. Experimental evaluations show a consistent end-to-end latency of approximately 127 milliseconds across diverse conditions, validating the real-time performance of the system. Comparative studies show an improvement in user satisfaction scores from 3.8 out of 5.0 for baseline unimodal systems to 4.3 out of 5.0 for MoodBeats. The results highlight the practical significance of systematic engineering in balancing accuracy, latency, and robustness for multimodal emotion recognition. This work contributes insights into the deployment of multimodal emotion-aware systems for intelligent, user-centric music recommendation applications.

Key Words: Multimodal Systems Engineering, Real-time Processing, Emotion Recognition Implementation, Music Recommendation, Privacy-Preserving Computing, API Integration

1. INTRODUCTION

1.1 Context and Motivation

The field of multimodal emotion recognition has evolved rapidly, with numerous research efforts demonstrating the effectiveness of combining facial expressions, voice analysis, and text sentiment to understand human emotional states [1]. Recent surveys confirm that multimodal approaches consistently outperform unimodal alternatives, establishing this as a mature research direction with well-documented benefits [2]. Integrating such systems with music recommendation platforms is a natural application that has attracted both academic and commercial interest. However, a significant gap exists between research prototypes and deployable systems. While the theoretical foundations and architectural approaches are well established, the practical challenges of real-time deployment in streaming applications remain inadequately addressed. These challenges include strict latency requirements, environmental robustness, privacy considerations, and integration with commercial APIs that impose significant constraints on system design.

1.2 Problem Statement and Engineering Focus

This paper addresses the engineering challenges of deploying multimodal emotion recognition systems for music recommendation, specifically the transition from research prototype to practical implementation. Unlike previous work that proposes new architectural approaches, our focus is on solving the technical problems that emerge when established techniques must meet the demanding requirements of real-world deployment. The primary engineering challenges we address are:

1. Latency Optimization: Achieving the sub-150 ms end-to-end processing time required for a seamless user experience.
2. Environmental Robustness: Handling varying lighting conditions, background noise, and partial occlusions that compromise individual modalities.
3. Privacy-Preserving Processing: Implementing local processing architectures that minimize transmission of sensitive biometric data.
4. API Integration and Quota Management: Working within the constraints of commercial music streaming APIs.
5. Scalability and Resource Management: Optimizing for deployment on consumer hardware with limited computational resources.

Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page 581
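The abstract states that the facial, voice, and text outputs are combined in an attention-based fusion framework, and Section 1.2 notes that lighting, noise, or occlusion can compromise individual modalities. The paper does not give the fusion equations, so the following is only a minimal sketch under assumed details: each modality emits a probability vector over a shared label set plus a scalar relevance score (the hypothetical fields `probs` and `score`), attention weights are a softmax over the scores of the modalities that actually produced a result, and a missing modality (for example, no face detected) is masked out rather than zero-filled. The label set `EMOTIONS` is likewise hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical shared label set; the paper does not list its emotion classes.
EMOTIONS = ("happy", "sad", "angry", "calm")


@dataclass
class ModalityOutput:
    """Output of one unimodal recognizer (face, voice, or text)."""
    name: str
    probs: Optional[Sequence[float]]  # None when the modality is unavailable
    score: float                      # unnormalized attention logit (assumed)


def fuse(outputs: Sequence[ModalityOutput]) -> tuple[list[float], dict[str, float]]:
    """Softmax-attention fusion restricted to modalities that produced a result."""
    available = [o for o in outputs if o.probs is not None]
    if not available:
        raise ValueError("no modality produced a prediction")
    # Softmax over the available scores; subtracting the max avoids overflow.
    top = max(o.score for o in available)
    exps = {o.name: math.exp(o.score - top) for o in available}
    total = sum(exps.values())
    weights = {name: e / total for name, e in exps.items()}
    # Attention-weighted sum of the per-modality probability vectors.
    fused = [0.0] * len(EMOTIONS)
    for o in available:
        if len(o.probs) != len(EMOTIONS):
            raise ValueError(f"{o.name}: probability vector does not match label set")
        for i, p in enumerate(o.probs):
            fused[i] += weights[o.name] * p
    return fused, weights


if __name__ == "__main__":
    outputs = [
        ModalityOutput("face", None, 1.2),  # face occluded: excluded from attention
        ModalityOutput("voice", [0.6, 0.1, 0.1, 0.2], 0.5),
        ModalityOutput("text", [0.5, 0.2, 0.0, 0.3], 0.5),
    ]
    fused, weights = fuse(outputs)
    label = EMOTIONS[max(range(len(fused)), key=fused.__getitem__)]
    print(label, weights)
```

Because the softmax is renormalized over the remaining modalities, the fused output is still a valid probability distribution when one input drops out, which is one plausible way to obtain the environmental robustness described in challenge 2.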
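Challenge 1 sets a sub-150 ms end-to-end target, and the abstract reports a measured latency of about 127 ms. The paper does not describe how latency is measured or divided across stages, so the sketch below assumes a simple approach: time each pipeline stage with `time.perf_counter` and check the total against a 150 ms budget. The stage names and the sleeping stand-in functions are hypothetical placeholders for the real models.

```python
import time
from typing import Callable, Dict, Tuple

BUDGET_MS = 150.0  # end-to-end target stated in the paper


def run_with_budget(stages: Dict[str, Callable[[], object]],
                    budget_ms: float = BUDGET_MS) -> Tuple[Dict[str, float], float, bool]:
    """Run stages in insertion order; return per-stage ms, total ms, and a within-budget flag."""
    timings: Dict[str, float] = {}
    start = time.perf_counter()
    for name, stage in stages.items():
        t0 = time.perf_counter()
        stage()
        timings[name] = (time.perf_counter() - t0) * 1000.0
    total_ms = (time.perf_counter() - start) * 1000.0
    return timings, total_ms, total_ms <= budget_ms


if __name__ == "__main__":
    # Hypothetical stand-ins that only sleep; real stages would run inference.
    stages = {
        "face": lambda: time.sleep(0.02),
        "voice": lambda: time.sleep(0.02),
        "text": lambda: time.sleep(0.01),
        "fusion": lambda: None,
    }
    timings, total, ok = run_with_budget(stages)
    print({k: round(v, 1) for k, v in timings.items()}, round(total, 1), ok)
```

Running the three unimodal stages one after another, as here, gives a conservative estimate; executing them concurrently would bring the end-to-end time closer to the slowest stage plus fusion, at the cost of higher peak resource use.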
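Challenge 3 calls for local processing that minimizes transmission of sensitive biometric data. One assumed way to realize this, sketched below, is to keep raw camera frames, audio, and text on the device and send only a small fixed-schema payload (the fused emotion label and a rounded confidence) to the recommendation backend. The `EmotionPayload` type, its field names, and the label set are hypothetical; the paper does not specify what MoodBeats transmits.

```python
import json
from dataclasses import asdict, dataclass
from typing import Sequence

EMOTIONS = ("happy", "sad", "angry", "calm")  # hypothetical label set


@dataclass(frozen=True)
class EmotionPayload:
    """The only data that leaves the device: no images, audio, or raw text."""
    emotion: str
    confidence: float


def to_payload(fused_probs: Sequence[float]) -> str:
    """Reduce a locally computed probability vector to a minimal JSON payload."""
    if len(fused_probs) != len(EMOTIONS):
        raise ValueError("probability vector does not match label set")
    best = max(range(len(fused_probs)), key=lambda i: fused_probs[i])
    payload = EmotionPayload(EMOTIONS[best], round(fused_probs[best], 2))
    return json.dumps(asdict(payload))


if __name__ == "__main__":
    # Raw inputs (camera frame, microphone buffer, typed text) never reach this call.
    print(to_payload([0.55, 0.15, 0.05, 0.25]))
```

Fixing the schema in a frozen dataclass makes it explicit which fields can cross the network boundary, which keeps the privacy guarantee easy to audit.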
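Challenge 4 concerns working within the quotas of commercial streaming APIs. The paper does not state which rate-limiting scheme MoodBeats uses; a client-side token bucket is one common technique and is sketched below under that assumption. The capacity and refill rate are placeholder values, not limits published by any streaming service, and the clock is injectable so the behaviour can be tested deterministically.

```python
import time
from typing import Callable


class TokenBucket:
    """Client-side token bucket: bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the elapsed time, never exceeding capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise leave the bucket unchanged."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # Placeholder limits: burst of 5 requests, 1 request per second sustained.
    fake_now = [0.0]
    bucket = TokenBucket(capacity=5, rate=1.0, clock=lambda: fake_now[0])
    print([bucket.try_acquire() for _ in range(6)])  # five True, then False
    fake_now[0] += 2.0  # two seconds later: two tokens refilled
    print([bucket.try_acquire() for _ in range(3)])  # True, True, False
```

When `try_acquire` returns `False`, a caller could serve a cached recommendation or defer the request instead of calling the API; that fallback policy is likewise an assumption rather than a documented MoodBeats behaviour.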
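Challenge 5 concerns deployment on consumer hardware with limited compute. The paper does not describe its scheduling strategy; one assumed technique, sketched below, is to run the facial model on only every n-th camera frame and adapt n so that the measured processing cost per analysed frame fits within the time spanned by the skipped frames. The class `AdaptiveFrameSampler`, the 30 fps frame interval, and the stride bounds are hypothetical.

```python
class AdaptiveFrameSampler:
    """Process every `stride`-th frame, adjusting the stride from measured processing time."""

    def __init__(self, frame_interval_ms: float, min_stride: int = 1, max_stride: int = 10) -> None:
        self.frame_interval_ms = frame_interval_ms  # time between consecutive camera frames
        self.min_stride = min_stride
        self.max_stride = max_stride
        self.stride = min_stride
        self._count = 0

    def should_process(self) -> bool:
        """Return True for frames that should be sent to the face model."""
        process = self._count % self.stride == 0
        self._count += 1
        return process

    def report(self, processing_ms: float) -> None:
        """Widen the stride if one pass overruns the frames it covers; narrow it if it fits in fewer."""
        if processing_ms > self.stride * self.frame_interval_ms:
            self.stride = min(self.max_stride, self.stride + 1)
        elif processing_ms < (self.stride - 1) * self.frame_interval_ms:
            self.stride = max(self.min_stride, self.stride - 1)


if __name__ == "__main__":
    sampler = AdaptiveFrameSampler(frame_interval_ms=1000 / 30)  # assumed 30 fps camera
    sampler.report(50.0)  # one face-model pass took 50 ms
    print(sampler.stride)  # 2
```

The gap between the widen and narrow thresholds acts as simple hysteresis, so a stable processing time settles on one stride instead of oscillating between two.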