
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 13 Issue: 02 | Feb 2026

p-ISSN: 2395-0072

www.irjet.net

A Hybrid Ensemble Framework with Cross-Validation and Hyperparameter Optimization for Robust Spam Email Classification

Hemangi H Joshi¹, Nehal N Kalani²

¹Assistant Professor, Information Technology Department, VVP Engineering College, Gujarat, India
²Assistant Professor, Information Technology Department, VVP Engineering College, Gujarat, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Spam messages continue to threaten digital communication platforms by affecting user productivity and system security. Traditional single-classifier approaches often struggle to maintain high generalization performance across diverse spam patterns. This study proposes a hybrid ensemble framework for robust spam email classification by integrating TF-IDF feature extraction with hyperparameter-optimized machine learning models [1]. The SMS Spam Collection dataset containing 5,572 labelled messages was used for experimental evaluation. Logistic Regression, Support Vector Machine (SVM), and an optimized Random Forest classifier were combined using a soft voting ensemble strategy [2][3]. Hyperparameter tuning was performed using GridSearchCV with 5-fold cross-validation to enhance model reliability. Performance evaluation was conducted using Accuracy, Precision, Recall, F1-score, ROC-AUC, and cross-validation metrics. Experimental results show that the proposed Hybrid Voting Ensemble achieved the highest accuracy of 98.57% and a cross-validation score of 98.13%, outperforming individual models. The findings confirm that ensemble learning improves predictive consistency and classification robustness for spam detection tasks while maintaining computational efficiency suitable for real-time deployment.

Key Words: Spam Detection, Ensemble Learning, TF-IDF, Random Forest, SVM, Cross-Validation, Machine Learning.

1. INTRODUCTION

Electronic communication platforms have become integral to modern society, enabling instant information exchange across the globe. However, the rapid growth of digital communication has led to a significant increase in unsolicited and malicious messages commonly referred to as spam. Spam messages reduce user productivity and pose security risks through phishing attacks, malware distribution, and fraudulent schemes.

Traditional spam detection systems relied on rule-based filtering mechanisms. Although effective in early stages, these approaches lack adaptability to evolving spam patterns. Machine learning-based techniques have gained prominence due to their ability to automatically learn discriminative features from textual data. Algorithms such as Logistic Regression, Support Vector Machines (SVM), and Decision Trees have been widely applied in spam classification tasks. Despite their effectiveness, individual classifiers often face limitations in balancing precision and recall. Ensemble learning techniques address this limitation by combining multiple classifiers to improve predictive performance and robustness [4].

This research proposes a hybrid ensemble framework integrating Logistic Regression, Support Vector Machine, and an optimized Random Forest classifier using a soft voting mechanism [8]. The study incorporates TF-IDF feature extraction, hyperparameter optimization using GridSearchCV, and 5-fold cross-validation to ensure model reliability and generalization capability. The objectives of this research are:

• To evaluate the performance of individual classifiers for spam detection.
• To optimize Random Forest parameters using hyperparameter tuning.
• To design a hybrid ensemble framework for improved classification performance.
• To analyze model robustness using cross-validation and ROC-AUC metrics.

2. LITERATURE REVIEW

Spam detection has been extensively studied in machine learning and text classification research. Early studies focused on Naïve Bayes classifiers due to their simplicity and efficiency in probabilistic text modelling.

Support Vector Machines demonstrated strong performance in high-dimensional text classification tasks due to their margin maximization capability [2]. Random Forest classifiers improved robustness by aggregating multiple decision trees, thereby reducing overfitting and enhancing predictive stability [3]. Recent research emphasizes ensemble learning techniques such as boosting and voting mechanisms to improve generalization performance [4]. Boosting algorithms iteratively improve weak learners, while voting-based ensembles combine predictions from heterogeneous classifiers.
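The soft voting mechanism referred to in this paper can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `soft_vote` and `hard_vote` are hypothetical, and the per-classifier probabilities are invented for a single example message rather than taken from the reported experiments. The sketch only shows how averaging class probabilities (soft voting) can reach a different decision than counting predicted labels (hard voting). In scikit-learn, the comparable facility is `VotingClassifier` with `voting="soft"`.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

# One classifier's output for one message: class label -> probability.
Proba = Dict[str, float]


def soft_vote(probas: Sequence[Proba],
              weights: Optional[Sequence[float]] = None) -> Proba:
    """Return the (optionally weighted) mean probability per class."""
    if not probas:
        raise ValueError("at least one classifier output is required")
    if weights is None:
        weights = [1.0] * len(probas)
    if len(weights) != len(probas):
        raise ValueError("one weight per classifier is required")
    total = float(sum(weights))
    return {
        label: sum(w * p[label] for w, p in zip(weights, probas)) / total
        for label in probas[0]
    }


def hard_vote(probas: Sequence[Proba]) -> str:
    """Return the label predicted by the majority of classifiers."""
    votes = Counter(max(p, key=p.get) for p in probas)
    return votes.most_common(1)[0][0]


# Hypothetical outputs of the three base models for one message.
lr_proba = {"ham": 0.55, "spam": 0.45}   # Logistic Regression, weakly ham
svm_proba = {"ham": 0.60, "spam": 0.40}  # SVM, weakly ham
rf_proba = {"ham": 0.05, "spam": 0.95}   # Random Forest, strongly spam

outputs = [lr_proba, svm_proba, rf_proba]
avg = soft_vote(outputs)
soft_label = max(avg, key=avg.get)
hard_label = hard_vote(outputs)

print("averaged probabilities:", {k: round(v, 2) for k, v in avg.items()})
print("soft voting label:", soft_label)
print("hard voting label:", hard_label)
```

In this invented case the averaged spam probability is (0.45 + 0.40 + 0.95) / 3 = 0.60, so soft voting labels the message as spam, whereas hard voting follows the two weakly confident classifiers and labels it as ham. Soft voting therefore lets a confident base model outweigh several uncertain ones, which is one plausible reason to prefer it when the base classifiers are heterogeneous.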

Impact Factor value: 8.315

ISO 9001:2008 Certified Journal
