
International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 05 | May 2023

p-ISSN: 2395-0072

www.irjet.net

Email Spam Detection Using Machine Learning

Prof. Sakshi Shejole, Tamboli Abdul Salam, Manish Kumar Gupta, Krishna Sharma, Safwan Attar

ALARD COLLEGE OF ENGINEERING & MANAGEMENT (ALARD Knowledge Park, Survey No. 50, Marunje, Near Rajiv Gandhi IT Park, Hinjewadi, Pune-411057) Approved by AICTE. Recognized by DTE. NAAC Accredited. Affiliated to SPPU (Pune University).

Abstract - Email spam has become a significant challenge in today's digital landscape, leading to productivity losses, privacy breaches, and increased cybersecurity risks. This abstract presents a novel approach to combating email spam using machine learning and the TF-IDF (Term Frequency-Inverse Document Frequency) technique from natural language processing (NLP).

Key Words: (Machine Learning, Naive Bayes, SVM, Decision Tree, Random Forest, KNN, TF-IDF)

1. INTRODUCTION

Email spam has become a pervasive issue, inundating inboxes with unsolicited and potentially malicious messages. Detecting and filtering out spam emails is crucial to ensure data security, privacy, and productivity. Machine learning, combined with natural language processing (NLP) techniques, provides an effective approach to tackle this problem. This project aims to develop an email spam detection system utilizing the TF-IDF (Term Frequency-Inverse Document Frequency) NLP technique to accurately classify emails as spam or non-spam.

Why TF-IDF Technique: The TF-IDF technique is a widely adopted NLP method for feature extraction and text representation. TF-IDF calculates a weight for each term in a document based on two factors: term frequency (TF) and inverse document frequency (IDF). TF measures how frequently a term appears within a specific document, while IDF quantifies the rarity of a term across the entire dataset. By combining these factors, TF-IDF captures the importance of terms in distinguishing between spam and legitimate emails. In this project, the TF-IDF technique will be applied to convert email text into numerical feature vectors.

DATASET: For this project, a publicly available dataset from Kaggle will be utilized. Kaggle is a popular platform for data science and machine learning competitions, and it provides a diverse range of datasets for various domains. The selected dataset will consist of labelled emails, including both spam and non-spam instances, allowing us to train and evaluate our email spam detection system effectively.

Train and Test datasets: To train and evaluate the machine learning models, the Kaggle dataset will be divided into two subsets: a training set and a test set. The training set will be used to train the models on labelled email examples, enabling them to learn patterns and features indicative of spam emails. The test set, on the other hand, will be used to assess the models' performance and determine their accuracy in classifying unseen email instances.

2. MACHINE LEARNING CLASSIFICATION ALGORITHMS

Several machine learning classification algorithms will be explored for email spam detection using the TF-IDF features. The algorithms to be considered include:

Support Vector Machines (SVM): SVM is a powerful algorithm that seeks to find an optimal hyperplane to separate spam and non-spam emails based on the TF-IDF features.

Random Forest: Random Forest constructs an ensemble of decision trees to make predictions. It can effectively handle high-dimensional feature spaces and provide accurate email spam classification.

k-Nearest Neighbors (k-NN): k-NN classifies emails based on the similarity of their TF-IDF feature vectors to the vectors of the labeled examples in the training set.

Decision Tree: Decision trees use a hierarchical structure of nodes to make decisions. They can capture important features for email spam classification based on the TF-IDF values.

Multinomial Naive Bayes (MultinomialNB): MultinomialNB is a probabilistic algorithm that models the conditional probability distribution of the TF-IDF features given the class labels. It can handle text-based data efficiently.

By applying these machine learning classification algorithms to the TF-IDF features extracted from the email dataset, we aim to build a robust and accurate email spam detection system capable of differentiating between spam and legitimate emails, thereby improving email security and user experience.

Impact Factor value: 8.226

ISO 9001:2008 Certified Journal

Page 1474
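The workflow described in this paper (TF-IDF feature extraction, a train/test split, and classification by similarity of TF-IDF vectors) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, a hypothetical eight-message toy corpus in place of the Kaggle dataset, and k-NN with cosine similarity as the one classifier shown out of the five listed. The smoothed IDF formula, the function names, the 25% test fraction, and k = 3 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus (NOT the Kaggle dataset used in the paper).
EMAILS = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("claim free cash offer now", "spam"),
    ("limited offer win cash", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
    ("project meeting notes attached", "ham"),
]

def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()

def fit_idf(docs):
    """Compute a smoothed IDF per term: ln((1 + n) / (1 + df)) + 1 (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf_vector(doc, idf):
    """Map a document to an L2-normalised sparse TF-IDF vector (a dict)."""
    tf = Counter(t for t in tokenize(doc) if t in idf)  # terms unseen in training are dropped
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(vec, train_vecs, train_labels, k=3):
    """Majority vote among the k training vectors most similar to vec."""
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(vec, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle with a fixed seed and hold out a fraction of the data as the test set."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(EMAILS)
idf = fit_idf([text for text, _ in train])  # IDF is fitted on the training set only
train_vecs = [tfidf_vector(text, idf) for text, _ in train]
train_labels = [label for _, label in train]

predictions = [
    knn_predict(tfidf_vector(text, idf), train_vecs, train_labels) for text, _ in test
]
accuracy = sum(p == y for p, (_, y) in zip(predictions, test)) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The IDF weights are fitted on the training messages only, so the test messages stay unseen until evaluation, as the train/test description requires. On a corpus this small the accuracy figure carries no meaning; the sketch only shows how the steps connect.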