A Survey on Spam Filtering Methods and Mapreduce with SVM


International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395 -0056

Volume: 04 Issue: 03 | Mar -2017

p-ISSN: 2395-0072

www.irjet.net

Dipika Somvanshi1, Prof. Kanchan Doke2

1 ME (Comp.) Student, Bharati Vidyapeeth College of Engineering, Kharghar, Navi Mumbai, India
2 Professor, Computer Dept., Bharati Vidyapeeth College of Engineering, Kharghar, India

Abstract - Spam is any unwanted and potentially harmful mail sent in bulk to a large number of recipients. Spam can be harmful because it may contain malware and links to phishing websites, so separating spam from normal mail into a separate folder is essential. Techniques for separating spam are word based, content based, machine learning based, and hybrid. Machine learning techniques are the most popular because of their high accuracy and mathematical foundation. SVM is a popular machine learning technique in spam filtering because it can handle data with a large number of attributes. However, SVM takes a long time to train and does not scale well to large datasets; these drawbacks can be reduced by introducing the MapReduce framework for SVM. MapReduce can process chunks of the input dataset file in parallel to train the SVM in less time. This paper surveys a few such spam filtering techniques and the scope for introducing MapReduce with SVM.

Key Words: Spam Filtering; Machine Learning Techniques; Naïve Bayes; KNN; Decision Trees; SVM; MapReduce

1. INTRODUCTION

The email system is one of the most widely used communication tools. Email is a quick means of communication because one does not have to wait for a response, and it is a straightforward way to stay in touch with everyone. One major threat to an email system is spam. Spam is unwanted mail sent in bulk by spammers for their own advantage. Spammers are groups of people who intend to spread malicious content, advertisements, and links to phishing websites through email. Spam overloads server bandwidth and storage, and it costs money and time to separate spam from ham (legitimate) emails. According to the email security provider SMX, the live spam percentage is about 79.5%, and the average size of a spam message is 16 KB [1]. Classifying emails into spam and ham is therefore an important issue.

© 2017, IRJET | Impact Factor value: 5.181

Spam filtering is important for separating such spam from important mail. Various spam filtering techniques exist in the literature. They are classified as machine learning based, content based, list based, and hybrid. Among them, machine learning techniques give more accurate results owing to their mathematical foundation. Machine learning techniques work with data mining algorithms and give more satisfactory results. For spam filtering, a filter is trained with an algorithm on a sample dataset of emails and then tested on a new sample of emails. Machine learning based spam classification algorithms include SVM, Naïve Bayes, KNN, and Decision Trees. Among these, Naïve Bayes classification and Support Vector Machines are the most used and appreciated by researchers. A number of freeware and paid spam filtering tools are also available, and they make use of these techniques as well.

Support Vector Machines (SVM) can be applied efficiently in spam filtering. SVM works with a kernel function and gives the most satisfactory results in spam filtering. SVM works best with a small input dataset, but its performance degrades as the dataset grows because training the filter takes a long time. This issue needs to be addressed. MapReduce can be used effectively for training on a large input dataset, since it can process large data in less time. The proposed system therefore uses MapReduce with SVM to classify a large set of emails into spam and ham. MapReduce with SVM trains faster than the traditional SVM algorithm.
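The idea of training an SVM on chunks of the dataset in parallel can be sketched as follows. This is a minimal, single-process illustration under stated assumptions, not the authors' implementation: it assumes a linear SVM trained by sub-gradient descent on the hinge loss, a map step that trains one model per chunk, and a reduce step that averages the per-chunk weights. The toy features (spam-word count, link count, constant bias term) and all function names are hypothetical.

```python
import random

def train_linear_svm(samples, epochs=200, eta=0.01, lam=0.01, seed=0):
    """Train a linear SVM on one chunk by sub-gradient descent on the hinge loss.

    samples: list of (features, label) pairs; label is +1 (spam) or -1 (ham).
    Returns (weights, number_of_samples) so the reducer can weight the average.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for x, y in order:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # L2 regularisation shrinks the weights a little on every step.
            w = [(1.0 - eta * lam) * wi for wi in w]
            # Hinge loss: only samples inside the margin move the weights.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w, len(samples)

def split_into_chunks(data, n_chunks):
    """Split the dataset into contiguous chunks, one per (hypothetical) map task."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_phase(chunks):
    """Map step: train an independent SVM on each chunk."""
    return [train_linear_svm(chunk, seed=i) for i, chunk in enumerate(chunks)]

def reduce_phase(partials):
    """Reduce step (assumed here): sample-weighted average of the chunk models."""
    total = sum(n for _, n in partials)
    dim = len(partials[0][0])
    return [sum(w[j] * n for w, n in partials) / total for j in range(dim)]

def predict(w, x):
    """Return +1 (spam) or -1 (ham) for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def make_toy_emails(n, seed):
    """Synthetic data: [spam-word count, link count, bias]; labels alternate."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        if i % 2 == 0:
            data.append(([rng.uniform(4, 6), rng.uniform(3, 5), 1.0], 1))
        else:
            data.append(([rng.uniform(0, 1), rng.uniform(0, 1), 1.0], -1))
    return data

if __name__ == "__main__":
    emails = make_toy_emails(200, seed=1)
    model = reduce_phase(map_phase(split_into_chunks(emails, 4)))
    print(predict(model, [5.0, 4.0, 1.0]), predict(model, [0.5, 0.5, 1.0]))
```

In a real MapReduce job each map task would run on a separate node over its own input split; here the map step is an ordinary loop so that the sketch stays self-contained. Weight averaging is only one possible reduce strategy, and works for linear models; a kernel SVM would need a different way of combining the chunk models.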

2. LITERATURE REVIEW OF MACHINE LEARNING TECHNIQUES

A. Clustering:

Clustering is used to separate objects into related classes called clusters. It groups objects or observations in such a way that objects in a group are more similar to each other than to those in other groups [2].

K-Nearest Neighbor: It is one of the simplest machine learning algorithms. KNN works with a 'characteristic vector'. The characteristic vectors are a measure of the similarities among all messages. In this algorithm, a new incoming email is classified on the basis

ISO 9001:2008 Certified Journal | Page 490
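The K-Nearest Neighbor description above is cut off in this copy, so the following is only a sketch of the textbook form of the algorithm, not a reconstruction of the missing text. It assumes each email has already been turned into a numeric characteristic vector, uses Euclidean distance as the similarity measure, and labels a new email by majority vote among its k closest training emails. The two-feature vectors and the function names are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two characteristic vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training emails.

    train: list of (vector, label) pairs; query: a vector of the same length.
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical characteristic vectors: [spam-word count, link count].
training_emails = [
    ([5, 4], "spam"), ([6, 5], "spam"), ([4, 3], "spam"),
    ([0, 1], "ham"), ([1, 0], "ham"), ([0, 0], "ham"),
]

if __name__ == "__main__":
    print(knn_classify(training_emails, [5, 5]))
    print(knn_classify(training_emails, [0.5, 0.5]))
```

KNN has no separate training step: every classification scans the stored emails, which is why its cost grows with the size of the dataset.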

