Detecting Phishing Websites Using Machine Learning

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 09 Issue: 12 | Dec 2022 www.irjet.net p-ISSN: 2395-0072

Adwait Changan1, Vaibhav Mahalle2, Prafull Patil3, Prabuddha Salve4

1, 2, 3, 4 Department of Information Technology, Sinhgad College of Engineering, Pune, India.

Abstract – Today's generation is increasingly reliant on web technology for a variety of tasks such as banking and communication. As a result, users can encounter multiple security threats, with phishing being one of the most serious and prominent attacks. Phishing attacks attempt to steal sensitive information from users by impersonating a legitimate entity. The attacker employs a phishing attack to obtain the victim's credentials, such as bank account numbers, passwords, or other sensitive information, by impersonating a genuine website while the victim remains unaware that the site is fraudulent. In this paper, a system is proposed to detect phishing sites in real time using machine learning, by utilizing a classifier that is trained on an exhaustive dataset with enriched features.

Key Words: Phishing, Cyber Security, Random Forest, Malicious URL detection, Machine learning in Cybersecurity

1. INTRODUCTION

Today, the majority of individual and organizational communication and interaction takes place via the internet, and this trend is expected to continue and grow. People are currently heavily reliant on web technology, and most are unaware of cyber threats due to a lack of technical knowledge. Numerous organizations and businesses have already been confronted with the threat and problem of cyber-attacks. Among the different types of cyber-attacks, phishing is a significant one that threatens online users' identities. Phishing attacks typically involve an attacker who acts as a credible resource in order to steal sensitive data from victims. Victims of successful attacks visit phishing websites without recognising them. Once on the website, users can provide private information such as passwords or banking-related sensitive information, and they are also at risk of downloading malware that the attacker has placed on the site.

This paper's main objective is to present a technique for identifying phishing websites. There are approximately five approaches to this problem: the Blacklisting strategy, the Rule-based or Heuristics-based approach, the Content-based approach, and the Machine Learning approach, which is then improved by the Hybrid approach [9]. The suggested method uses a model that is trained on the website's contents and URL-based features to determine whether the site is a legitimate website or a phishing website.

2. LITERATURE SURVEY

1) Mehek Thakera, Mihir Parikhb, Preetika Shettyc, Vinit Neogid, Shree Jaswale (2018)

This paper proposes a system that will detect old and newly generated phishing URLs using Data Mining. A cloud-based classifier is developed which takes features of a URL as input. The model will be deployed using a Chrome extension. The model will be trained with an exhaustive dataset and uses URL-based and Domain-based features to ensure maximum accuracy [1].

2) Srushti Patil, Sudhir Dhage (2019)

A comparative study of the important anti-phishing tools was completed and their limitations were pointed out. This paper analyzed the URL-based features used in the past and improved their definitions as per the current scenario [9]. The paper presents a full implementation of the anti-phishing tool, along with the accuracy and observations of the developed tool [9].

3) Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri

In this paper, a real-time anti-phishing system, using seven different classification algorithms and natural language processing (NLP) based features, is proposed [4]. The system has the following distinguishing properties: language independence, use of a huge size of phishing and legitimate data, real-time execution, detection of new websites, independence from third-party services, and use of feature-rich classifiers [4]. A new dataset is constructed for measuring the performance of the system, and the experimental results are tested on it. According to the experimental results from the implemented classification algorithms, the Random Forest algorithm with only NLP-based features gives the best performance, with a 97.98% accuracy rate for detection of phishing URLs [4].

4) Dharmaraj R. Patil and Jayantrao B. Patil (2018)

This paper gives a methodology to detect malicious URLs and the type of attacks based on multi-class classification. In this work, they proposed 42 new features of spam, phishing, and malware URLs. These features were not considered in earlier studies for phishing URL detection and attack type identification [3]. The training data for the developed tool was created with the help of 26041 benign and 23894 malicious URLs, containing 11297 malware, 8976 phishing, and 3621 spam URLs. Experiments are performed on the created dataset using machine learning classifiers [3].

5) Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi (2019)

The purpose of this study is to provide a thorough overview and a structural understanding of machine learning-based malicious URL detection methods. They outline the formal definition of malicious URL detection as a machine learning challenge, and classify and evaluate the contributions of literature works that address various aspects of this issue. The paper offers numerous URL- and content-based features that can be utilized to improve model training [2].

3. PROPOSED SYSTEM

The suggested system will have a client-server design. On the client side, a Chrome extension will be used to send the Uniform Resource Locator (URL) and the page source of the web page that the user is presently visiting to the server. On the server side, a cloud-based model for phishing site detection, trained using the random forest algorithm, will be deployed [1].

Fig -1: Proposed System Architecture

3.1 Training Dataset:

The dataset needed to train the model should be large, with two distinct classes: legitimate and phishing. Furthermore, the dataset should include a balanced mix of legitimate and phishing sites. PhishTank will mostly be the source of the phishing URLs [5]. For legitimate pages, Alexa, Statista, and Similarweb can be used to get pages with high traffic and better ranking, as these pages have a very low probability of being phishing web pages. This is because malicious websites tend to have less traffic and a lower ranking on search engines due to their limited life span. As a result, a dataset with 40000 URLs can be formed.

3.2 Feature Selection:

Feature selection is the process of selecting the best set of features for model training. It is an important step because it has a large impact on model accuracy in the real world. The proposed system uses URL-based and content-based features.

3.2.1 : URL based features:

These are features that are obtained from the URL that the user is currently visiting.

Fig -2: URL Components

1) Protocol check: Checks whether the protocol used is "https".

2) Word count: Number of words obtained after splitting the URL on special characters.

3) Average word length: Average length of the words obtained after parsing.

4) Character count: Total number of characters present in the URL.

5) Digit count: Total number of digits present in the URL.

6) Special characters count: Total number of special characters present in the URL.

7) Keyword count: Number of keywords such as login, gift, secure, etc.

8) Brand name count: Number of brand names such as facebook, gmail, etc.

9) Lookalike keywords count: Number of keyword lookalikes such as logiin, seccure, etc.

10) Lookalike brand name count: Number of brand-name lookalikes such as faceb00k, instagrom.

11) Random words: Words which are neither keywords nor brand names.

12) Length of file path: The length of the URL path.

13) Top level domain check: Verifies whether the top-level domain is one of the most widely used domains, such as com, edu, org, etc.

14) Occurrence of subdomains: Subdomains usually occur more often in malicious sites.
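As an illustration of how several of the URL-based features listed above might be computed, the following is a minimal Python sketch using only the standard library. The function name `url_features`, the word lists, and the rule that every host label left of the last two counts as a subdomain are assumptions made for this example; the paper does not publish its dictionaries or exact definitions.

```python
import re
from urllib.parse import urlparse

# Hypothetical word lists built from the examples in the text;
# the paper's actual dictionaries are not given.
KEYWORDS = {"login", "gift", "secure"}
BRANDS = {"facebook", "gmail"}
COMMON_TLDS = {"com", "edu", "org"}


def url_features(url: str) -> dict:
    """Compute a subset of the URL-based features for one URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Split on every run of characters that is not a letter or digit.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url.lower()) if w]
    labels = host.split(".") if host else []
    return {
        "is_https": int(parsed.scheme == "https"),
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "char_count": len(url),
        "digit_count": sum(c.isdigit() for c in url),
        "special_char_count": sum(not c.isalnum() for c in url),
        "keyword_count": sum(w in KEYWORDS for w in words),
        "brand_count": sum(w in BRANDS for w in words),
        "path_length": len(parsed.path),
        "common_tld": int(bool(labels) and labels[-1] in COMMON_TLDS),
        # Naive assumption: labels left of "domain.tld" are subdomains.
        "subdomain_count": max(len(labels) - 2, 0),
    }
```

The lookalike and random-word counts are omitted here because they depend on the word decomposition and similarity checks described in the pre-processing step.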

3.2.2 : Content based features:

These features can be derived from the source code of the page which the user wants to access. The features are:

1) Word count: Total number of text words present on the web page.

2) Average word length: Average length of the text words present on the web page.

3) Links count: Total number of links present in the web page.

4) Iframe tag count: Total number of iframe tags present in the web page.

5) Embed tag count: Total number of embed tags present in the web page.

6) Common phishing word count: Number of words such as pay, bonus, free, access, log, etc.
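The content-based features above could be extracted from page source along the following lines. This is a sketch under stated assumptions, not the authors' extractor: the class and function names and the `PHISHING_WORDS` list are hypothetical, links are taken to mean `<a>` tags, and text inside `<script>` and `<style>` is ignored when counting words.

```python
import re
from html.parser import HTMLParser

# Hypothetical word list based on the examples given in the text.
PHISHING_WORDS = {"pay", "bonus", "free", "access", "log"}


class ContentFeatureParser(HTMLParser):
    """Counts selected tags and collects visible text words."""

    def __init__(self):
        super().__init__()
        self.link_count = 0
        self.iframe_count = 0
        self.embed_count = 0
        self.words = []
        self._skip_depth = 0  # greater than 0 inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
        elif tag == "iframe":
            self.iframe_count += 1
        elif tag == "embed":
            self.embed_count += 1
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside script/style blocks counts as page words.
        if not self._skip_depth:
            self.words.extend(re.findall(r"[A-Za-z]+", data.lower()))


def content_features(html: str) -> dict:
    """Compute the content-based features for one page source string."""
    parser = ContentFeatureParser()
    parser.feed(html)
    parser.close()
    words = parser.words
    return {
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "link_count": parser.link_count,
        "iframe_count": parser.iframe_count,
        "embed_count": parser.embed_count,
        "phishing_word_count": sum(w in PHISHING_WORDS for w in words),
    }
```

In a deployed system the page source would come from the browser extension or a crawler; here it is simply passed in as a string.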

3.3 Data Pre-processing:

Raw data is transformed into usable formats during data preprocessing. Decomposers can be used on URLs to split them and extract the necessary parts in order to obtain attributes such as brand name, protocol, etc. The most well-known and frequently used brand names and keywords are gathered and checked for their presence in URLs. To extract URL-based features, the URL visited by the user is split into words using special characters. After that, brand name and keyword checks are performed on the obtained words. If a split word is not found in either dictionary, it is sent to a word decomposer, which can split two adjacent words in a string into two separate words. The word decomposer first creates substrings of the input string. A dictionary check is then made on the obtained substrings to identify the words present. If the string cannot be separated, the word's similarity to the available brand names and keywords is examined. If there is still no similarity, it is treated as a random word. Depending on the status of the word under review, the appropriate features are incremented. For content-based features, web crawling can be used to obtain the feature values.
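The word-handling steps described in this section (dictionary check, decomposition into two adjacent words, similarity check, and fallback to a random word) can be sketched as follows. The dictionaries, the function names, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.7 threshold are all assumptions chosen for illustration; the paper does not specify them.

```python
from difflib import SequenceMatcher

# Hypothetical dictionaries; the paper's actual lists are not given.
KEYWORDS = {"login", "secure", "gift", "account"}
BRANDS = {"facebook", "gmail", "instagram"}
KNOWN = KEYWORDS | BRANDS


def decompose(word):
    """Try to split a string into two adjacent dictionary words."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in KNOWN and right in KNOWN:
            return [left, right]
    return None


def closest(word, vocabulary, threshold=0.7):
    """Return the most similar entry, or None if below the assumed threshold."""
    def score(entry):
        return SequenceMatcher(None, word, entry).ratio()

    best = max(sorted(vocabulary), key=score)
    return best if score(best) >= threshold else None


def classify_word(word):
    """Label one URL word; a decomposed word yields one label per part."""
    if word in KEYWORDS:
        return ["keyword"]
    if word in BRANDS:
        return ["brand"]
    parts = decompose(word)
    if parts:
        return [classify_word(part)[0] for part in parts]
    if closest(word, KEYWORDS):
        return ["lookalike_keyword"]
    if closest(word, BRANDS):
        return ["lookalike_brand"]
    return ["random"]
```

Each returned label corresponds to one of the URL-based counters (keyword, brand name, lookalike keyword, lookalike brand name, or random word) that would be incremented for the URL under review.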

3.4 Classifiers:

With the dataset created, multiple classifiers were trained. The Random Forest algorithm was the most accurate. Random forests are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. The trained model will be deployed using cloud services in the proposed system.
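To make the ensemble idea concrete, the toy sketch below trains many one-split trees ("decision stumps"), each on a bootstrap sample and a randomly chosen feature, and classifies by majority vote. It illustrates only the bagging-and-voting principle behind random forests; it is not the authors' model, the function names and sample data are invented for this example, and a real system would normally rely on a full library implementation with deeper trees.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Find the threshold on one feature with the fewest training errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        for above in (0, 1):  # label predicted when value > threshold
            errors = sum(
                (above if row[feature] > threshold else 1 - above) != label
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    return feature, best[1], best[2]


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        while True:
            # Bootstrap: sample row indices with replacement.
            idx = [rng.randrange(len(rows)) for _ in rows]
            # Simplification: redraw samples that contain only one class.
            if len({labels[i] for i in idx}) > 1:
                break
        feature = rng.randrange(len(rows[0]))
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest


def predict(forest, row):
    """Majority vote over all stumps (1 = phishing, 0 = legitimate)."""
    votes = Counter(
        above if row[feature] > threshold else 1 - above
        for feature, threshold, above in forest
    )
    return votes.most_common(1)[0][0]


# Made-up feature vectors: [is_https, digit_count, subdomain_count].
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 5, 3], [0, 7, 4], [0, 6, 2]]
y = [0, 0, 0, 1, 1, 1]
forest = train_forest(X, y)
```

Calling `predict(forest, [0, 9, 5])` then returns the class chosen by most of the stumps for that feature vector.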

Table -1: Accuracy of Classifiers trained

Classifier                Accuracy (%)
Naive Bayes               91.9832
Support Vector Machine    94.5901
Neural Network            96.3394
Random Forest             97.3659
K-Nearest Neighbor        97.1384

4. CONCLUSIONS

Phishing websites are one of the most challenging security problems faced recently due to the rise of web pages worldwide, and classifying these websites as legitimate or phishing is one of the challenging aspects. The Detection and Prevention of Phishing Websites system offers security for users who can easily fall into a trap due to a lack of awareness or technical knowledge. A system is therefore developed with enriched features, and a Random Forest classifier is used to achieve better accuracy. The trained model will be deployed using cloud services. On the client side, a browser extension, which is a small software module for customizing a web browser, will be used [11]. The extension will send the URL that the user is attempting to access to a cloud-based feature extractor, which will then supply the extracted features to the model for detection. Furthermore, the classification result will be returned to the extension to restrict user access to the page if it is a phishing site.

The proposed system will have the following advantages:

1. Real-time Execution.

2. Huge Size of Phishing and Legitimate Data.

3. Detection of new Websites.

4. Independence from Third-Party Services.

5. Use of Enriched Features.

REFERENCES

[1] Mehek Thakera, Mihir Parikhb, Preetika Shettyc, Vinit Neogid, Shree Jaswale. "Detecting phishing websites using Data Mining" 2018.

[2] Doyen Sahoo, Chenghao Liu, and Steven C.H. Hoi. "Malicious URL detection using Machine Learning: A Survey" 2019.


[3] Dharmaraj R. Patil and Jayantrao B. Patil. "Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification" July 2018, Volume 10, Number 2.

[4] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri. "Machine learning based phishing detection from URLs" 25 July 2018.

[5] Ebubekir Buber, Banu Diri, and Ozgur Koray Sahingoz. "NLP based phishing attack detection from URLs." November 2007.

[6] Varsharani Hawanna, V. Y. Kulkarni, R. A. Rane. "A novel algorithm to detect phishing URLs" 2016.

[7] Yu Zhou, Yongzheng Zhang, Jun Xiaon, Yipeng Wang, Weiyao Lin. "Visual similarity based anti-phishing with the combination of local and global features" 2014.

[8] M. Amaad Ul Haq Tahir, Sohail Asghar, Ayesha Zafar, Saira Gillani. "Hybrid model to detect phishing sites using Supervised Learning Algorithms" 2016.

[9] Srushti Patil, Sudhir Dhage. "A Methodical Overview on Phishing Detection along with an Organized Way to Construct an Anti-Phishing Framework" 2019.

[10] Microsoft Contributors. Phishing [online] Available: https://docs.microsoft.com/en-us/windows/security/threat-protection/intelligence/phishing

[11] Wikipedia Contributors. Browser Extension [online] Available: https://en.wikipedia.org/wiki/Browser_extension
