International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 09 Issue: 12 | Dec 2022 www.irjet.net p-ISSN: 2395-0072
Adwait Changan1, Vaibhav Mahalle2, Prafull Patil3, Prabuddha Salve4
1, 2, 3, 4 Department of Information Technology, Sinhgad College of Engineering, Pune, India.
Abstract – Today's generation is increasingly reliant on web technology for a variety of tasks such as banking and communication. As a result, users can encounter multiple security threats, with phishing being one of the most serious and prominent attacks. Phishing attacks attempt to steal sensitive information from users by impersonating a legitimate entity. The attacker impersonates a genuine website to obtain the victim's credentials, such as bank account numbers, passwords, or other sensitive information, while the victim remains unaware that the website is a phishing site. In this paper, a system is proposed to detect phishing sites in real time using machine learning, by utilizing a classifier that is trained on an exhaustive dataset with enriched features.
Key Words: Phishing, Cyber Security, Random Forest, Malicious URL detection, Machine learning in Cybersecurity
Today, the majority of individual and organizational communication and interaction takes place via the internet, and this trend is expected to continue and grow. People are currently heavily reliant on web technology, and most are unaware of cyber threats due to a lack of technical knowledge. Numerous organizations and businesses have already been confronted with the threat and problem of cyber-attacks. Among the different types of cyber-attacks, phishing is a significant one that can threaten online users' identities. Phishing attacks typically involve an attacker who acts as a credible resource in order to steal sensitive data from victims. Victims of successful attacks visit phishing websites without recognising them. Once on the website, users may provide private information such as passwords or banking-related sensitive information, and they are also at risk of downloading malware that the attacker has placed on the site.
This paper's main objective is to present a technique for identifying phishing websites. There are approximately five approaches to this problem, including the Blacklisting strategy, the Rule-based or Heuristics-based approach, the Content-based approach, and the Machine Learning approach, which is further improved by the Hybrid approach [9]. The suggested method uses a model that is trained on the website's contents and URL-based features to determine whether a site is a legitimate website or a phishing website.
1) Mehek Thakera, Mihir Parikhb, Preetika Shettyc, Vinit Neogid, Shree Jaswale
This paper proposes a system that detects old and newly generated phishing URLs using Data Mining. A cloud-based classifier is developed which takes the features of a URL as input. The model will be deployed using a Chrome extension. The model will be trained with an exhaustive dataset and uses URL-based and Domain-based features to ensure maximum accuracy [1].
2) Srushti Patil, Sudhir Dhage
A comparative study of the important anti-phishing tools was completed and their limitations were pointed out. This paper analyzed the URL-based features used in the past and improved their definitions as per the current scenario [9]. A full implementation of the anti-phishing tool is shown in the paper, and the accuracy and observations of the developed tool are given [9].
3) Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri
In this paper, a real-time anti-phishing system, using seven different classification algorithms and natural language processing (NLP) based features, is proposed [4]. The system has the following distinguishing properties: language independence, use of a huge size of phishing and legitimate data, real-time execution, detection of new websites, independence from third-party services and use of feature-rich classifiers [4]. A new dataset is constructed for measuring the performance of the system, and the experimental results are tested on it. According to the experimental results from the implemented classification algorithms, the Random Forest algorithm with only NLP-based features gives the best performance, with a 97.98% accuracy rate for detection of phishing URLs [4].
4) Dharmaraj R. Patil and Jayantrao B. Patil (2018)
This paper gives a methodology to detect malicious URLs and the type of attack based on multi-class classification. In this work, they proposed 42 new features of spam, phishing and malware URLs. These features were not considered in the earlier studies for phishing URL detection and attack type identification [3]. The training data for the developed tool was created with the help of 26041 benign and 23894
malicious URLs containing 11297 malware, 8976 phishing and 3621 spam URLs. Experiments are performed on the created dataset using machine learning classifiers [3].
The purpose of this study is to provide a thorough overview and a structural understanding of machine learning-based malicious URL detection methods. The authors outline the formal definition of malicious URL detection as a machine learning task, then classify and evaluate the contributions of literature works that address various aspects of this issue. The paper describes numerous URL-based and content-based features that can be utilized to improve model training [2].
The suggested system will have a client-server design. On the client side, a Chrome extension will be used to send the Uniform Resource Locator (URL) and the page source of the web page that the user is presently visiting to the server. On the server side, a cloud-based model for phishing site detection will be constructed, which is trained using the random forest algorithm [1].
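The server side of this exchange could look roughly like the following minimal sketch. The JSON field names `url`, `page_source` and `phishing`, the function `handle_detection_request`, and the placeholder `dummy_classifier` are all assumptions made for illustration; the paper does not specify a message format, and the real classifier would be the trained random forest model.

```python
import json
from typing import Callable


def handle_detection_request(body: str, classify: Callable[[str, str], bool]) -> str:
    """Handle one request sent by the browser extension.

    `body` is a JSON string with the hypothetical fields "url" and
    "page_source"; `classify` stands in for the trained model and returns
    True when the page is judged to be phishing.
    """
    payload = json.loads(body)
    url = payload["url"]
    page_source = payload.get("page_source", "")
    is_phishing = classify(url, page_source)
    # The extension can use this verdict to block or allow the page.
    return json.dumps({"url": url, "phishing": is_phishing})


def dummy_classifier(url: str, page_source: str) -> bool:
    # Placeholder rule for illustration only; NOT the paper's model.
    return not url.startswith("https://")
```

Keeping the handler independent of any web framework makes it easy to plug into whichever cloud service hosts the model.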
The dataset needed to train the model should be large, with two distinct classes: legitimate and phishing. Furthermore, the dataset should include a balanced mix of legitimate and phishing sites. PhishTank will be the main source of the phishing URLs [5]. For legitimate pages, Alexa, Statista, and Similarweb can be used to get pages with high traffic and better ranking, as these pages have a very low probability of being phishing web pages. This is because malicious websites have less traffic and a lower ranking on search engines due to their limited life span. As a result, a dataset with 40000 URLs can be formed.
3.2
Feature selection is the process of selecting the best set of features for model training. It is an important process because it has a large impact on model accuracy in the real world. In the proposed system, the features used are URL-based and content-based features.
3.2.1 : URL based features:
These are features that are obtained from the URL that the user is currently visiting.
1) Protocol check: Checks whether the protocol used is "https".
2) Word count: Number of words obtained after splitting the URL on special characters.
3) Average word length: Average length of the words obtained after splitting.
4) Character count: Total number of characters present in the URL.
5) Digit count: Total number of digits present in the URL.
6) Special characters count: Total number of special characters present in the URL.
7) Keyword count: Number of keywords such as login, gift, secure, etc.
8) Brand name count: Number of brand names such as facebook, gmail, etc.
9) Lookalike keywords count: Number of lookalike keywords such as logiin, seccure, etc.
10) Lookalike brand name count: Number of lookalike brand names such as faceb00k, instagrom.
11) Random words: Words that are neither keywords nor brand names.
12) Length of file path: The length of the path component of the URL.
13) Top level domain check: Verifies whether the top level domain is one of the most widely used domains, such as com, edu, org, etc.
14) Occurrence of subdomains: Subdomains usually occur more often in malicious sites.
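Most of the URL-based features listed above can be computed with the Python standard library, as in the following minimal sketch. The keyword, brand and top level domain sets are tiny illustrative stand-ins built from the examples in the list (the paper's actual dictionaries are not given), the function and key names are hypothetical, and the subdomain count assumes a two-label registered domain. The lookalike and random-word features are left out because they depend on the word-level preprocessing step.

```python
import re
from urllib.parse import urlparse

# Tiny illustrative dictionaries; the paper's real lists are not given.
KEYWORDS = {"login", "gift", "secure"}
BRANDS = {"facebook", "gmail"}
COMMON_TLDS = {"com", "edu", "org"}


def url_features(url: str) -> dict:
    parsed = urlparse(url)
    host = parsed.hostname or ""
    labels = host.split(".") if host else []
    # Split the URL into words on any non-alphanumeric character.
    words = [w.lower() for w in re.split(r"[^A-Za-z0-9]+", url) if w]
    return {
        "uses_https": parsed.scheme == "https",
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "char_count": len(url),
        "digit_count": sum(c.isdigit() for c in url),
        "special_char_count": sum(not c.isalnum() for c in url),
        "keyword_count": sum(w in KEYWORDS for w in words),
        "brand_count": sum(w in BRANDS for w in words),
        "path_length": len(parsed.path),
        "common_tld": bool(labels) and labels[-1] in COMMON_TLDS,
        # Assumption: the last two host labels form the registered domain.
        "subdomain_count": max(len(labels) - 2, 0),
    }
```

The returned dictionary can be turned into a fixed-order feature vector before it is passed to a classifier.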
3.2.2 : Content based features:
These features can be derived from the source code of the page which the user wants to access. The features are:
1) Word count: Total number of text words present on the web page.
2) Average word length: Average length of the text words present on the web page.
3) Links count: Total number of links present in the web page.
4) Iframe tag count: Total number of iframe tags present in the web page.
5) Embed tag count: Total number of embed tags present in the web page.
6) Common phishing word count: Number of words such as pay, bonus, free, access, log, etc.
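The content-based features above can be sketched with Python's built-in `html.parser` module. This is only an illustration under stated assumptions: links are counted as `<a>` tags, text inside `<script>` and `<style>` is ignored, the phishing word set contains only the examples named in the list, and the class, function and key names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Only the example words named in the text; the full list is not given.
PHISHING_WORDS = {"pay", "bonus", "free", "access", "log"}


class ContentFeatureParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.tag_counts = {"a": 0, "iframe": 0, "embed": 0}
        self.words = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        if tag in self.tag_counts:
            self.tag_counts[tag] += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collect visible text words only.
        if not self._skip_depth:
            self.words += re.findall(r"[A-Za-z0-9]+", data.lower())


def content_features(html: str) -> dict:
    parser = ContentFeatureParser()
    parser.feed(html)
    parser.close()
    words = parser.words
    return {
        "word_count": len(words),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "link_count": parser.tag_counts["a"],
        "iframe_count": parser.tag_counts["iframe"],
        "embed_count": parser.tag_counts["embed"],
        "phishing_word_count": sum(w in PHISHING_WORDS for w in words),
    }
```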
Raw data is transformed into usable formats during data preprocessing. Decomposers can be used on URLs to split them and extract the necessary parts in order to obtain attributes such as brand name, protocol, etc. The most well-known and frequently used brand names and keywords are gathered and checked for their presence in URLs. To extract URL-based features, the URL visited by the user is split into words using special characters. After that, brand name and keyword checks are performed on the obtained words. If a split word is found in neither dictionary, it is sent to a word decomposer, which can split two adjacent words in a string into two separate words. The word decomposer first creates substrings of the input string. Then a dictionary check is made on the obtained substrings to identify the words present. If the word cannot be separated, its similarity to the available brand names and keywords is examined. If there is still no similarity, it is treated as a random word. Depending on the status of the word under review, the appropriate features are incremented. For content-based features, web crawling can be done to get the values of the features.
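The word-level checks described above can be sketched as follows. Several parts are assumptions, because the paper does not specify them: the small dictionaries, the use of `difflib.SequenceMatcher` as the similarity measure, the 0.7 similarity threshold, and the function names `decompose`, `is_lookalike` and `classify_word`.

```python
from difflib import SequenceMatcher

# Small illustrative dictionaries; the paper's real lists are not given.
KEYWORDS = {"login", "secure", "gift"}
BRANDS = {"facebook", "gmail"}
DICTIONARY = KEYWORDS | BRANDS | {"account", "verify"}


def decompose(word: str) -> list:
    """Split `word` into two adjacent dictionary words, or return []."""
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in DICTIONARY and right in DICTIONARY:
            return [left, right]
    return []


def is_lookalike(word: str, targets: set, threshold: float = 0.7) -> bool:
    # Assumption: a similarity ratio of at least 0.7 counts as a lookalike.
    return any(
        SequenceMatcher(None, word, t).ratio() >= threshold for t in targets
    )


def classify_word(word: str) -> list:
    """Return the feature labels that `word` should increment."""
    if word in KEYWORDS:
        return ["keyword"]
    if word in BRANDS:
        return ["brand"]
    parts = decompose(word)
    if parts:
        return [label for p in parts for label in classify_word(p)]
    if is_lookalike(word, BRANDS):
        return ["lookalike_brand"]
    if is_lookalike(word, KEYWORDS):
        return ["lookalike_keyword"]
    return ["random"]
```

For example, `securelogin` is decomposed into two keywords, while `faceb00k` is not an exact match but is similar enough to `facebook` to count as a lookalike brand name.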
With the dataset created, multiple classifiers were trained. The Random Forest algorithm was the most accurate. Random forests are an ensemble learning method for classification, regression, and other tasks that operate by constructing a multitude of decision trees at training time. The trained model will be deployed using cloud services in the proposed system.
Table -1: Accuracy of Classifiers trained

| Classifier | Accuracy (%) |
|---|---|
| Naive Bayes | 91.9832 |
| Support Vector Machine | 94.5901 |
| Neural Network | 96.3394 |
| Random Forest | 97.3659 |
| K-Nearest Neighbor | 97.1384 |
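To illustrate how an ensemble of trees trained on bootstrap samples reaches a decision by majority vote, the following is a deliberately simplified sketch in plain Python. It is not the classifier used to obtain the accuracies above: each "tree" is only a one-level decision stump on one randomly chosen feature, whereas a real random forest grows deeper trees, and in practice a library implementation would normally be used. The toy feature rows and all function names are assumptions.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Find the threshold on one feature that best separates the labels."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # the feature is constant in this sample
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, *best[1:])


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        sample = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        # A random feature per tree decorrelates the ensemble members.
        feature = rng.randrange(n_features)
        forest.append(train_stump(sample, sample_labels, feature))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Toy rows: [digit_count, subdomain_count]; label 1 = phishing, 0 = legitimate.
ROWS = [[0, 0], [1, 0], [0, 1], [1, 1], [6, 3], [7, 4], [8, 3], [9, 5]]
LABELS = [0, 0, 0, 0, 1, 1, 1, 1]
```

Calling `train_forest(ROWS, LABELS)` and then `predict` on a new feature row returns the label chosen by most of the stumps.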
Phishing websites are one of the most challenging security problems of recent times due to the rise in web pages worldwide, and classifying these websites as legitimate or phishing is one of the challenging aspects. The Detection and Prevention of Phishing Websites system offers security for users who can easily fall into a trap due to a lack of awareness or technical knowledge. So, a system is developed with enriched features, and a Random Forest classifier is used to achieve better accuracy. The trained model will be deployed using cloud services. On the client side, a browser extension, which is a small software module for customizing a web browser, will be used [11]. The extension will send the URL that the user is attempting to access to a cloud-based feature extractor, which will then supply the extracted features to the model for detection. Furthermore, the classification result will be returned to the extension, which restricts user access to the page if it is a phishing site.
The proposed system will have the following advantages:
1. Real-time execution.
2. Huge size of phishing and legitimate data.
3. Detection of new websites.
4. Independence from third-party services.
5. Use of enriched features.
[1] Mehek Thakera, Mihir Parikhb, Preetika Shettyc, Vinit Neogid, Shree Jaswale. "Detecting phishing websites using Data Mining" 2018.
[2] Doyen Sahoo, Chenghao Liu, and Steven C.H. Hoi. "Malicious URL detection using Machine Learning: A Survey" 2019.
[3] Dharmaraj R. Patil and Jayantrao B. Patil. "Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification" July 2018, Volume 10, Number 2.
[4] Ozgur Koray Sahingoz, Ebubekir Buber, Onder Demir, Banu Diri. "Machine learning based phishing detection from URLs" 25 July 2018.
[5] Ebubekir Buber, Banu Diri, and Ozgur Koray Sahingoz. "NLP based phishing attack detection from URLs." November 2007.
[6] Varsharani Hawanna, V. Y. Kulkarni, R. A. Rane. "A novel algorithm to detect phishing URLs" 2016.
[7] Yu Zhou, Yongzheng Zhang, Jun Xiaon, Yipeng Wang, Weiyao Lin. "Visual similarity based anti-phishing with the combination of local and global features" 2014.
[8] M. Amaad Ul Haq Tahir, Sohail Asghar, Ayesha Zafar, Saira Gillani. "Hybrid model to detect phishing sites using Supervised Learning Algorithms" 2016.
[9] Srushti Patil, Sudhir Dhage. "A Methodical Overview on Phishing Detection along with an Organized Way to Construct an Anti-Phishing Framework" 2019.
[10] Microsoft Contributors. Phishing [online]. Available: https://docs.microsoft.com/enus/windows/security/threatprotection/intelligence/phishing
[11] Wikipedia Contributors. Browser Extension [online]. Available: https://en.wikipedia.org/wiki/Browser_extension
2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal