Phishing Website Detection Using Machine Learning
Megha Agarwal1, Shruti Jani2, Hansika Koli3, Prof. Deepali Maste4
B.E. Student, Information Technology Engineering, Atharva College of Engineering, Mumbai, India
***
Abstract - Phishing websites are an internet security problem that targets human vulnerabilities rather than software vulnerabilities. Phishing can be defined as the process of luring online users into disclosing sensitive information such as usernames and passwords. In this paper, we present a practical system for detecting phishing websites. The system adds functionality to a web browser as an extension that automatically notifies the user when it detects a phishing website. The system is based on a machine learning approach, specifically supervised learning. We selected the XGBoost method because of its strong performance in classification. Our focus is to obtain a better-performing classifier by analyzing the features of phishing websites and choosing the best combination of them to train the classifier. As a result, we achieved an accuracy of 99% with 48 features.
Key Words: XGBoost (Extreme Gradient Boosting), Classifier, Features, Phishing, Train, Accuracy.
1. INTRODUCTION
Internet and cloud technology improvements in recent years have significantly increased electronic trade, or consumer-to-consumer online transactions. The resources of an enterprise are harmed by this growth, which permits unauthorised access to sensitive information about users. One well-known attack that manipulates users into accessing dangerous content and giving up their information is phishing. Most phishing websites mimic the website interface and uniform resource locator (URL) of the legitimate websites.
1.1 Purpose
Internet consumers lose billions of dollars each year as a consequence of website phishing. Phishers prey on people's online security by stealing usernames, passwords, and financial account information. Due to the use of URL obfuscation to shorten the URL, link redirections, modifying links to make them appear trustworthy, and a long list of other techniques, detecting phishing websites is difficult. This made it necessary to move from conventional programming methods to an approach based on machine learning.
When new phishing strategies are launched, phishing detection solutions suffer from low detection accuracy and high false alarm rates. Additionally, since registering new domains has become simpler, the most popular methodology, the blacklist-based method, is ineffective at responding to the rising number of phishing attacks. No blacklist, however comprehensive, can guarantee a perfectly up-to-date database.
1.2 Objective
To create a method for identifying dangerous URLs and warning users.
To use ML approaches in the suggested approach to analyse real-time URLs and generate useful results.
To implement the idea of RNN, a well-known ML technique that can handle enormous volumes of data.
The use of machine learning is crucial in preventing phishing attacks. This study investigates characteristics and techniques for machine learning-based detection.
2. LITERATURE REVIEW
MAHAJAN MAYURI VILAS, KAKADE PRACHI GHANSHAM, SAWANT PURVA JAYPRALASH and PAWAR SHILA [1], in their paper “Detection of Phishing Website Using Machine Learning Approach”, state that the goal of the study is to carry out ELM employing 30 different primary components that are characterized using ML. To avoid being discovered, most phishing URLs use HTTPS. Website phishing can be identified in three different ways. The first method evaluates several URL components; the second method assesses a website's authority, determines if it has been introduced or not, and determines who is in charge of it; the third method verifies a website's veracity.
In [2], MALAK ALJABRI and SAMIHA MIRZA presented the paper “Phishing Attacks Detection using Machine Learning and Deep Learning Models”. In this study, the most highly correlated features from two distinct datasets were chosen. These features combined content-based, URL-based and domain-based features. Then, the performance of a number of ML models was compared. The results also sought to pinpoint the top characteristics that aid the algorithm in spotting phishing websites. The Random Forest (RF) method produced the best classification results for both datasets.
In the study by ADARSH MANDADI and SAIKIRAN BOPPANA [3], the URLs received from the user are entered into the machine learning model, which then processes the input and reports the results, indicating whether the URLs are phishing or not. SVM, Neural Networks, Random Forest, Decision Tree, XGBoost, and other machine learning algorithms can all be used to categorize these URLs. The suggested method uses the Random Forest and Decision Tree classifiers. With an accuracy of 87.0% and 82.4% for the Random Forest and Decision Tree classifiers, respectively, the suggested technique successfully distinguished between phishing and legitimate URLs.
In [4], HEMALI SAMPAT, MANISHA SAHARKAR, AJAY PANDEY and HEZAL LOPES proposed a system for the detection of phishing websites using machine learning. Their proposed method uses both classification and association algorithms to optimise the system, making it faster and more effective than the current approach. The proposed system's error rate is reduced by 30% by combining these two algorithms with the WHOIS protocol, making it an effective technique to identify phishing websites.
SAFA ALREFAAI, GHINA ÖZDEMIR and AFNAN MOHAMED [5] used machine learning to detect phishing websites. They used Kaggle data with 86 features and 11,430 total URLs, half of which are phishing and half of which are legitimate. They trained their data using Decision Tree (DT), Random Forest (RF), XGBoost, Multilayer Perceptrons, K-Nearest Neighbors, Naive Bayes, AdaBoost, and Gradient Boosting, with XGBoost
In [6], SUNDARA PANDIYAN S, PRABHA SELVARAJ, VIJAY KUMAR BURUGARI, JULIAN BENADIT P and KANMANI P employed a wide range of techniques, including Decision Tree, Random Forest, Multi-Layer Perceptrons, XGBoost Classifier, SVM, LightGBM Classifier, and CatBoost Classifier. Their team found that LightGBM had the best precision, with an average accuracy of about 85.5%. One-class SVM, on the other hand, had the lowest precision, at about 79.6%.
3. PROPOSED SOLUTION
Each type of phishing differs slightly in how the procedure is carried out to deceive the unwary customer. When a hacker sends a potential user an email with a link that takes them to phishing websites, this is known as an email phishing attack.
We use different machine learning models trained on features such as whether the URL contains @, whether it has double-slash redirecting, the page rank of the URL, the number of external links embedded on the webpage, etc. We also applied a neural network perceptron to this data and were able to achieve better accuracy. This approach could reach up to a 92% true positive rate and a 0.4% false positive rate.
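As a minimal sketch of the two URL-based features named above, the following function derives them from the URL string alone. The function name, the dictionary keys and the example URLs are illustrative assumptions, not part of the described system; page rank and external link counts are left out because they need network access or page content.

```python
def extract_url_features(url: str) -> dict:
    """Return two lexical features of a URL as 0/1 flags (illustrative names)."""
    # An "@" in a URL makes browsers ignore everything before it in the
    # authority part, which phishers use to hide the real destination.
    has_at_symbol = int("@" in url)

    # A legitimate URL normally contains "//" only right after the scheme.
    # Any further "//" suggests an embedded redirect target.
    rest = url.split("://", 1)[-1]
    double_slash_redirect = int("//" in rest)

    return {
        "has_at_symbol": has_at_symbol,
        "double_slash_redirect": double_slash_redirect,
    }


# Hypothetical example URLs, used only for demonstration.
suspicious = extract_url_features("http://legit.example.com@evil.example//redirect")
benign = extract_url_features("https://www.example.com/login")
print(suspicious)
print(benign)
```

Each flag becomes one column of the feature table that the classifiers are trained on.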
4. METHODOLOGY
Data collection, cleaning, and consolidation into a single file or data table are all steps in the process of data preparation, which is done largely for analytical purposes, as shown in Fig 2.1. The following are the main activities we use for data preparation: data reduction, data transformation, data integration, and data discretization.
The required libraries, including XGBoost, NumPy, Matplotlib, and Pandas, are loaded first. The dataset from Kaggle is then imported. "Phishing Legitimate full" is the name of the dataset that we selected. After importing it, we divided the dataset into training and testing sets using train_test_split from sklearn. 20% of the dataset is used for testing, while 80% is used for the training set.
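The 80/20 split described above can be sketched as follows. Because the Kaggle file is not reproduced here, the sketch uses a random stand-in array with 48 feature columns. The row count, the random seed and the use of stratification are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the Kaggle table (assumption): 1,000 rows, 48 numeric feature
# columns and a binary label (1 = phishing, 0 = legitimate). Random values only.
rng = np.random.default_rng(0)
X = rng.random((1000, 48))
y = np.array([0, 1] * 500)

# Hold out 20% of the rows for testing and keep 80% for training.
# stratify=y keeps the class balance similar in both parts (our assumption;
# the paper does not say whether the split was stratified).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, which helps when the accuracy of several models is compared on the same test set.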
We have set up a pipeline that uses five distinct algorithms, namely Logistic Regression, KNeighborsClassifier, Random Forest, Decision Tree, and XGBoost, to compare the accuracy of the various techniques. We first fit each model and generate predictions, and then we evaluate the model. For this evaluation, the test data is used. We compare the accuracy of each method using several evaluation tools, such as the confusion matrix, to obtain the best result.
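The fit, predict and evaluate loop described above might look like the sketch below. It runs on synthetic data from make_classification rather than the Kaggle dataset, all hyperparameters are assumptions, and if the separate xgboost package is not installed it substitutes scikit-learn's GradientBoostingClassifier as a stand-in. The printed accuracies therefore say nothing about the results reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

try:  # XGBoost is a separate package; use a stand-in if it is missing.
    from xgboost import XGBClassifier
    boosted_name, boosted = "XGBoost", XGBClassifier(n_estimators=100, random_state=0)
except ImportError:
    boosted_name = "GradientBoosting (stand-in)"
    boosted = GradientBoostingClassifier(random_state=0)

# Synthetic stand-in data (assumption): 1,000 samples with 48 features.
X, y = make_classification(n_samples=1000, n_features=48,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    boosted_name: boosted,
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)       # model fitting
    y_pred = model.predict(X_test)    # predictions on the held-out test data
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

# Print the models from most to least accurate.
for name, r in sorted(results.items(), key=lambda kv: kv[1]["accuracy"], reverse=True):
    print(f"{name}: accuracy={r['accuracy']:.4f}")
```

The confusion matrix complements accuracy because it separates false positives (legitimate sites flagged as phishing) from false negatives (phishing sites missed).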
5. SCOPE
Internet customers lose billions of dollars every year due to website phishing. Phishers prey on people's online security by stealing their usernames, passwords, and financial account information. The COVID-19 epidemic has increased technology use across all industries, leading to a transition from offline to online spaces for tasks like scheduling official meetings, attending classes, shopping, and making payments. This means that phishers will have more chances to carry out attacks that harm the victim's finances, psychological well-being, and professional prospects.
Carrying out such attacks can be made much more difficult by introducing browser extensions or sophisticated GUIs that analyse URLs to determine whether they are legitimate or phishing sites. We are currently getting closer to launching the browser extension for this project and can even test out the GUI option. Further features can be added as soon as possible. We are eager to create a complete programme that, rather than requiring verification, immediately blocks the website.
6. RESULT
To acquire useful results, we compared a number of algorithms. There are numerous algorithms that can be used to identify phishing websites; however, after reviewing numerous research articles, we settled on five algorithms to test the model.
6.1 Model Comparison
Table -1:
6.2 Model Output
Fig -5.2.1: Output
XGBoost is a distributed gradient boosting library that was created to be highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that addresses a number of data science problems quickly and accurately.
XGBoost is an ensemble learning method. Generally speaking, it is not sufficient to rely on only one machine learning model. Ensemble learning offers a systematic way to combine the predictive power of multiple learners. The end result is a single model that gives the aggregated output of several models.
We trained our data using Logistic Regression, KNeighborsClassifier, Random Forest, Decision Tree, and XGBoost, with XGBoost achieving the highest accuracy of 99.05%.
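The ensemble idea described above can be written compactly in the notation commonly used for gradient tree boosting. The symbols below are not taken from this paper and are introduced only for illustration: K is the number of trees, f_k is the k-th regression tree drawn from the space of trees F, l is a differentiable loss, and Omega is a regularization term that penalizes tree complexity.

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

\mathcal{L} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k)
```

The first line states that the prediction for a sample is the sum of the outputs of all trees. The second line is the objective minimized during training, with trees added one at a time so that each new tree reduces the remaining loss.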
7. CONCLUSIONS
To the best of our knowledge, this study is the first analysis to include the findings of all other studies on the detection of phishing websites using machine learning techniques. The proposed research on phishing uses a classification paradigm, in which phishing detection is treated as automatically classifying websites into a given set of class values depending on a variety of features and the class variable.
Website features are used by ML-based phishing detection approaches to collect information that can be used to classify websites for the purpose of identifying phishing sites. Developing focused anti-phishing approaches and methods, as well as minimizing their inconvenience, are two ways to prevent phishing.
We achieved 99.05% detection accuracy using the XGBoost algorithm, with the lowest false positive rate. The results also show that the classifiers perform better when more data is used for training.
REFERENCES
[1] M. M. Vilas, K. P. Ghansham, S. P. Jaypralash and P. Shila, "Detection of Phishing Website Using Machine Learning Approach," 2019 4th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques (ICEECCOT), Mysuru, India, 2019, pp. 384-389, doi: 10.1109/ICEECCOT46775.2019.9114695.
[2] M. Aljabri and S. Mirza, "Phishing Attacks Detection using Machine Learning and Deep Learning Models," 2022 7th International Conference on Data Science and Machine Learning Applications (CDMA), Riyadh, Saudi Arabia, 2022, pp. 175-180, doi: 10.1109/CDMA54072.2022.00034.
[3] A. Mandadi, S. Boppana, V. Ravella and R. Kavitha, "Phishing Website Detection Using Machine Learning," 2022 IEEE 7th International conference for Convergence in Technology (I2CT), Mumbai, India, 2022, pp. 1-4, doi: 10.1109/I2CT54291.2022.9824801.
[4] Hemali Sampat, Manisha Saharkar, Ajay Pandey and Hezal Lopes, "Detection of Phishing Website Using Machine Learning," International Research Journal of Engineering and Technology (IRJET), 2018, e-ISSN: 2395-0056, p-ISSN: 2395-0072.
[5] S. Alrefaai, G. Özdemir and A. Mohamed, "Detecting Phishing Websites Using Machine Learning," 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 2022, pp. 1-6, doi: 10.1109/HORA55278.2022.9799917.
[6] Sundara Pandiyan S, Prabha Selvaraj, Vijay Kumar Burugari, Julian Benadit P and Kanmani P, "Phishing attack detection using Machine Learning," Measurement: Sensors, Volume 24, 2022, 100476, ISSN 2665-9174.
BIOGRAPHIES
Megha Agarwal
I.T. Engineer (2019-2023) from Atharva College of Engineering, Malad, Mumbai
Shruti Jani
I.T. Engineer (2019-2023) from Atharva College of Engineering, Malad, Mumbai
Hansika Koli
I.T. Engineer (2019-2023) from Atharva College of Engineering, Malad, Mumbai