DETECTION OF PHISHING WEBSITES USING MACHINE LEARNING


1 Assistant Professor, Department of CSE, Ballari Institute of Technology & Management, Ballari

2,3,4,5 Final Year Students, Department of CSE, Ballari Institute of Technology & Management, Ballari

Abstract - Phishing is a technique widely employed to deceive unsuspecting people into exposing their personal information by means of fake websites. Phishing website URLs are created with the intention of collecting user data such as usernames, passwords, and details of online financial activities. Phishers use websites that are semantically and visually identical to the authentic ones. As technology advances, phishing strategies evolve rapidly, and anti-phishing technologies that recognize phishing can curb this evolution. Machine learning is a powerful tool for preventing phishing attempts. The five techniques used in this paper are AdaBoost Classifier, XGBoost Classifier, Random Forest Classifier, Gradient Boosting Classifier, and Support Vector Machine (SVM).

Key Words: AdaBoost Classifier, XGBoost Classifier, Random Forest Classifier, Gradient Boosting Classifier and Support Vector Machine (SVM).

1. INTRODUCTION

Phishing is a risky illegal activity in the online world. Phishing attempts have increased considerably over the years as more people use the online services offered by governmental and private institutions. Con artists adopted phishing once they discovered that it was a profitable business model. Phishers use a range of methods to target the unwary, including voice over IP (VoIP), messaging, spoofed links, and fake websites. It is simple to create a fraudulent website that looks like a real website but is not. Even the information on these websites can be identical to that on the genuine versions. The aim of these websites is to gather user information such as account numbers, login credentials, and debit and credit card passwords. Attackers also pose as high-level security measures and ask users to respond to security questions. Those who respond to such inquiries are more likely to fall victim to phishing scams. Various groups throughout the world have carried out many investigations into stopping phishing attacks. Phishing attacks can be stopped by identifying the fraudulent websites and by educating people to recognize them. Machine learning techniques are among the most effective ways to spot phishing websites.

Machine learning is one of the key techniques underpinning artificial intelligence (AI). It is founded on algorithms designed to comprehend and recognize patterns in massive amounts of data in order to build a system that can predict anomalous behavior and occurrences. It adapts over time as it picks up on typical behavioral tendencies.

2. LITERATURE REVIEW

A literature review is a considered piece of writing that communicates the information currently available on a particular subject, including significant findings and theoretical and methodological contributions.

M. Somesha et al. investigated the architecture of a system that comprises feature collection, feature selection, and classification procedures. A list of website URLs is used as input to the feature collector, which pulls the required features from three sources (URL obfuscation, anchor text, and other sources). The obtained features are then fed into the IG attribute ranking algorithm. The proposed model's drawback is that it depends on external services, which means that if these services are not available, its performance would be limited. Moreover, the suggested model may be unable to identify phishing websites that replace textual content with embedded objects [1].

Erzhou Zhu et al. presented OFS-NN, an effective phishing website detection model based on a neural network and an optimal feature selection method. In this model, an index termed the feature validity value (FVV) is constructed to evaluate the effect of each feature on the identification of such websites. Based on this index, an algorithm is designed to find the optimal features of phishing websites. The selected algorithm greatly reduces the over-fitting problem of the neural network. The neural network is then trained using these optimal features to create a classifier that can identify phishing URLs. The drawback, however, is that OFS-NN must continually gather additional features because the number of features relevant to phishing attempts keeps growing [2].

Mahdieh Zabihimayvan and Derek Doran discussed Fuzzy Rough Set (FRS) theory, which was developed into a tool that selects the best features from a small number of standardized datasets. These features are then passed to a few classifiers in order to detect phishing. The models are trained on a dataset of 14,500 website samples in order to examine FRS feature selection for building a general phishing detection approach. The disadvantage, however, is that the method's unique properties are not stated [3].

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 05 | May 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page436
Vedavyas 4 , H Niveditha5

Peng Yang et al. argued that feature engineering is a crucial part of detecting fake websites, even though the efficiency of such a system depends heavily on the choice of features. The constraint lies in the amount of time it takes to gather these features, even though the features obtained from all of the different aspects are informative. To address this flaw, the researchers proposed a multidimensional feature phishing detection (MFPD) approach, which focuses on fast detection by utilizing deep learning [4].

T. Nathezhtha et al. proposed a three-phase identification system, the Web Crawler based Phishing Attack Detector (WC-PAD), to accurately detect instances of phishing. Phishing and legitimate websites are classified using the website's content, web traffic, and URL as input attributes. The disadvantage, however, is that detection takes time because there are three phases and each website must go through all of them [5].

C. Emilin Shyni et al. discussed how to determine whether a website is real or fake. Their approach uses the Google API to intercept all of the hyperlinks on the current page and builds a parse tree out of the intercepted hyperlinks. Parsing starts at the root node, and the Depth-First Search technique is used to find child nodes that have the same value as the root node. The disadvantage, however, is that both the false positive and false negative rates are high [7].

3. PROBLEM STATEMENT

To design and develop a system that takes less time to detect phishing websites while accurately and effectively classifying websites as real or fraudulent.

4. EXISTING SYSTEM

To stop phishing attacks, a method based on web crawling was created. It employs three phases for detection, using the URL, web traffic, and web content as input elements. The disadvantage is that it takes time because each website must go through all three stages.

A method based on Fuzzy Rough Set (FRS) theory was developed to help users choose the best attributes from a small number of standardized datasets. The disadvantage, however, is that the method's unique properties are not stated.

5. PROPOSED SYSTEM

Machine learning is an innovative and popular technology that has a wide range of applications in society and can handle massive amounts of data with refined and regularly updated algorithms.

The dataset is submitted to the proposed system, where attributes such as IP Address, Tiny URL, URL Length, URL Depth, and others are extracted from it. Among the machine learning techniques, the system uses AdaBoost, Random Forest Classifier, XGBoost, Gradient Boosting Classifier, and Support Vector Machine (SVM). Of these, the Random Forest Classifier most accurately determines whether a website URL is legitimate or fraudulent.
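The attribute extraction step can be illustrated with a minimal Python sketch. The paper does not define the attributes precisely, so everything here is an assumption made for illustration: the function name extract_features, the short list of URL-shortening hosts, the reading of URL depth as the number of non-empty path segments, and the length threshold of 54 characters.

```python
import re
from urllib.parse import urlparse

# Hypothetical list of URL-shortening hosts; a real system would use a longer list.
SHORTENERS = {"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"}

# Matches a host that is written as a dotted IPv4 address instead of a domain name.
IPV4_PATTERN = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def extract_features(url: str, long_url_threshold: int = 54) -> dict:
    """Turn one URL into a small dictionary of numeric features."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # URL depth (assumed definition): number of non-empty path segments.
    depth = len([segment for segment in parsed.path.split("/") if segment])
    return {
        "has_ip_address": int(bool(IPV4_PATTERN.match(host))),
        "is_tiny_url": int(host in SHORTENERS),
        "url_length": len(url),
        "is_long_url": int(len(url) >= long_url_threshold),
        "url_depth": depth,
    }


features = extract_features("http://192.168.0.1/login/verify/account")
```

Each URL in the dataset would be mapped to such a feature row before any of the classifiers is trained.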

6. OBJECTIVES

• To create a model that can distinguish phishing websites from legitimate websites.

• To use different machine learning methods to train the model to produce accurate and effective results.

7. METHODOLOGY

Fig. 1: System Architecture

I. AdaBoost Classifier Algorithm:

The AdaBoost classifier, short for adaptive boosting, is a machine learning ensemble technique. Adaptive boosting is the process of reassigning the weights of each instance, giving larger weights to instances that were incorrectly classified. Boosting is used to reduce bias and variance in supervised learning. It is based on the idea that learners are built sequentially, in stages: except for the first, every learner is created from the previous learner. This iterative construction combines a number of weak classifiers to produce a strong classifier with high accuracy.

AdaBoost must adhere to two requirements:

1. The classifier should be trained iteratively on a variety of weighted training samples.

2. In each iteration, it seeks to provide an excellent fit for these samples by reducing the training error.
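The reweighting idea can be made concrete with a small, self-contained sketch of AdaBoost using decision stumps as weak classifiers. This is an illustration under stated assumptions, not the implementation used in this work: labels are encoded as +1 and -1, the function names are hypothetical, and the single numeric feature in the toy data is made up.

```python
import math


def best_stump(X, y, w):
    """Find the (error, feature, threshold, polarity) stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] <= t else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, polarity)
    return best


def adaboost_fit(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform instance weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, polarity = best_stump(X, y, w)
        err = max(err, 1e-10)              # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, polarity))
        # Misclassified instances get larger weights, correct ones smaller weights.
        preds = [polarity if row[j] <= t else -polarity for row in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Weighted vote of all stumps; the sign of the score is the predicted label."""
    score = sum(a * (pol if row[j] <= t else -pol) for a, j, t, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): no single stump separates it, but three boosted stumps do.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

On this toy data a single stump necessarily misclassifies two points, while the weighted vote of three stumps classifies all six correctly, which is the sense in which weak classifiers combine into a strong one.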

II. XGBoost Classifier Algorithm:

XGBoost stands for Extreme Gradient Boosting. It is an extended gradient boosting library, built on the Gradient Boosting framework, that was developed expressly to improve speed and performance while remaining accurate and versatile. One of the important features of XGBoost is its efficient handling of missing values, which enables it to handle real-world data with missing values without requiring much preprocessing. Moreover, it has built-in parallel processing functionality, making it possible to train models quickly on large datasets. By allowing the fine-tuning of multiple model parameters, it is also very versatile and facilitates performance optimization. XGBoost is widely used because of its ability to work with large datasets and to offer cutting-edge performance in numerous machine learning tasks, including classification and regression.
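One way to picture the handling of missing values is the idea of a learned default direction: at a candidate split, rows whose feature value is missing are tentatively sent left and then right, and the direction with the larger gain is kept. The sketch below is a simplified illustration of that idea in plain Python, not the XGBoost library's code. The function names, the toy data, and the regularization value lam = 1.0 are assumptions, and the complexity penalty on the number of leaves is omitted. Gradients and Hessians are those of the logistic loss at an initial probability of 0.5.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0):
    """Gain of a split computed from summed gradients (g) and Hessians (h)."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))


def best_default_direction(values, grads, hess, threshold, lam=1.0):
    """Try sending rows with a missing value (None) left, then right; keep the better gain."""
    sums = {"left": [0.0, 0.0], "right": [0.0, 0.0], "missing": [0.0, 0.0]}
    for v, g, h in zip(values, grads, hess):
        side = "missing" if v is None else ("left" if v < threshold else "right")
        sums[side][0] += g
        sums[side][1] += h
    (gl, hl), (gr, hr), (gm, hm) = sums["left"], sums["right"], sums["missing"]
    gain_left = split_gain(gl + gm, hl + hm, gr, hr, lam)
    gain_right = split_gain(gl, hl, gr + gm, hr + hm, lam)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)


# Toy data (made up): a URL-length feature with two missing entries; label 1 = phishing.
values = [10, 12, 80, 90, None, None]
labels = [0, 0, 1, 1, 1, 1]
grads = [0.5 - y for y in labels]      # logistic-loss gradient p - y at p = 0.5
hess = [0.25] * len(labels)            # logistic-loss Hessian p * (1 - p) at p = 0.5
direction, gain = best_default_direction(values, grads, hess, threshold=50)
```

In this toy case the rows with missing values are both phishing, so grouping them with the long URLs on the right gives the larger gain, and "right" becomes the default direction for that split.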

Fig. 2: AdaBoost Classifier Algorithm

Fig. 3: XGBoost Classifier Algorithm

III. Random Forest Algorithm:

The Random Forest algorithm, which is far less sensitive to the training data than a single decision tree, consists of a number of random decision trees. The first step in building a Random Forest is creating new datasets from the original data by sampling with replacement; this is known as bootstrapping. A random selection of features is then used to train each tree. Once the decision trees are constructed, a new data point is given to each of them. The next step is to combine the outputs of all the trees. As this is a classification problem, the decision of the majority is followed. Aggregation is the term used to describe the process of combining all decision tree outputs. In a Random Forest, aggregation occurs after bootstrapping.
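The bootstrapping and aggregation steps of a Random Forest can be sketched in a few lines of plain Python. This is a deliberately small illustration, not the implementation used in this work: each tree is reduced to a one-level decision stump, the function names are hypothetical, and the two binary features in the toy data (for example "has IP address" and "is tiny URL") are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) examples with replacement (bootstrapping)."""
    idx = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in idx], [labels[i] for i in idx]


def majority_vote(votes):
    """Aggregation step for classification: the most common label wins."""
    return Counter(votes).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """One-level decision tree restricted to the given random feature subset."""
    default = majority_vote(labels)
    best = None
    for j in features:
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            l_lab = majority_vote(left)
            r_lab = majority_vote(right) if right else default
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    return best[1:]


def fit_forest(rows, labels, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump(sample_rows, sample_labels, features))
    return forest


def predict_forest(forest, row):
    votes = [l_lab if row[j] <= t else r_lab for j, t, l_lab, r_lab in forest]
    return majority_vote(votes)


# Toy data (made up): label 1 = phishing, 0 = legitimate.
rows = [[0, 0]] * 5 + [[1, 1]] * 5
labels = [0] * 5 + [1] * 5
forest = fit_forest(rows, labels)
prediction = predict_forest(forest, [1, 1])
```

Because every tree sees a different bootstrap sample and a different feature subset, an unlucky sample affects only that tree's vote, which is why the forest as a whole is less sensitive to the training data.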

IV. Gradient Boosting Classifier Algorithm:

Gradient boosting is one of the most widely used algorithms in machine learning because it tends to make few errors. Each stage predicts the errors of the current model, and these predictions are then used to reduce those errors.

Unlike in the AdaBoost algorithm, the base estimator in gradient boosting cannot be chosen: the Gradient Boost algorithm's default base estimator, the Decision Stump, is fixed. As with AdaBoost, the number of estimators (n_estimators) of the gradient boosting algorithm can be adjusted. If no value is specified, however, the default value of n_estimators for this algorithm is 100. The algorithm can predict both categorical variables (classification) and continuous variables (regression).
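The stage-wise error correction can be illustrated with a minimal gradient boosting sketch in plain Python. To keep it short it makes several simplifying assumptions: it uses squared error on 0/1 labels instead of a classification loss, a single numeric feature, decision stumps as the fixed base learner, and hypothetical function names. The default n_estimators=100 mirrors the default mentioned in the text, and the learning rate of 0.1 is an assumed value.

```python
def fit_regression_stump(xs, residuals):
    """Best single-threshold split of 1-D inputs under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:     # skip the maximum so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def gradient_boost_fit(xs, ys, n_estimators=100, learning_rate=0.1):
    base = sum(ys) / len(ys)           # initial prediction: the mean label
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_estimators):
        # Residuals are the negative gradient of the squared error.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_regression_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + learning_rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def gradient_boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if x <= t else rm for t, lm, rm in stumps)


# Toy data (made up): x could be a URL depth, label 1 = phishing.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost_fit(xs, ys)
score = gradient_boost_predict(model, 5)   # thresholding the score at 0.5 gives the class
```

Each added stump is fitted to what the current ensemble still gets wrong, so the remaining error shrinks with every stage.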

V. Support Vector Machine (SVM) Algorithm:

Support Vector Machine is a popular algorithm that is used for a wide range of problems. The main purpose of SVM is to create a boundary that divides the feature space into classes so that new data points can be assigned to the correct category. This boundary is called the hyperplane.

Using Hyperplanes and Support vectors in the SVM algorithm:

1. Hyperplane: Many lines or decision boundaries can separate the classes in a multidimensional space, but it is necessary to find the decision boundary that is best for categorizing the data points. This best boundary is known as the SVM hyperplane. The dataset's features determine the hyperplane's dimensions: if there are just two features, the hyperplane is a straight line, and if there are three features, the hyperplane is a two-dimensional plane.

2. Support vectors: The data points or vectors that lie closest to the hyperplane and affect its position are referred to as support vectors.
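The roles of the hyperplane and the support vectors can be illustrated with a short Python sketch. It does not train an SVM; it assumes that a separating hyperplane w · x + b = 0 is already given and shows how that hyperplane classifies points and which points lie closest to it. The weights, the bias, the toy points, and the function names are all hypothetical. With two features, as here, the hyperplane is the straight line x1 + x2 = 5.5.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    return abs(decision_value(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


def support_vectors(w, b, points, tol=1e-9):
    """Points whose distance to the hyperplane equals the smallest distance (the margin)."""
    distances = [distance_to_hyperplane(w, b, p) for p in points]
    margin = min(distances)
    return [p for p, d in zip(points, distances) if d - margin <= tol]


# Toy data (made up): the first three points belong to class -1, the last three to class +1.
points = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
w, b = [1.0, 1.0], -5.5                  # assumed hyperplane, not a learned one
closest = support_vectors(w, b, points)  # [[2, 1], [1, 2], [4, 4]]
```

Moving any of the three closest points would change where the best boundary lies, whereas moving the other points slightly would not, which is why only the closest points are called support vectors.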

8. EXPERIMENTAL RESULTS

To assess whether a given URL is real or phishing, the system is given a dataset of URL attributes including IP Address, URL Length, URL Depth, Tiny URL, etc., and models are trained on this dataset using various machine learning algorithms. The system then processes and analyzes the given URL and reports whether the website is phishing or legitimate.
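To compare trained classifiers, measures such as accuracy, precision, and recall can be computed from their predictions on labelled URLs. The following sketch shows the arithmetic. The function names are hypothetical and the label lists are made-up toy values, not results from this work; label 1 stands for phishing and 0 for legitimate.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels (made up for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Reporting precision and recall next to accuracy shows whether a classifier's mistakes are mostly missed phishing URLs or mostly false alarms on legitimate ones.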

Fig. 4: Random Forest Algorithm

Fig. 5: Gradient Boosting Classifier Algorithm

Fig. 6: SVM Algorithm

Fig. 7: Home Page

Fig. 8: User Register Page

Fig. 9: User filling the Login Page

Fig. 10: Welcome page is displayed when the user logs in

Fig. 11: Load URL Dataset Page

Fig. 12: View URL Dataset Page

Fig. 13: Select Algorithm Model Page

Fig. 14: Accuracy of the Selected Random Forest Algorithm

Fig. 15: Legitimate website URL is given by the user

9. CONCLUSION

This paper covered various machine learning techniques and methods for finding phishing websites. We conclude that the larger part of the work is done using well-known machine learning techniques such as XGBoost Classifier, Random Forest Classifier, etc. Some authors have suggested up-to-date systems for phishing detection, such as Phish Score and Phish Checker. Feature combinations were evaluated in terms of accuracy, precision, recall, etc. Phishing websites are becoming more prevalent every day, so the features that are used to identify them may be added to or replaced with new ones.

10. FUTURE SCOPE

There is always room for improvement in every system. There are further classifiers, such as the Bayesian network classifier and Neural Networks. Such classifiers can be included in the future to give more data for comparison. The project can also be extended to other variants of phishing, such as smishing and vishing, to make the system more complete. Looking even further out, the methodology needs to be evaluated on how it might handle growth of the collection.

REFERENCES

[1] M. Somesha, Alwyn Roshan Pais, Routhu Srinivasa Rao, and Vikram Singh Rathour. “Efficient deep learning techniques for the detection of phishing websites”, June 2020.

[2] E. Zhu, Y. Chen, C. Ye, X. Li, and F. Liu. “OFS-NN: An effective phishing websites detection model based on optimal feature selection and neural network”. IEEE Access, 7:73271–73284, 2019.

[3] Mahdieh Zabihimayvan and Derek Doran. “Fuzzy Rough Set feature selection to enhance phishing attack detection”, 03 2019.

[4] P. Yang, G. Zhao, and P. Zeng. “Phishing website detection based on multi-dimensional features driven by deep learning”. IEEE Access, 7:15196–15209, 2019.

[5] T. Nathezhtha, D. Sangeetha, and V. Vaidehi. “WC-PAD: Web crawling based phishing attack detection”. In 2019 International Carnahan Conference on Security Technology (ICCST), pages 1–6, 2019.

[6] Y. Huang, Q. Yang, J. Qin, and W. Wen. “Phishing URL detection via CNN and attention-based hierarchical RNN”. In 2019 18th IEEE International Conference On Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference On Big Data Science and Engineering (TrustCom/BigDataSE), pages 112–119, 2019.

Fig. 16: The given URL is classified as Legitimate

Fig. 17: Phishing website URL is given by the user

Fig. 18: The given website URL is classified as Phishing

Fig. 19: Accuracy of Algorithms shown in Bar Graph

[7] C. E. Shyni, A. D. Sundar, and G. S. E. Ebby. “Phishing detection in websites using parse tree validation”. In 2018 Recent Advances on Engineering, Technology and Computational Sciences (RAETCS), pages 1–4, 2018.

[8] S. Parekh, D. Parikh, S. Kotak, and S. Sankhe. “A new method for detection of phishing websites: URL detection”. In 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), pages 949–952, 2018.

[9] J. Li and S. Wang. “Phishbox: An approach for phishing validation and detection”. In 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech), pages 557–564, 2017.

[10] H. Shirazi, K. Haefner, and I. Ray. “Fresh-Phish: A framework for auto-detection of phishing websites”. In 2017 IEEE International Conference on Information Reuse and Integration (IRI), pages 137–143, 2017.

[11] A. J. Park, R. N. Quadari, and H. H. Tsang. “Phishing website detection framework through web scraping and data mining”. In 2017 8th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pages 680–684, 2017.

[12] Lisa Machado and Jayant Gadge. “Phishing sites detection based on C4.5 decision tree algorithm”. pages 1–5, 08 2017.

[13] S. Haruta, H. Asahina, and I. Sasase. “Visual similarity-based phishing detection scheme using image and CSS with target website finder”. In GLOBECOM 2017 - 2017 IEEE Global Communications Conference, pages 1–6, 2017.
