Phishing Website Detection Using Machine Learning

Department of Computer Science and Engineering

Priyadarshini College of Engineering

Nagpur, India

Kaustubh Shastrakar

Department of Computer Science and Engineering

Priyadarshini College of Engineering

Nagpur, India

Anjali Sulakhiya

Department of Computer Science and Engineering

Priyadarshini College of Engineering

Nagpur, India

Shrutika Satange

Department of Computer Science and Engineering

Priyadarshini College of Engineering

Nagpur, India

Abstract - Phishing websites, which pose as reputable websites and mislead unwary visitors into disclosing sensitive information, are one of the main sources of security breaches in the modern digital era. A phishing attack is among the easiest ways of obtaining sensitive information from unwitting people. The goal of phishers is to obtain crucial data such as login names, passwords, and bank account information. By taking advantage of a user's vulnerability, an attacker may be able to obtain bank account numbers, passwords to social media accounts, company income reports, and the details of online transactions, to name a few. This research seeks to identify phishing URLs and to determine the most effective machine learning approach based on precision, false-positive rate, and false-negative rate.

I. INTRODUCTION

Phishing is the practice of tricking an individual through electronic communication in order to obtain sensitive data such as usernames, passwords, and credit card numbers. It is generally carried out through email spoofing or instant messaging, which encourage customers to enter personal information on a fake website that looks and feels precisely like the genuine one. It is one of the most harmful illegal activities growing online. Over the past several years, users who rely on the internet for the services it provides have increasingly fallen victim to phishing attacks.

In order to collect sensitive information, the crooks first make unofficial copies of legitimate websites and emails, typically from banking institutions or other businesses that

***

Jay Dhurat

Department of Computer Science and Engineering

Priyadarshini College of Engineering

Nagpur, India

Nikhil Miralwar

Department of Computer Science and Engineering

Priyadarshini College of Engineering

Nagpur, India

deal with financial information. The words and logos of an authentic firm will be used to construct the email. One of the factors contributing to the Internet's fast expansion as a communication medium is the nature of website construction, which also makes it possible to misuse the trademarks, trade names, and other corporate identifiers that customers have come to rely on as means of identification.

The "spoofed" emails are then distributed to as many individuals as possible in an effort to deceive them. When recipients open these emails or click a link in them, they are routed to a fake website that appears to belong to the real company.

Internet users are at risk from a number of cyber threats, such as identity theft, theft of personal information, and financial loss. As a result, internet usage at home and at work can be risky. To reduce security risks, users should be able to recognise and protect against privacy leaks using efficient analytical tools. An information security management system based on artificial intelligence should be used to build efficient systems that can enhance automatic intervention at the moment of an attack.

II. Literature Survey

Phishing is a technique used to steal data, money, or personal information using a false website. The best method for preventing contact with a phishing website is to identify dangerous URLs in real time. Identifying phishing websites depends on their domains. These are often related to the parts of the URL (low-level and upper-level domains, paths, and queries) that need to be registered.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page621

Utilizing distinguishing features taken from the words that make up a URL, together with query data from several search engines such as Google and Yahoo, it is possible to evaluate the relationships among the parts of a URL (intra-URL relationships). These attributes then feed the machine-learning-based classification used to detect phishing URLs from a real dataset.

This research uses phish-STORM to focus on real-time detection of phishing URLs rather than phishing content. To distinguish between phishing and non-phishing URLs, a few relationships between the registered domain and the remainder of the URL are taken into consideration. Lists of known blacklisted URLs are also commonly used to identify phishing websites, although this method is ineffective because phishing websites exist only for a brief period of time. Phishing is the practice of misleading a company's clients into communicating their sensitive information in an unethical manner. It can also be described as the deliberate use of tools such as spam to automatically target victims and collect their private information.

More channels are available for the delivery of malicious messages, since many SMTP flaws serve as exploitation channels for phishing websites. A new heuristic-based feature extraction method for classification was proposed. Its authors grouped the extracted features into several categories, such as URL obfuscation features and hyperlink-based features. The suggested approach achieves 92.5% accuracy. However, the model's performance depends solely on the quantity and quality of the training data and on the feature extraction.

III. Methodology

Our project was created as a website that serves as software for all users. This engaging and responsive website makes it possible to tell whether a website is real or phishing.

React was used to create this website. React is a free and open-source front-end JavaScript library for creating user interfaces based on UI components. It should be emphasised that the website is intended for all users, so it must be simple to use and no user should encounter any difficulties.

The website provides details about the services we offer. It also includes information about unethical behaviours occurring in today's technology world. The website is designed to educate users about these malpractices and to enable them to distinguish between authentic and fake websites. Users can thereby avoid having someone misuse their personal information, such as their email address, password, debit card number, credit card number, CVV number, bank account number, and so on.

1) Dataset Import

Import a dataset from Kaggle.com that contains both genuine and phishing URLs, labelled "0" for trustworthy websites and "1" for malicious websites.
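As a minimal sketch of this step, the snippet below parses a labelled CSV using only Python's standard library. The column names `url` and `label`, the function name `load_dataset`, and the inline sample rows are assumptions made for illustration; the text does not specify the actual Kaggle file or its schema.

```python
import csv
import io

# Hypothetical sample standing in for the Kaggle file; the real column
# names and rows are not given in the paper.
SAMPLE_CSV = """url,label
https://www.example.com/index.html,0
http://192.0.2.10/login//verify,1
"""

def load_dataset(handle):
    """Read (url, label) pairs, with label 0 = trustworthy and 1 = malicious."""
    rows = []
    for record in csv.DictReader(handle):
        label = int(record["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        rows.append((record["url"].strip(), label))
    return rows

dataset = load_dataset(io.StringIO(SAMPLE_CSV))
```

For a file on disk, the same function could be called with an open file handle in place of the in-memory sample.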

2) Data preprocessing

Data preprocessing includes cleaning, instance selection, feature extraction, normalisation, transformation, and so on. The end result of data preparation is the final training dataset. Data preprocessing may influence how the results of the final processing are interpreted. Data cleaning can involve filling in gaps in the data, reducing noise, identifying and eliminating outliers, and resolving inconsistencies. Data integration is a technique for combining multiple databases or data sets. Data transformation is the process of aggregating and normalising data so that it can be measured on a common scale. Data reduction provides a much smaller representation of the dataset that nonetheless leads to the same analytical result.
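Two of the operations named above, cleaning and normalisation, can be sketched in a few lines. This is an illustration under assumptions rather than the authors' pipeline: the helper names, the choice of min-max scaling, and the sample rows are all hypothetical.

```python
def clean_rows(rows):
    """Drop rows with missing values and exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue
        key = tuple(row)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(list(row))
    return cleaned

def min_max_normalise(rows):
    """Rescale every column to the range [0, 1]; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

# Hypothetical numeric rows: one has a missing value, one is a duplicate.
raw = [[10, 2], [30, 2], [None, 4], [10, 2], [20, 6]]
prepared = min_max_normalise(clean_rows(raw))
```

Dropping incomplete rows is only one way to handle gaps; the text also mentions filling them in, which would replace the `continue` on missing values with an imputation step.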

3) Trained ML Model

Google Colab was used to train the model with features such as:

o URL redirection: If the URL path contains "//", the feature is set to 1; otherwise, it is set to 0. The presence of "//" in the URL path means the visitor will be redirected to another website.

o Length of host name: The average length of benign URLs is 25 characters. If the URL is longer than 25 characters, the feature is set to 1; otherwise, it is set to 0.

o URL shortening services ("TinyURL"): With a URL shortening service such as TinyURL, phishers can disguise lengthy phishing URLs as short ones and divert user traffic to fraudulent websites. If the URL is shortened using a service such as bit.ly, the feature is set to 1; otherwise, it is set to 0.

o Existence of @ symbol in URL: If the URL contains the @ sign, the feature is set to 1; otherwise, it is set to 0. When phishers add an @ sign to a URL, the browser ignores everything before the "@" symbol, and the true address frequently follows it.

o Existence of IP address in URL: The feature is set to 1 if the URL contains an IP address; otherwise, it is set to 0. The majority of trustworthy websites never use an IP address as the URL for loading a web page. The use of an IP address in a URL suggests that the attacker is attempting to steal sensitive data.

o Information submission to email: Using the "mail()" or "mailto:" functions, the phisher can send the user's data to his own email address. If the URL contains such functions, the feature is set to 1; otherwise, it is set to 0.

o Number of slashes in URL: The average number of slashes in benign URLs is 5. If the number is higher, the feature is set to 1; otherwise, it is set to 0.

o URL of anchor: This feature is obtained by crawling the URL and its source code. The <a> element specifies the URL of the anchor. The feature is set to 1 if most of the hyperlinks in <a> tags point to another site; otherwise, it is set to 0.
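The URL-based features listed above can be sketched as a single extraction function. This is a minimal illustration, not the authors' code: the function and key names are hypothetical, the length rule is applied to the host name (the text is ambiguous between host name and full URL), the shortener list contains only the two services named in the text, and the anchor feature is omitted because it requires crawling the page.

```python
import ipaddress
from urllib.parse import urlsplit

# Only the two services named in the text; a real list would be longer.
SHORTENER_HOSTS = {"tinyurl.com", "bit.ly"}

def host_is_ip(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def extract_features(url):
    """Map a URL to the binary features described in the list above."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "redirection": int("//" in parts.path),
        "long_host": int(len(host) > 25),  # assumption: rule applied to host
        "shortener": int(host in SHORTENER_HOSTS),
        "at_symbol": int("@" in url),
        "ip_address": int(host_is_ip(host)),
        "email_submit": int("mailto:" in lowered or "mail()" in lowered),
        "many_slashes": int(url.count("/") > 5),
    }

features = extract_features("http://192.0.2.10/login//verify@account/a/b")
```

For the sample URL, the redirection, @ symbol, IP address, and slash-count features are set to 1 and the rest to 0.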

A multilayer perceptron (MLP) was employed. An MLP is a neural network with multiple layers. It is made up of dense, fully connected layers that can transform any input dimension into the required output dimension. To build such a network, neurons are joined so that the outputs of some neurons become the inputs of others.
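The forward pass of such a network can be sketched in plain Python. The layer sizes (7 inputs, 4 hidden units, 1 output), the ReLU and sigmoid activations, and the random untrained weights are assumptions chosen for illustration; the text does not state the architecture that was actually trained.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output unit sees every input."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases for an untrained layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(features, layers):
    """Feed each layer's output into the next; the last layer uses sigmoid."""
    values = features
    for index, (weights, biases) in enumerate(layers):
        last = index == len(layers) - 1
        values = dense(values, weights, biases, sigmoid if last else relu)
    return values[0]

rng = random.Random(0)
layers = [init_layer(7, 4, rng), init_layer(4, 1, rng)]  # untrained weights
score = mlp_forward([1, 0, 0, 1, 1, 0, 1], layers)
```

Because the weights are random, `score` is only a value between 0 and 1 with no predictive meaning; training would adjust the weights so that the output approximates the probability of phishing.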

4) Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an approach to data analysis that offers several methods and is largely graphical, as seen below. It enhances the understanding of a data collection, reveals the underlying structure, extracts key parameters, finds outliers and anomalies, and tests underlying assumptions, for example by means of a heatmap.
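A heatmap in EDA is commonly drawn from a correlation matrix of the features. The sketch below computes such a matrix with the Pearson coefficient; the choice of Pearson correlation, the function names, and the sample columns are assumptions, since the text does not say what the heatmap displays.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Correlate every named column with every other one."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical feature columns plus the 0/1 label, for illustration only.
columns = {
    "at_symbol": [0, 0, 1, 1],
    "ip_address": [0, 1, 0, 1],
    "label": [0, 0, 1, 1],
}
matrix = correlation_matrix(columns)
```

Each cell of `matrix` would correspond to one coloured cell of the heatmap, so features that correlate strongly with the label stand out.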

5) Develop API and host it on render.com

API stands for Application Programming Interface, which is a set of definitions and protocols for building and integrating application software.

APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols.

The source code for our API is at https://github.com/profmoriarty/fishyapi and it is hosted at render.com.

Render is a unified cloud to build and run all your apps and websites with free TLS certificates, a global CDN, DDoS protection, private networks, and auto deploys from Git. Render deploys the API from GitHub and hosts it on its servers.
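To illustrate what a scanning endpoint of this kind might look like, here is a sketch built only on Python's standard library. It is not the authors' implementation: the framework their API uses is not stated, and the `url` query parameter, the JSON fields, and the placeholder `score_url` function are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlsplit

def score_url(url):
    """Placeholder for the trained model; always returns 0.0 here."""
    return 0.0

def scan(query_string):
    """Turn a query string such as 'url=...' into a (status, body) pair."""
    params = parse_qs(query_string)
    if "url" not in params:
        return 400, {"error": "missing 'url' parameter"}
    target = params["url"][0]
    return 200, {"url": target, "score": score_url(target)}

class ScanHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Delegate to scan() so the logic can be tested without a socket.
        status, body = scan(urlsplit(self.path).query)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8000):
    """Start the server; call this explicitly, it blocks until interrupted."""
    HTTPServer(("", port), ScanHandler).serve_forever()
```

Keeping the request logic in `scan` separate from the HTTP handler means the same function could be reused behind any web framework or hosting platform.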

6) Develop a frontend with React and host it on Github Pages

The frontend is made in React. React is a free and open-source front-end JavaScript library for building user interfaces based on UI components.

The frontend for this project is hosted on Github Pages at https://prof-moriarty.github.io/fishy0/

7) Working

A URL is entered into the search field, and the Scan button is then clicked. The URL is then forwarded to the render.com API, where it is scanned and examined using the model created before.

The input URL is scanned and classified as real or bogus using a model trained with a multilayer perceptron. After being scanned, the input URL is given a likelihood score (as a percentage). The cutoff for this likelihood score is 70%. The URL is very likely to lead to a phishing attempt if the score is higher than 70%. The URL is more likely to be secure if the score is under 70%.
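The 70% cutoff can be expressed as a small decision function. The names below are hypothetical, and the handling of a score of exactly 70% is an assumption, since the text only covers scores above and below the cutoff.

```python
PHISHING_THRESHOLD = 70.0  # percent, as stated in the text

def verdict(score_percent):
    """Map a likelihood score in percent to a label."""
    if not 0.0 <= score_percent <= 100.0:
        raise ValueError("score must be between 0 and 100")
    # Assumption: a score of exactly 70% is treated as not exceeding the cutoff.
    return "phishing" if score_percent > PHISHING_THRESHOLD else "likely safe"

def describe(url, probability):
    """Format a model probability in [0, 1] as a percentage with a verdict."""
    score = round(probability * 100.0, 1)
    return f"{url}: {score}% ({verdict(score)})"

message = describe("http://192.0.2.10/login", 0.83)
```

Raising the threshold would reduce false alarms on legitimate sites at the cost of missing more phishing URLs, and lowering it would do the reverse.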

IV. Conclusion

Phishing is a type of criminal behaviour that uses social engineering methods in the computing industry. The main goal of early phishing attempts was to gain access to victims' AOL accounts, or occasionally to steal credit card information for fraudulent use. The majority of phishing techniques include some kind of technical deception designed to make a link in an email seem to belong to the company being impersonated. Significant security challenges remain in reducing the number of unprotected PCs that feed botnets, combating the rise in spam email, stopping organised crime, and warning Internet users about the dangers of social engineering. The objective of future work is to create an unsupervised deep learning technique that can extract knowledge from URLs. The research can also be extended to obtain results for a larger network while maintaining an individual's right to privacy.

According to our research, web phishing can be identified by utilising a classifier based on a multilayer perceptron machine learning model. Our study demonstrates that classifiers perform better when more features are used as training data and when the most important characteristics are used as training data. Currently, we have classifiers that detect phishing websites with high accuracy. The challenge in this area is that scammers will continue to improve the URLs and design of phishing websites so that they resemble legitimate websites. It is thus vital to enhance current features and add new ones for phishing detection.

8) Comparison of different algorithms by accuracy
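The comparison data for this step is not reproduced in the text. As a sketch of how the criteria named in the abstract (precision, false-positive rate, and false-negative rate) and accuracy could be computed for any classifier, the snippet below derives them from true and predicted labels. The function names and the sample labels are hypothetical and do not represent results from this study.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(numerator, denominator):
    return numerator / denominator if denominator else 0.0

def evaluate(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": safe_div(tp, tp + fp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
    }

# Made-up labels for illustration only; not data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = evaluate(y_true, y_pred)
```

Running `evaluate` once per candidate algorithm on the same held-out labels would give directly comparable figures.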
