International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 09 Issue: 11 | Nov 2022 www.irjet.net p-ISSN:2395-0072
1 Prof. Amar Palwankar, IT Engineering Dept, Finolex Academy of Management and Technology, Ratnagiri, India.
2 Rifah Solkar, IT Engineering Dept, Finolex Academy of Management and Technology, Ratnagiri, India.
3 Afiya Borkar, IT Engineering Dept, Finolex Academy of Management and Technology, Ratnagiri, India.
4 Shreya Khedaskar, IT Engineering Dept, Finolex Academy of Management and Technology, Ratnagiri, India.
5 Pranali Shingare, IT Engineering Dept, Finolex Academy of Management and Technology, Ratnagiri, India.
Abstract - Cybersecurity has recently become a serious concern for computer systems due to the rise in Internet usage. Malicious refers to a desire to cause damage. Different harmful URLs release various forms of malware and attempt to collect user data. The use of internet services to conduct business from home increased and changed as a result of the global lockdown in 2020. As a result, there was a rising number of cybercrimes and significant data losses for businesses. Malicious URLs must be found and threat types must be recognised in order to halt these attacks. Such websites are frequently found using signature-based approaches, and attempts have been made to impose access restrictions on detected malicious URLs using a variety of security tools. In order to increase the effectiveness of classifiers for identifying dangerous websites, this paper suggests leveraging linguistic aspects of the related URLs with Logistic Regression, a supervised machine learning algorithm. The findings demonstrate that the ability to recognise harmful websites based solely on URLs, and to categorise them as spam URLs without depending on page content, would lead to significant resource savings as well as a safe surfing experience for users.
Key Words: Suspicious URL Detection, Machine Learning, Supervised Learning, Logistic Regression, Cybersecurity.
Due to the expansion and promotion of social networking, online banking, and e-commerce, the relevance of the World Wide Web (WWW) has drawn more and more attention. New advancements in communication technology not only open up new possibilities for e-commerce, but they also give attackers new opportunities. These days, millions of such websites can be found online and are sometimes referred to as harmful websites. It was stated that the development of technology led to some strategies to target and con consumers, including spam SMS in social networks, online gambling, phishing, financial fraud, false prize claims, and fake TV shopping (Jeong, Lee, Park, & Kim, 2017).
Resources on the Internet are referred to by their Uniform Resource Locator (URL). The features and two fundamental parts of a URL are described by Sahoo et al. [5]. These are the protocol identifier, which determines the protocol to use, and the resource name, which identifies the IP address or domain name where the resource is located. Each URL has a distinct structure and format. Attackers frequently attempt to alter one or more URL structural elements in an effort to trick users into sharing their malicious URL. Links that hurt people are referred to as malicious URLs. These URLs reroute users to websites or resources where hackers can run malicious software on users' computers, or send users to undesirable or harmful websites.
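The two URL parts described above can be separated with Python's standard `urllib.parse` module. This is only an illustrative sketch: the function name `split_url` and the sample URL are assumptions made for this example and are not part of the system described in the paper.

```python
from urllib.parse import urlsplit

def split_url(url: str) -> dict:
    """Break a URL into its protocol identifier and resource name parts."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,      # protocol identifier, e.g. "https"
        "host": parts.hostname or "",  # domain name or IP address of the resource
        "path": parts.path,            # location of the resource on that host
        "query": parts.query,          # optional parameters after "?"
    }

# Hypothetical sample URL used only for illustration
example = split_url("https://www.example.com/login/index.php?user=1")
print(example)
```

An attacker who alters any one of these parts, for instance by placing a familiar brand name in the path instead of the host, produces a URL that looks legitimate at a glance, which is why each part is worth inspecting separately.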
Attacks that distribute malicious URLs ranked first among the 10 most popular attack strategies in 2019. According to these statistics, the threat level and frequency of attacks using the three primary URL spreading techniques (malicious URLs, botnet URLs, and phishing URLs) are increasing.
Based on the statistics showing a rise in the distribution of malicious URLs over the course of several years, it is obvious that approaches or methods must be studied and practised to identify and stop these bad URLs. The research also presents a novel technique for extracting URL attributes.
There are now two primary tendencies when it comes to the challenge of identifying malicious URLs: malicious URL identification based on indicators or sets of rules, and malicious URL detection based on behaviour analysis approaches. Malicious URLs can be rapidly and precisely detected using an approach based on a collection of markers or criteria. This strategy, however, is unable to identify new dangerous URLs that do not match the specified indications or guidelines. Based on behaviour analysis approaches, the method for identifying malicious URLs uses machine learning or deep learning algorithms to categorise URLs according to their actions. In this study, URLs are categorised according to their properties using machine learning techniques. The publication also has a brand-new URL attribute extraction method.
In our study, URLs are categorised using machine learning algorithms based on their characteristics and behaviours. The properties are unique to the literature and are taken from the static and dynamic behaviours of URLs. The research's key contribution is those newly suggested features. The entire malicious URL detection system uses machine learning algorithms. Logistic Regression is the only supervised machine learning algorithm employed.
In order to learn more about the Malicious Link Detection Systems that are already used to detect suspicious domains, IPs and URLs, we referred to research papers on malicious link detection using machine learning. They gave us information about the different techniques used to detect malware and other breaches, along with their advantages and disadvantages.
[1] Mohammed Alsaedi, a student of the CSE Engineering dept, Fuad A. Ghaleb, a faculty member of University Technology Malaysia, Faisal Saeed from Birmingham City University and Jawad Ahmad from Edinburgh Napier University published "Cyber Threat Intelligence-Based Malicious URL Detection Model Using Ensemble Learning" as an international article in (Sensors 2022, 22, 3373. https://doi.org/10.3390/s22093373). This paper describes a malicious website detection model that was designed and developed on the hypothesis that cyber threat intelligence is an effective and safer alternative for improving the detection accuracy of malicious websites.
[2] Shantanu and Janet B. from the Department of Computer Application, National Institute of Technology, Tiruchirappalli, India, and Joshua Arul Kumar from the Department of ECE, MAM College of Engineering, Tiruchirappalli, India, treated the detection of malicious URLs as a binary classification problem and evaluated the performance of several well-known machine learning classifiers in "Malicious URL Detection", which was published in (International Conference on Artificial Intelligence and Smart Systems (ICAIS) | 978-1-7281-9537-7/20/ © 2021 IEEE).
[3] Zhiqiang Wang, Xiaorui Ren, Shuhao Li, Bingyan Wang, Jianyi Zhang and Tao Yang from Beijing Electronic Science and Technology Institute researched a malicious URL detection model based on deep learning, in which the model uses a word embedding method based on character embedding, in "A Malicious URL Detection Model Based on Convolutional Neural Network", published in (Hindawi Security and Communication Networks, Volume 2021, Article ID 5518528, https://doi.org/10.1155/2021/5518528).
[4] Jino S Ganesh, Niranjan Swarup.V, Madhan Kumar.R and Harinisree.A, P.G. students under the guidance of Prof. Dr. Giri Raj.M of Mechanical Engineering, Vellore Institute of Technology, Tamil Nadu, India, built a system using four different machine learning algorithms, namely logistic regression, decision tree, random forest and multilayer perceptron neural networks, to detect malware and phishing sites in "Machine Learning based Malicious Website Detection", published in (International Journal of Scientific & Engineering Research, Volume 11, Issue 7, July-2020).
[5] Doyen Sahoo, Chenghao Liu and Steven C.H. Hoi from the School of Information Systems, Singapore Management University, described that malicious URL detection plays a critical role for many cybersecurity applications, categorised the existing approaches as Blacklist or Heuristic, and also used an ML approach to classify different kinds of spam and malware in "Malicious URL Detection using Machine Learning: A Survey", published as an international article (Vol. 1, August 2019, https://doi.org/10.1145/nnnnnnn.nnnnnnn).
[6] Ayon Gupta and Sanghamitra Giri, under the guidance of Prof. R. Naresh from the Dept. of CSE, SRMIST, Chennai, India, researched the idea that malicious URLs can be detected in real time by using ML algorithms such as Support Vector Machine and Logistic Regression to train on datasets and detect malicious links, in "Malicious URL Detection System using combined SVM and Logistic Regression Model", published in (International Journal of Advanced Research in Engineering and Technology, IJARET, Volume 11, Issue 4, April 2020).
[7] Cho Do Xuan and Hoa Dinh Nguyen of the Posts and Telecommunications Institute of Technology, Hanoi, Vietnam, and Tisenko Victor Nikolaevich from Peter the Great St. Petersburg Polytechnic University, Russia, described that malicious URLs can be detected using two machine learning algorithms, RF and SVM, by analysing and extracting the static behaviour of URLs, in "Malicious URL Detection based on Machine Learning", published in (IJACSA International Journal of Advanced Computer Science and Applications, Vol. 11, No. 1, 2020).
[8] Last but not least, we referred to a book by Ferhat Ozgur Catak, professor at the University of Stavanger, Kevser Sahinbas from Istanbul Medipol University and Volkan Dortkardes from Turkey, titled "Malicious URL Detection using Machine Learning", which uses the Random Forest and Gradient Boosting ML algorithms to detect malicious URLs and was published in a book (USA, by IGI Global, Engineering Science Reference).
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal
After studying the research and review papers of various authors, we found that many of them have created a system-friendly URL checker or website to analyse suspicious domains, IPs and URLs and to detect malware and other breaches. There are many online websites available for the detection of spam and phishing URLs, which is done by entering the link into their system. Therefore, our system will scan and analyse the URL based on the ML approach and report, on a single click, whether the URL is malicious or not, whether it comes from a social site or an email.
A suspicious link is a malicious URL that is designed to promote virus attacks, fraudulent activities, scams and phishing attacks. The proposed method analyses suspicious URLs and prevents users from being attacked by them.
When a user clicks on an infected URL, malware such as a virus, trojan or ransomware gets downloaded and can take control of the device by compromising the machine. Whenever the user clicks on any link provided in an email or on any social networking site, the trained model will identify whether the link is suspicious or not. The goal is to classify the URLs given as inputs and predict whether they are dangerous or inoffensive.
To build this model, we use a dataset of URLs, stored as text, that are already labelled as either good or bad. We selected BAD as the label for the malicious URLs and GOOD for the legitimate ones, and the model is trained on this labelled dataset. To provide a quick and better view of the data, it is explored and presented to the users in graphical form. For this, data exploration is performed to identify the good and bad URLs using data visualization techniques like bar graphs and pie charts.
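As a rough illustration of the data exploration step, the sketch below counts how many URLs carry each label and prints a text bar per label in place of a plotted bar graph. The rows are hypothetical toy data invented for this example; the actual dataset is not reproduced here, and `label_counts` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical toy rows of (url, label); not taken from the real dataset
rows = [
    ("google.com/search", "good"),
    ("paypal-secure-login.example.net/verify.php", "bad"),
    ("wikipedia.org/wiki/URL", "good"),
    ("198.51.100.7/free-prize.exe", "bad"),
    ("irjet.net/archives", "good"),
]

def label_counts(rows):
    """Count how many URLs carry each label."""
    return Counter(label for _, label in rows)

counts = label_counts(rows)
for label, n in sorted(counts.items()):
    # A row of '#' characters stands in for one bar of a bar graph
    print(f"{label:>4} | {'#' * n} ({n})")
```

Checking the class balance this way before training is useful because a heavily skewed dataset can make plain accuracy a misleading measure later on.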
Feature extraction is then applied to the URLs present in the dataset. The URLs are read one by one to extract features such as suspicious characters, the number of dots and slashes, etc. The technique used in this model for extracting features is "Bag of Words". URLs are composed of words such as the domain name, path, file and extension. Since the learning algorithm works with numerical features, this technique converts the words into numerical vectors. It is done by applying Natural Language Processing.
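The feature extraction described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the tokenisation rule (split on any non-alphanumeric character), the set of "suspicious" characters and all function names are choices made for this example, not details given in the paper.

```python
import re
from collections import Counter

def tokenize(url: str) -> list:
    """Split a URL into lowercase word tokens on any non-alphanumeric character."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]

def count_features(url: str) -> dict:
    """Simple structural counts of the kind listed in the text."""
    return {
        "dots": url.count("."),
        "slashes": url.count("/"),
        # Assumed character set; the paper does not list the exact characters
        "suspicious": sum(url.count(c) for c in "@%=?-_"),
    }

def build_vocabulary(urls) -> dict:
    """Map each distinct token seen in the URLs to a column index."""
    vocab = sorted({tok for url in urls for tok in tokenize(url)})
    return {tok: i for i, tok in enumerate(vocab)}

def bag_of_words(url: str, vocab: dict) -> list:
    """Turn one URL into a vector of token counts over the vocabulary."""
    vector = [0] * len(vocab)
    for tok, n in Counter(tokenize(url)).items():
        if tok in vocab:  # tokens unseen during vocabulary building are ignored
            vector[vocab[tok]] = n
    return vector

# Hypothetical URLs used only for illustration
urls = ["example.com/login", "example.com/login/login.php"]
vocab = build_vocabulary(urls)
print(vocab)
print(bag_of_words(urls[1], vocab))
print(count_features(urls[1]))
```

Each URL becomes a fixed-length numeric vector, which is the form a classifier such as logistic regression needs as input.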
The model used is Logistic Regression, which estimates the probability of a certain class. After initializing the algorithm, we fit it to our training dataset for learning purposes.
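To make the idea of estimating a class probability concrete, here is a small logistic regression written from scratch with batch gradient descent. It is a teaching sketch, not the implementation used in this work: the learning rate, epoch count, toy feature columns and function names are all assumptions, and a library implementation would normally be used in practice.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping a score to a probability in (0, 1)."""
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x) -> float:
    """Probability that feature vector x belongs to the positive (bad) class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fit(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, y):
            err = predict_proba(weights, bias, x) - target
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical toy data: columns are counts of the tokens "login" and "wiki"
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]  # 1 = bad, 0 = good
weights, bias = fit(X, y)
print(predict_proba(weights, bias, [1, 0]))  # probability of "bad" for a login-heavy URL
print(predict_proba(weights, bias, [0, 1]))  # probability of "bad" for a wiki-heavy URL
```

A URL is then labelled BAD when the predicted probability exceeds a threshold, commonly 0.5.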
We divided the dataset into a training set, used to fit the features and feed the model, and a test set. The URL is first checked in the database; if it is present there, it has already been classified as malicious or not. If the URL is not present in the database, it goes through all the operations and the result is provided to the user as malicious or not.
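The split and the database lookup described above might look like the following sketch. The 80/20 ratio, the fixed seed, the in-memory dictionary standing in for the database and the stub classifier are all assumptions made for illustration.

```python
import random

def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

def check_url(url, database, classify):
    """Return the stored verdict if the URL was seen before, else classify and store it."""
    if url in database:
        return database[url]
    verdict = classify(url)
    database[url] = verdict
    return verdict

def stub_classifier(url):
    """Placeholder for the trained model; always answers "good"."""
    return "good"

# Hypothetical rows used only for illustration
rows = [(f"site{i}.example/page", "good" if i % 2 else "bad") for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))

database = {"known-bad.example/x": "bad"}
print(check_url("known-bad.example/x", database, stub_classifier))  # served from the database
print(check_url("new.example/y", database, stub_classifier))        # classified, then stored
```

Storing each verdict means a URL that has been seen once does not need to pass through feature extraction and prediction again.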
The next step is the identification of the URLs. The system will check and shortlist the link based on directories. Standard URLs belong to the whitelist directory, and the blacklist directory includes malicious URLs. Another way of checking whether a URL is malicious is through the filtering process. The model will check for keywords like "com", "www", etc. inside the invalid domain name, and if the domain name contains more than four numbers it is likely to be a malicious URL. The presence of special characters, or of any of some famous domains, in the URL would also indicate malicious content.
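The directory lookup and filtering rules above can be approximated as follows. This is one possible reading of the rules rather than the exact logic of the system: the list contents, the special-character set, the brand names and the "unknown" fallback (meaning the URL should go on to the trained model) are assumptions for this sketch.

```python
from urllib.parse import urlsplit

WHITELIST = {"www.wikipedia.org"}          # hypothetical whitelist entries
BLACKLIST = {"malware.example.net"}        # hypothetical blacklist entries
FAMOUS = ("paypal", "google", "facebook")  # assumed list of famous names
SPECIAL = set("@!$%^*")                    # assumed special characters

def host_of(url: str) -> str:
    """Extract the lowercase host, prepending a scheme if the URL has none."""
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()

def rule_verdict(url: str) -> str:
    """Apply directory lookups first, then the filtering heuristics."""
    host = host_of(url)
    if host in WHITELIST:
        return "good"
    if host in BLACKLIST:
        return "bad"
    if sum(ch.isdigit() for ch in host) > 4:  # more than four numbers in the domain
        return "bad"
    if any(ch in SPECIAL for ch in url):      # special characters anywhere in the URL
        return "bad"
    if any(name in url.lower() for name in FAMOUS):
        # A famous name on a host that is not whitelisted is treated as a decoy
        return "bad"
    return "unknown"  # no rule fired; defer to the trained classifier

print(rule_verdict("https://www.wikipedia.org/wiki/URL"))
print(rule_verdict("http://login12345.example.com/"))
print(rule_verdict("http://paypal.com.account-verify.example/"))
print(rule_verdict("http://irjet.net/archives"))
```

Rules like these are fast but only catch known patterns, which is why they complement the learned model instead of replacing it.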
Then, with the test set, we validate our model with an unbiased evaluation. The model learns from the dataset during the training phase and is then used to make predictions. Good URLs have been correctly predicted as authentic URLs and bad URLs as malicious URLs, and they are labelled GOOD and BAD respectively. The results obtained classify the URLs as good or bad to predict whether they are suspicious or legitimate to use.
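Validation on the test set can be summarised with a confusion matrix and a few derived scores, as in the sketch below. The label lists are hypothetical and do not represent the results of this work; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive="bad"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def accuracy(c):
    """Share of all URLs that were labelled correctly."""
    return (c["tp"] + c["tn"]) / sum(c.values())

def precision(c):
    """Share of URLs flagged as bad that really are bad."""
    return c["tp"] / (c["tp"] + c["fp"]) if (c["tp"] + c["fp"]) else 0.0

def recall(c):
    """Share of truly bad URLs that were flagged."""
    return c["tp"] / (c["tp"] + c["fn"]) if (c["tp"] + c["fn"]) else 0.0

# Hypothetical test labels and predictions, for illustration only
y_true = ["bad", "bad", "good", "good", "good"]
y_pred = ["bad", "good", "good", "good", "bad"]
c = confusion_counts(y_true, y_pred)
print(c, accuracy(c), precision(c), recall(c))
```

For a security tool, recall on the BAD class deserves particular attention, since a missed malicious URL is usually costlier than a false alarm.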
This flowchart shows how the system will proceed and simulate its process for the detection of malicious links.
Following are the screenshots of the results obtained from our system.
The outcome of the training shows how efficiently the algorithm detects malicious URLs, and an algorithm with good accuracy gives us better classification results for any website, whether it is Good or Bad. The result is easy for any user to understand, and it alerts them that they are using a malicious website, which can save them from online scams, online attacks, hacking, phishing and credential theft.
With the evolution of system technology and the rise of the internet, millions of people exchange their information over social sites and also carry out many activities related to their daily life. During these processes, users hold sensitive and critical information such as usernames and passwords, and most networks identify their users with them. Most users are unaware of their saved information, which can be exploited by attackers, and thus they can become victims of such malicious and phishing websites. Therefore, the detection of harmful web pages has become very important to protect the users of the web environment from these threats. To cope with such situations, malicious URL detection plays a critical role for many cybersecurity applications. The techniques used in machine learning are a promising method.
We provided an organised system for malicious URL detection with the help of machine learning. We also provided a detailed explanation of current research on the detection of malicious links, covering the representation of feature sets and the learning algorithms used for detecting malicious URLs. We used the Logistic Regression machine learning algorithm, which is more convenient than Random Forest or Naive Bayes for suspicious link detection. The experimental results of the proposed method indicate that the ML model performs well in processing a large dataset and predicting whether a website is benign or malicious. This indicates that we can quickly build deployable and reliable machine learning models for malicious link detection.
The research team successfully proposed a method where URLs can be used directly to extract features and classify them as good or bad. The study is inspired by this and focuses on this methodology only. Hence, in order to grasp the global features of malicious URLs and extract high-dimensional features from pre-processed data, deep learning algorithms are potentially worthwhile topics for future research that we will take under consideration. Deep learning has become a mainstream approach in malicious URL detection systems these days. Deep learning can automatically extract features, which frees up the time otherwise spent on feature engineering.
Implementing deep learning will not only allow us to process the data faster but will also improve the performance of malicious URL detection. In order to leave behind the drawback of time-consuming and labour-intensive machine learning implementations, which extract shallow features, a deep learning implementation will be preferred.
The future scope is focused solely on the study findings and on the implementation of deep learning in our major project within the available time.
[1] Mohammed Alsaedi, Fuad A. Ghaleb, Faisal Saeed, Jawad Ahmad (2022). Cyber Threat Intelligence-Based Malicious URL Detection Model Using Ensemble Learning. (Sensors 2022, 22, 3373. https://doi.org/10.3390/s22093373).
[2] Shantanu, Janet B, Joshua Arul Kumar R (2021). Malicious URL Detection. (International Conference on Artificial Intelligence and Smart Systems (ICAIS) | 978-1-7281-9537-7/20/ © 2021 IEEE).
[3] Zhiqiang Wang, Xiaorui Ren, Shuhao Li, Bingyan Wang, Jianyi Zhang, Tao Yang (2021). A Malicious URL Detection Model Based on Convolutional Neural Network. (Hindawi Security and Communication Networks, Volume 2021, Article ID 5518528, https://doi.org/10.1155/2021/5518528).
[4] Jino S Ganesh, Niranjan Swarup.V, Madhan Kumar.R, Harinisree.A and Dr. Giri Raj.M (2020). Machine Learning based Malicious Website Detection. (International Journal of Scientific & Engineering Research, Volume 11, Issue 7, July-2020).
[5] Doyen Sahoo, Chenghao Liu, Steven C.H. Hoi (2019). Malicious URL Detection using Machine Learning: A Survey. (Vol. 1, August 2019, https://doi.org/10.1145/nnnnnnn.nnnnnnn).
[6] Ayon Gupta, Sanghamitra Giri, R. Naresh (2020). Malicious URL Detection System using combined SVM and Logistic Regression Model. (International Journal of Advanced Research in Engineering and Technology, IJARET, Volume 11, Issue 4, April 2020).
[7] Cho Do Xuan, Hoa Dinh Nguyen, Tisenko Victor Nikolaevich (2020). Malicious URL Detection based on Machine Learning. (IJACSA International Journal of Advanced Computer Science and Applications, Vol. 11, No. 1, 2020).
[8] Ferhat Ozgur Catak, Kevser Sahinbas, Volkan Dortkardes (2020). Malicious URL Detection using Machine Learning. (USA, by IGI Global, Engineering Science Reference).