MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID

Bindu P1, Chandana K S2, Ranjith U3, Chandanraj R J4

Professor Krupa K S, Dept. of Information Science and Engineering, Global Academy of Technology, Karnataka, India

Abstract - Smartphones have become indispensable in modern life as a result of their extensive use in recent years. Users increasingly keep critical data on their mobile devices, and this privacy-related data is the main focus of attackers. As a result, attackers constantly develop new methods to steal data from users' devices, and several anti-malware solutions have been created to protect users' confidential information from intruders. We classify many recent anti-malware techniques according to how they detect malware. Our goal is to present a clear and brief overview of malware detection and defence procedures. In this study, we present an ANN- and SVM-based technique to distinguish malicious apps from benign ones.

Key Words: Android Malware, Smartphones, Machine learning, SVM, ANN

1. INTRODUCTION

Mobile phones and other smart devices are vital in people's lives due to their features and scalability. They interact with people in a variety of settings, including the workplace, entertainment, and money management. Smartphones are a major target for attackers since they are where people save their important data. Every day, new methods for attackers to obtain data from a smartphone are developed. Android malware is software or code created specifically to interfere with, harm, or intrude into a system. Another name for it is malicious software.

In the area of IT security, malware is a serious hazard and source of concern. Through privilege escalation, tariff theft, remote control, and privacy leakage, it poses a threat to system and user security.

Because the system allows users to install unlicensed or unapproved software, it is easily targeted by malware. Malware detection for Android devices therefore becomes crucial.

With this project, we offer a workable method for identifying malicious mobile apps. Although third-party application markets make it convenient to design and publish Android applications, these markets face difficulties and do not guarantee smartphone security. The drawback of signature-based malware detection technology is that it cannot identify malware that is not already known to it.

1.1 Malware Classification

Malware is a software program that enters a user's device without permission and is intended to cause damage or steal personal information.

Virus: Viruses sneak into the system and infect it or spread throughout it by attaching themselves to other executables. "Creeper," one of the earliest viruses, was created in 1971 as an experimental programme by Bob Thomas at BBN. As it did not damage data, it was not actively malicious software.

Trojans: Trojans are malicious files that masquerade as helpful files in order to gain access to a system.

Worms: A type of malware that can replicate itself automatically on various PCs and other devices and propagate across the internet. Worms can spread without becoming a part of other programmes.

Spyware: Spyware is a potentially unwanted programme (PUP) that aims to steal personal information (such as user passwords or bank account information) and Internet usage data without the user's knowledge or consent.

Adware: Adware is a type of spyware. However, adware does not aim to harm the PC. Its primary goal is to capture a user's interest so that related advertising, emails, and pop-ups can be shown to them.

Ransomware: Ransomware is a type of malicious software in which an attacker encrypts all of the victim's files and demands payment to decrypt them.

1.2 Malware Features

Malware identification has two stages: feature extraction and classification/clustering. Malware features are the data that machine learning algorithms take as input. They define the space of concepts that a machine learning model can represent. Malware features can be divided into static and dynamic groups. Static features are those that can be extracted without executing the malware; they are derived from malware binaries through static analysis. Dynamic features are those obtained after a binary file has been executed; they are extracted through dynamic analysis. Malware is then classified or clustered using machine learning methods. Only classification techniques are described in this paper. Some of the features used for malware detection are described in this section.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 03 | Mar 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page945

1.2.1 N-Gram

The n-gram is one of the more typical static features and consists of n consecutive bytes taken from the output of a hexdump. The most widely used variant, the 4-gram, takes combinations of 4 bytes. N-grams can be extracted in an overlapping or a non-overlapping fashion. With overlapping extraction, a byte that has already been used in one gram can be used again in the following gram; with non-overlapping extraction, each byte belongs to exactly one gram.
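As an illustration of the two extraction modes, the following minimal sketch extracts byte n-grams from a byte string. The function name `byte_ngrams` and the six sample bytes are assumptions made for this example and do not come from any particular tool.

```python
def byte_ngrams(data: bytes, n: int = 4, overlapping: bool = True) -> list[bytes]:
    """Return the n-grams (runs of n consecutive bytes) found in `data`."""
    # Overlapping grams advance one byte at a time; non-overlapping grams
    # advance by n bytes, so no byte is shared between two grams.
    step = 1 if overlapping else n
    return [data[i:i + n] for i in range(0, len(data) - n + 1, step)]


# Six arbitrary bytes standing in for the hexdump of a binary (illustrative only).
sample = bytes.fromhex("4d5a90000300")

overlapping_grams = byte_ngrams(sample, n=4, overlapping=True)
disjoint_grams = byte_ngrams(sample, n=4, overlapping=False)
```

With six input bytes, the overlapping mode yields three 4-grams, while the non-overlapping mode yields a single 4-gram and drops the two trailing bytes that do not fill a complete gram.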

1.2.2 Opcodes

Opcodes refer to the various machine-level operations carried out by executables. These opcodes can be found in the assembly code.

1.2.3 Strings

A string is defined as a sequence of printable characters. It has been found that the headers in the PE format contain plain text that can be used to retrieve data. Non-PE executables also contain encoded strings whose information can be used. A stricter definition states that strings can only be used if they are "interpretable" and make some sort of semantic sense.
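A minimal sketch of string extraction, under the assumption that a "string" is a run of at least four printable ASCII characters, is given below. The minimum length, the function name `printable_strings`, and the sample bytes are illustrative choices only.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of at least `min_len` printable ASCII characters."""
    # 0x20-0x7e is the printable ASCII range (space through tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Made-up binary content: printable runs separated by non-printable bytes.
blob = b"\x00\x01MZ\x90\x00http://example.test/payload\x00\xff\xfeabc\x00cmd.exe /c\x00"
found = printable_strings(blob)
```

Runs shorter than the minimum length, such as "MZ" and "abc" here, are discarded. Deciding which of the remaining strings are "interpretable" in the semantic sense described above would need a further filtering step that this sketch does not attempt.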

1.2.4 Memory Access

Primary memory is used to store a lot of information, including configuration, network activity, and Windows registry keys. Thus, by looking at how memory is utilized, crucial information about malware can be gleaned.

1.2.5 API Calls

The Application Programming Interface (API) acts as a link between the operating system and applications. Some tasks, such as writing a file to disk, can only be completed directly by the operating system, and the system library provides a collection of functions for them. A call made by an application to the system library is known as an API call. For instance, the CopyFileW API is used when a file needs to be copied. API calls are a crucial component of malware detection and are used by many researchers. A recorded sequence of API calls is termed an API trace. An API trace can reveal advanced malware behaviors such as "walking through folders" and "copying itself to disk".
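To make the idea of matching a behavior against an API trace concrete, the sketch below checks whether a given sequence of calls occurs, in order, inside a recorded trace. The trace and the two-call "self-copy" pattern are hypothetical illustrations built from Windows API names; they are not a validated detection rule.

```python
from typing import Iterable, Sequence


def contains_in_order(trace: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if the calls in `pattern` occur in `trace` in that order (gaps allowed)."""
    remaining = iter(trace)
    # Each any() consumes the shared iterator up to the next match, so later
    # pattern entries can only match calls that come after earlier ones.
    return all(any(call == wanted for call in remaining) for wanted in pattern)


# Hypothetical API trace recorded from a running sample.
trace = ["CreateFileW", "GetModuleFileNameW", "ReadFile", "CopyFileW", "CloseHandle"]

# Assumed pattern for "copying itself to disk": look up own path, then copy a file.
SELF_COPY = ["GetModuleFileNameW", "CopyFileW"]

copies_itself = contains_in_order(trace, SELF_COPY)
```

A real system would typically feed such traces, or features derived from them, into a classifier instead of relying on hand-written patterns.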

2. LITERATURE SURVEY

[1] “Android Malware Detection through Machine Learning Techniques: A Review”: The authors of this article examined a variety of methodologies, including Random Forest, SVM, and Decision Tree, and determined that both high accuracy and efficiency can be attained.

[2] “On building machine learning pipelines for Android malware detection: a procedural survey of practices, challenges and opportunities”: The authors of this article presented a novel procedural taxonomy for ML-based Android malware detection. They discussed the sources of malicious and benign APKs as well as the kinds of static and dynamic features that researchers have extracted from them. They also looked into how to eliminate the less useful features and how ML systems can be used for classification.

[3] “Analysis of Android Malware Detection Techniques: A Systematic Review”: The authors of this paper provided a comparative analysis of various Android mobile malware detection methods. Through a critical analysis, this research identified the weaknesses and advantages of each detection method. The findings support the claim that detection techniques created for Android malware do not always result in 100% accurate detection.

[4] “Detection of Android Malware using Machine Learning and Deep Learning Review”: In this paper, experiments were conducted to select permission- and API-related features for machine learning. According to the findings, automated feature selection with an evolutionary (genetic) algorithm was more advantageous than selection by information gain. Even though performance with the selected attributes was generally reduced by less than 3%, the genetic algorithm still beats non-selection in terms of model generation time.

[5] “Android Malware Detection Using Category-Based Machine Learning Classifiers”: The authors suggest that category-based machine learning classifiers boost the effectiveness of classification. Machine learning algorithms are used to train classifiers with features of malicious apps and to construct models capable of detecting dangerous patterns in the static analysis of Android malware. For each app category, the authors create a profile of the feature set of the top-rated applications in that category. To determine whether an app has benign characteristics, its features are compared with those required to provide the functionality of the category to which the app belongs.

[6] “Android Mobile Malware Detection Using Machine Learning: A Systematic Review”: According to the authors, classification accuracy is improved by using category-based machine learning models. To build models that can identify potentially hazardous patterns in the static analysis of Android malware, machine learning algorithms have been used to train classifiers with features of malicious apps. The authors compile a list of features for the most popular applications in each category. By comparing an app's features with those needed to provide the functionality of the category to which it belongs, one can determine whether it has benign characteristics.

[7] “An Android Malware Detection Leveraging Machine Learning”: The authors investigated the effectiveness of four machine learning algorithms that try to identify malware based on permissions and action repetition, that is, on static and dynamic features. They divided the work into three phases. The findings show that classification precision and accuracy were both very high. Therefore, using static analysis alone for classification should be effective and inexpensive.

[8] “A Review of Android Malware Detection Approaches based on Machine Learning”: This paper studies Android malware detection. The authors used supervised learning techniques such as Multinomial NB, Random Forest, and SVM. In addition to execution time, metrics such as precision, recall, and F-measure are used to assess the outcomes. Compared with the other models, the SVM model performed better in terms of processing time.

[9] “A Survey of Android Malware Static Detection Technology Based on Machine Learning”: In this paper, the authors used a vertical comparison technique to analyse the algorithm models, fundamental concepts, datasets, and performance metrics of the existing methods. In comparison, the machine learning-based static detection technique has advantages in accuracy and requires fewer detection experts.

[10] “Machine-Learning based analysis and classification of Android malware signatures”: This study examined 259,608 malware signatures that were detected in a variety of Android apps. To allow cross-engine analysis, the signatures were normalised into a common namespace using the Signature Miner tool. The malware signatures were then examined by grouping the families into three categories, according to the hazard and nature of each threat: Adware, Harmful, and Unknown.

3. EXISTING METHODOLOGY

After reviewing the contributions made by different authors in the literature survey, we found that malware detection in existing systems is done using either a signature-based method or a heuristic method. Training of the existing models is primarily based on simple classification algorithms, which makes it difficult to fit the dataset more accurately. In addition, new malware whose signatures are not yet in malware repositories is not recognised, so these systems fail to identify it, which can result in high false negative rates.
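The limitation of signature matching can be sketched as follows, under the simplifying assumption that a signature is the SHA-256 hash of the whole file. Real anti-malware engines use richer signatures, and the sample byte strings here are made up for illustration.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest used here as a (simplified) signature."""
    return hashlib.sha256(data).hexdigest()


# Made-up file contents standing in for a catalogued sample and a new variant.
known_sample = b"bytes of a previously catalogued malicious apk"
new_variant = known_sample + b"\x00"  # a one-byte change yields a new hash

# Hypothetical signature repository containing only the catalogued sample.
signature_db = {sha256_of(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Signature lookup: flags a file only if its hash is already in the repository."""
    return sha256_of(data) in signature_db
```

The catalogued sample is flagged, whereas the slightly modified variant is not, because its signature is absent from the repository. This is the gap that learning-based detection aims to close.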

4. PROPOSED METHODOLOGY

The proposed approach determines whether an application is malware or not by classifying it with a suitably tuned machine learning algorithm. In our project, we use grid search with SVC to fine-tune the model. We primarily concentrate on increasing the accuracy of the classification of the Android applications in the provided dataset using grid search.

A. Dataset Creation

We first download the dataset of benign and malicious software. The collection includes Adware, Ransomware, Scareware, PremiumSMS, and SMS Malware apks as well as Benign-2015, Benign-2016, and Benign-2017 apks. The downloaded dataset is then unzipped, and malicious and benign folders are created for the corresponding applications. Adware, Ransomware, Scareware, PremiumSMS, and SMS Malware form the sources of malware, while Benign-2015, Benign-2016, and Benign-2017 form the sources of benign software.
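The sorting of dataset sources into the two folders can be sketched as a simple lookup. The exact directory names used below ("SMSMalware", "Benign-2015", "malicious", "benign") are assumptions about how the unzipped dataset is laid out and should be adapted to the actual archive.

```python
# Assumed names of the unzipped source directories (hypothetical spelling).
MALWARE_SOURCES = {"Adware", "Ransomware", "Scareware", "PremiumSMS", "SMSMalware"}
BENIGN_SOURCES = {"Benign-2015", "Benign-2016", "Benign-2017"}


def target_folder(source: str) -> str:
    """Map a dataset source directory to the folder its apks are moved into."""
    if source in MALWARE_SOURCES:
        return "malicious"
    if source in BENIGN_SOURCES:
        return "benign"
    raise ValueError(f"unknown dataset source: {source!r}")


# Example: plan where each source directory's apks should go.
plan = {source: target_folder(source) for source in sorted(MALWARE_SOURCES | BENIGN_SOURCES)}
```

Raising an error for unknown sources, instead of silently assigning a default label, avoids mislabelled training data if the archive layout changes.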

B. Data Pre-processing

The numbers of malicious and benign applications are then counted, and both counts are returned. Next, we use the androguard tool to extract the permissions of each application, keeping only the entries that begin with "permission". The file "permissions.txt" contains all the permissions that have been extracted from all applications. We then check each apk individually against this list. A permission is marked as "1" if the apk uses it; otherwise, it is marked as "0". In the CSV file, the columns holding these 0s and 1s are labelled with the permission names, and the final column gives the class, benign or malicious. The CSV file containing this preprocessed information is kept for use in model training.
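The encoding step can be sketched as below. To keep the example self-contained, the permission sets are hard-coded hypothetical values standing in for what the extraction tool would return, and the file names and the "class" column name are assumptions.

```python
import csv
import io

# Hypothetical extraction output: (apk name, permissions requested, class label).
apps = [
    ("benign_app.apk", {"android.permission.INTERNET"}, "benign"),
    ("malicious_app.apk",
     {"android.permission.INTERNET", "android.permission.SEND_SMS"}, "malicious"),
]


def permission_vocabulary(apps) -> list[str]:
    """Collect every permission seen in any app (the role of permissions.txt)."""
    return sorted({perm for _, perms, _ in apps for perm in perms})


def to_csv(apps, vocabulary) -> str:
    """Write one 0/1 row per apk, with the class label as the final column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(vocabulary + ["class"])
    for _, perms, label in apps:
        writer.writerow([1 if perm in perms else 0 for perm in vocabulary] + [label])
    return buffer.getvalue()


vocabulary = permission_vocabulary(apps)
table = to_csv(apps, vocabulary)
```

Each row of the resulting table is a fixed-length binary vector, which is the form of input that both the SVM and the ANN expect.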

C. Training the model

The model is developed using a Support Vector Machine (SVM) and an artificial neural network (ANN). The models are trained on the labelled examples to learn the required weights and biases. The ANN is built with binary cross-entropy as the loss function, and only the epochs that give better results than the previous best epoch are kept. The model has been trained over 100 epochs. If there is no improvement in accuracy over a set number of epochs, training ends early.
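The two training mechanisms mentioned here, the binary cross-entropy loss and stopping when accuracy no longer improves, can be illustrated without a deep learning framework. The sketch below is a plain-Python illustration under assumed names (`binary_cross_entropy`, `best_epoch_with_early_stopping`, `patience`) and made-up accuracy values; it is not the training code used in this work.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def best_epoch_with_early_stopping(val_accuracies, patience=3):
    """Return (best_epoch, last_epoch_run), both 0-indexed.

    An epoch is kept as the best only if it beats every earlier epoch;
    training stops once `patience` consecutive epochs bring no improvement.
    """
    best_epoch, best_acc, waited = -1, float("-inf"), 0
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc, waited = epoch, acc, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_accuracies) - 1


# Made-up validation accuracies, one per epoch.
history = [0.70, 0.80, 0.79, 0.80, 0.78, 0.90]
best, stopped = best_epoch_with_early_stopping(history, patience=3)
```

With this history and a patience of three, training stops at epoch 4 and the weights from epoch 1 are kept, so the later value of 0.90 is never reached. This shows the trade-off involved in choosing the patience.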

D. Evaluation of the model against the validation set and predicting the accuracy and loss

The model's performance is measured by evaluating it on the validation and test datasets. The F1 score is computed. Finally, the accuracy and loss are calculated. The model is then saved so that the front end built with Flask can use it on submitted apk files.
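For reference, accuracy and the F1 score can be computed from true and predicted labels as in the following minimal sketch. The label vectors are made-up values, and treating label 1 as "malicious" is an assumption of this example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

In this example there are three true positives, three true negatives, one false positive, and one false negative, so accuracy, precision, recall, and F1 all equal 0.75.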

E. Fine-tuning to find the best possible accuracy.

In this section, we attempt to modify the model in a few ways that help improve its overall accuracy. For the SVM, we have used SVC with the grid search technique, which searches for the best parameters of the 'rbf' kernel to fine-tune the model. Running the model with the tuned parameters improves its accuracy.
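A minimal sketch of such a grid search with scikit-learn is given below. The parameter grid, the three-fold cross-validation, and the synthetic permission matrix with its toy labelling rule are all assumptions made for illustration; they are not the values or data used in this work.

```python
import itertools

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the permission matrix: every combination of 5 binary
# permissions (32 rows). Toy labelling rule (an assumption): an app counts as
# malicious (1) when it requests permissions 0 and 1 together.
X = [list(bits) for bits in itertools.product([0, 1], repeat=5)]
y = [1 if row[0] and row[1] else 0 for row in X]

# Candidate values for the two main RBF-kernel hyperparameters (illustrative).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

# Every (C, gamma) pair is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

After fitting, `search.best_estimator_` holds an SVC refitted on all of the data with the selected parameters, which is the model that would then be saved for the front end.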

F. Implementation of a front-end application to determine whether a particular apk contains malware or not.

After being saved, the model is deployed in a front-end application. To submit the apk file for which the prediction is to be made, a user-friendly and interactive website is created using web technologies such as HTML, CSS, and the Flask framework. The prediction is then shown to the user.
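A minimal sketch of the upload endpoint of such a Flask application is shown below. The route name `/predict`, the form field name `apk`, and the `predict_apk` function are hypothetical; in particular, `predict_apk` is only a placeholder for the permission extraction and saved model described above and always returns the same label.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def predict_apk(apk_bytes: bytes) -> str:
    """Hypothetical placeholder for feature extraction plus the saved model.

    A real implementation would extract the permission vector from the apk
    and pass it to the trained classifier; this stub returns a fixed label.
    """
    return "benign"


@app.route("/predict", methods=["POST"])
def predict():
    # The apk is expected as a multipart/form-data upload in the "apk" field.
    upload = request.files.get("apk")
    if upload is None or upload.filename == "":
        return jsonify(error="no apk file submitted"), 400
    label = predict_apk(upload.read())
    return jsonify(filename=upload.filename, prediction=label)
```

An HTML form with a file input named `apk` that posts to `/predict` would complete the front end. The JSON response could equally be rendered into a result page.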

5. CONCLUSIONS

On the Android platform, researchers have looked into the use of ML for automated malware detection. Machine learning requires building an intricate, multi-staged workflow. Because of this, it has been challenging for new researchers to understand the state of the art in this area. Over time, there has been a sharp rise in smartphone usage, and malware writers have used the significant increase in Android users to target and harm them. This paper reviewed current frameworks and algorithms for machine learning-based malware detection. The best result was 100% accuracy with a 0% false alarm rate. The detection rates were 83% and 90%, respectively. To identify malware with high accuracy and detection rate, a good machine-learning algorithm is needed, and combining a few effective methods and algorithms can further raise efficiency, accuracy, and detection rates.

REFERENCES

[1]. Abikoye, Oluwakemi & Gyunka, Benjamin & Oluwatobi, Akande. (2020). Android Malware Detection through Machine Learning Techniques: A Review. International Journal of Online and Biomedical Engineering (iJOE). 16. 14. 10.3991/ijoe.v16i02.11549.

[2]. Koushki, Masoud & Abualhaol, Ibrahim & Durai Raju, Anandharaju & Zhou, Yang & Giagone, Ronnie & Shengqiang, Huang. (2022). On building machine learning pipelines for Android malware detection: a procedural survey of practices, challenges and opportunities. Cybersecurity. 5. 16. 10.1186/s42400-022-00119-8.

[3]. Ashawa, Moses & Morris, Sarah. (2019). Analysis of Android Malware Detection Techniques: A Systematic Review. International Journal of Cyber-Security and Digital Forensics. 8. 177-187. 10.17781/P002605.

[4]. Joshi, Prof. (2022). Detection of Android Malware using Machine Learning and Deep Learning Review. International Journal of Recent Technology and Engineering (IJRTE). 11. 134-139. 10.35940/ijrte.A6963.0511122.

[5]. Ali, Huda & Oh, Tae & Fokoue, Ernest & Stackpole, Bill. (2016). Android Malware Detection Using Category-Based Machine Learning Classifiers. 54-59. 10.1145/2978192.2978218.

[6]. Senanayake, Janaka & Kalutarage, Harsha & Al-Kadri, M. Omar. (2021). Android Mobile Malware Detection Using Machine Learning: A Systematic Review. Electronics. 10. 1606. 10.3390/electronics10131606.

[7]. Shatnawi, Ahmed & Jaradat, Aya & Bani Yaseen, Tuqa & Taqieddin, Eyad & Al-Ayyoub, Mahmoud & Mustafa, Dheya. (2022). An Android Malware Detection Leveraging Machine Learning. Wireless Communications and Mobile Computing. 2022. 1-12. 10.1155/2022/1830201.

[8]. Liu, Kaijun & Xu, Shengwei & Xu, Guoai & Zhang, Miao & Sun, Dawei & Liu, Haifeng. (2020). A Review of Android Malware Detection Approaches Based on Machine Learning. IEEE Access. PP. 1-1. 10.1109/ACCESS.2020.3006143.

[9]. Wu, Qing & Zhu, Xueling & Liu, Bo. (2021). A Survey of Android Malware Static Detection Technology Based on Machine Learning. Mobile Information Systems. 2021. 1-18.

[10]. Martín, Nacho & Hernández, José & Santos, Sergio. (2019). Machine-Learning based analysis and classification of Android malware signatures. Future Generation Computer Systems. 97. 10.1016/j.future.2019.03.006.


Fig - 1: Process Flow Diagram
