MACHINE LEARNING APPROACH TO LEARN AND DETECT MALWARE IN ANDROID
Bindu P1, Chandana K S2, Ranjith U3, Chandanraj R J4, Professor Krupa K S, Dept. of Information Science and Engineering, Global Academy of Technology, Karnataka, India
Abstract - Smartphones have become indispensable in modern life as a result of their extensive use in recent years. New solutions have been developed that allow users to keep critical data on their mobile devices. Attackers' main focus, however, is on this privacy-related data. As a result, attackers constantly develop new methods to steal data from users' devices. To protect users' confidential information from intruders, several antimalware solutions have been created. We classify many recent antimalware techniques based on how they detect malware. Our goal is to present a clear and brief overview of malware detection and defence procedures. In this study, we provide an ANN- and SVM-based technique to distinguish malicious apps from benign ones.
Key Words: Android Malware, Smartphones, Machine learning, SVM, ANN
1. INTRODUCTION
Mobile phones and other smart devices are vital in people's lives due to their features and scalability. They interact with people in a variety of settings, including the workplace, entertainment, money management, etc. Smartphones are the major target for attackers since they are where people save their important data. Every day, new methods for attackers to obtain data from a smartphone are developed. Android malware is software or code created specifically to interfere with, harm, or intrude into a system. It is also called malicious software.
In the area of IT security, malware is a serious hazard and source of concern. Through privilege escalation, tariff theft, remote control, and privacy leakage, it poses a threat to system and user security.
The system is easily targeted by malware because it allows users to install unlicensed or unapproved software. Malware detection for Android devices has therefore become crucial.
With this project, we offer a workable method for detecting malicious mobile apps. Although designing and publishing Android applications is convenient, the market for third-party applications faces difficulties and does not guarantee smartphone security. The drawback of signature-based malware detection technology is that it cannot identify malware that is not yet known.
1.1 Malware Classification
Malware is a software program that enters a user's device without permission and is intended to cause damage or steal personal information.
Virus: Viruses sneak into a system and infect it or spread throughout it by attaching themselves to other executables. "Creeper," one of the earliest viruses, was created in 1971 as a test run for a programme designed by Bob Thomas at BBN. Because it did not damage data, it was not actively malicious software.
Trojans: Trojans are malicious files that masquerade as helpful files in order to gain access to a system.
Worms: A sort of virus that can duplicate itself automatically on various PCs and devices and propagate across the internet. Worms can spread without becoming a part of other programmes.
Spyware: The term "spyware" refers to a potentially unwanted programme (PUP). It is an undesirable programme that aims to steal personal information (such as user passwords or bank account information) and Internet usage data without the user's knowledge or consent.
Adware: Adware is a type of spyware. However, adware does not aim to harm the PC. Its primary goal is to pique a user's curiosity so that comparable advertising, emails, and pop-ups can be shown to them.
Ransomware: Ransomware is a type of malicious software in which an attacker encrypts all of the victim's files and demands payment to decrypt them.
1.2 Malware Features
Malware identification has two stages: feature extraction and classification/clustering. Malware features are the data that machine learning algorithms can use; they define the space of concepts that a model can represent. Malware features can be divided into static and dynamic groups. Static features are those that can be extracted without executing the malware; they are derived from malware binaries through static analysis. Dynamic features are those obtained after a binary file has been executed; they are extracted through dynamic analysis. Malware is categorized and grouped using machine learning methods. Only classification techniques are described in this paper. Some of the features used in malware detection are described in this section.
1.2.1 N-Gram
The n-gram is one of the more typical static features and consists of n consecutive bytes taken from the output of a hex dump. The most widely used variant, the 4-gram, takes a combination of 4 bytes. N-grams can be extracted in an overlapping or non-overlapping fashion. When the n-grams overlap, a byte that has already been used once can be used again in the following gram; when they do not overlap, each byte belongs to exactly one gram.
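The difference between the two extraction modes can be illustrated with a minimal sketch. The function name byte_ngrams and the sample bytes are assumptions made for illustration; they are not taken from the system described in this paper.

```python
def byte_ngrams(data, n=4, overlapping=True):
    """Return the n-byte grams of `data` as a list of bytes objects.

    overlapping=True  slides the window one byte at a time, so bytes are reused.
    overlapping=False advances the window n bytes at a time, so each byte
    appears in at most one gram (a trailing partial gram is dropped).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    step = 1 if overlapping else n
    return [data[i:i + n] for i in range(0, len(data) - n + 1, step)]


# Eight illustrative bytes standing in for a hex dump of a binary.
sample = bytes.fromhex("cafebabe00000034")

overlapping_grams = byte_ngrams(sample, n=4, overlapping=True)
disjoint_grams = byte_ngrams(sample, n=4, overlapping=False)

print([g.hex() for g in overlapping_grams])
print([g.hex() for g in disjoint_grams])
```

With 8 input bytes, the overlapping mode yields five 4-grams and the non-overlapping mode yields two.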
1.2.2 Opcodes
Opcodes refer to the various machine-level operations carried out by executables. These opcodes can be found in the assembly code.
1.2.3 Strings
A string is a sequence of printable characters. Prior work found that the headers in PE-format files contain plain text that can be used to retrieve data. Non-PE executables also contain encoded strings whose information can be used. Under a more precise definition, strings can only be used if they are "interpretable" and make some sort of semantic sense.
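As a rough illustration of string extraction, the sketch below pulls runs of printable ASCII characters out of raw bytes, similar in spirit to the Unix strings utility. The function name, the minimum length of 4, and the sample bytes are assumptions chosen for the example.

```python
import re


def printable_strings(data, min_len=4):
    """Return runs of at least `min_len` printable ASCII characters in `data`."""
    # 0x20 (space) through 0x7e (~) is the printable ASCII range.
    pattern = re.compile(b"[\\x20-\\x7e]{" + str(min_len).encode("ascii") + b",}")
    return [match.decode("ascii") for match in pattern.findall(data)]


# Illustrative bytes: binary noise around two readable strings and one
# run ("ab") that is too short to be reported.
blob = (
    b"\x00\x01MZ\x90\x00http://example.test/payload\x00\x02ab\x00"
    b"android.permission.SEND_SMS\xff"
)

found = printable_strings(blob)
print(found)
```

Whether an extracted string is "interpretable" in the semantic sense still has to be judged separately; this sketch only applies the printable-character definition.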
1.2.4 Memory Access
Primary memory is used to store a lot of information, including configuration, network activity, and Windows registry keys. Thus, by looking at how memory is utilized, crucial information about malware can be gleaned.
1.2.5 API Calls
The Application Programming Interface (API) acts as a link between the operating system and applications. Some tasks, such as writing a file to disc, can only be completed directly by the operating system, and the system library provides a collection of these functions. A call to the system library made by an application is known as an API call. For instance, the CopyFileW API is used when a file needs to be copied. Many researchers use API calls, which are a crucial component of malware detection. A recorded sequence of API calls is called an API trace. An API trace can identify advanced malware behaviors like "walking through folders" and "copying itself to disc".
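One simple way to look for such behaviors in an API trace is ordered subsequence matching, sketched below. The mapping from behaviors to call sequences is an assumption made for illustration: the text names only CopyFileW, and the other Windows API names are plausible stand-ins rather than patterns taken from this paper.

```python
# Hypothetical behavior signatures: each behavior is an ordered list of API
# calls that must all appear in the trace, in this order, with gaps allowed.
BEHAVIOR_PATTERNS = {
    "walking through folders": ["FindFirstFileW", "FindNextFileW"],
    "copying itself to disc": ["GetModuleFileNameW", "CopyFileW"],
}


def contains_in_order(trace, pattern):
    """Return True if `pattern` occurs in `trace` as an ordered subsequence."""
    remaining = iter(trace)
    # `call in remaining` consumes the iterator up to the first match, so
    # each later call is searched for only after the previous one.
    return all(call in remaining for call in pattern)


def detect_behaviors(trace, patterns=BEHAVIOR_PATTERNS):
    """Return the names of all behaviors whose pattern matches the trace."""
    return [name for name, pattern in patterns.items()
            if contains_in_order(trace, pattern)]


suspicious_trace = [
    "GetModuleFileNameW", "FindFirstFileW", "FindNextFileW",
    "ReadFile", "FindNextFileW", "CopyFileW",
]
detected = detect_behaviors(suspicious_trace)
print(detected)
```

A practical detector would more likely feed the trace to a classifier than rely on fixed patterns; the sketch only shows why the order of calls carries information.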
2. LITERATURE SURVEY
[1] “Android Malware Detection through Machine Learning Techniques: A Review”: The authors of this article examined a variety of methods, including Random Forest, SVM, and Decision Tree, and determined that both high accuracy and efficiency can be attained.
[2] “On building machine learning pipelines for Android malware detection: a procedural survey of practices, challenges and opportunities”: The authors present an innovative procedural taxonomy for ML-based Android malware detection. They discuss the sources of malicious and benign APKs as well as the kinds of static and dynamic features that researchers have extracted from them. They also look into how to eliminate the less useful features, and into how ML systems can be applied and classified.
[3] “Analysis of Android Malware Detection Techniques: A Systematic Review”: The authors of this paper provide a comparative analysis of various Android mobile malware detection methods. Through a critical analysis, this research identifies the weaknesses and advantages of each detection method. The findings support the claim that detection techniques created for Android malware do not always achieve 100% accurate detection.
[4] “Detection of Android Malware using Machine Learning and Deep Learning Review”: In this paper, experiments were conducted to select permission- and API-related features for machine learning. According to the findings, automated evolutionary feature selection was more advantageous than conventional information gain. The genetic algorithm still beats using no selection in terms of model generation time, even though performance with its selected attributes was generally reduced by less than 3%.
[5] “Android Malware Detection Using Category-Based Machine Learning Classifiers”: The authors suggest that category-based machine learning classifiers boost the effectiveness of classification. Machine learning algorithms are used to train classifiers on attributes of malicious apps and to construct models capable of detecting dangerous patterns in the static analysis of Android malware. The authors create a profile of the feature set for each category from the top-rated applications in that category. To determine whether an app has benign characteristics, its features are compared to those required to provide the functionality of the category to which the app belongs.
[6] “Android Mobile Malware Detection Using Machine Learning: A Systematic Review”: According to the authors, classification accuracy is improved by using category-based machine learning models. To build models that can identify potentially hazardous patterns in the static analysis of Android malware, machine learning algorithms are used to train classifiers on characteristics of malicious apps. The authors compile a list of features for the most popular applications in each category. By contrasting an app's features with those needed to provide the functionality of the category to which it belongs, it can be determined whether the app has benign characteristics.
[7] “An Android Malware Detection Leveraging Machine Learning”: The authors looked into the effectiveness of four machine learning algorithms that try to identify malware based on permissions and action repetition, also known as static and dynamic features. They divided the work into three phases. The findings show that classification precision was very good, and the results also demonstrated a high level of accuracy. Therefore, for classification, using static analysis alone should be effective and inexpensive.
[8] “A Review of Android Malware Detection Approaches based on Machine Learning”: This paper studies Android malware detection. The authors used supervised learning techniques such as Multinomial NB, Random Forest, and SVM. In addition to execution time, metrics such as precision, recall, and F-measure are used to assess the outcomes. In comparison to the other models, the SVM model performed better in terms of processing time.
[9] “A survey of Android Malware Static Detection Technology Based on Machine Learning”: In this paper, the authors used a vertical comparison technique to analyse the algorithm models, fundamental concepts, datasets, and performance metrics of the existing methods. In comparison, the machine learning-based static detection technique has advantages in accuracy and requires fewer detection experts.
[10] “Machine-Learning based analysis and classification of Android malware signatures”: This study examined 259,608 malware signatures that were detected in a variety of Android apps. To allow cross-engine analysis, the signatures were normalised into a common namespace using the Signature Miner tool. The malware signatures were then examined by grouping the families into three categories, according to the hazard and nature of each threat: Adware, Harmful, and Unknown.
3. EXISTING METHODOLOGY
After reviewing the contributions made by different authors in the literature survey, we found that malware detection in current systems is done using either a signature-based method or a heuristic method. The training of the existing models is primarily based on simple classification algorithms, which makes it difficult to fit the dataset more accurately. New malware whose signatures are not yet in malware repositories goes unrecognised, so these systems fail to identify it, which may result in high false negative rates.
4. PROPOSED METHODOLOGY
The proposed approach determines whether an application is malware or not by classifying it using the best-performing machine learning algorithm. In our work, we use grid search with SVC to fine-tune the model. We primarily concentrate on using grid search to increase the accuracy with which the Android applications in the provided dataset are classified.
A. Dataset Creation
We first download the dataset of benign and malicious software. The collection includes Adware, Ransomware, Scareware, PremiumSMS, and SMS Malware APKs, as well as Benign-2015, Benign-2016, and Benign-2017 APKs. The downloaded dataset is then unzipped, and malicious and benign folders are created for the corresponding applications. Adware, Ransomware, Scareware, PremiumSMS, and SMS Malware form the sources for malware, while Benign-2015, Benign-2016, and Benign-2017 form the sources for benign software.
B. Data Pre-processing
The numbers of malicious and benign applications are then counted, and both counts are returned. Next, we use the androguard tool to extract the permissions, keeping only those that begin with permission. The file "permissions.txt" contains all the permissions that have been extracted from all applications. We then check each APK individually against each permission: an entry is marked "1" if the APK uses that permission and "0" otherwise. In the CSV file, these 0s and 1s are labelled with the permission names. The final column of the CSV holds the class, benign or malicious. The CSV file containing this preprocessed information is kept for use in model training.
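The encoding step can be sketched as follows, assuming the permissions of each APK have already been extracted (for example with androguard, which is not called here). The app names, permission sets, column name "class", and the function build_permission_rows are all illustrative assumptions.

```python
import csv
import io


def build_permission_rows(apps):
    """Turn (apk_name, permissions, label) tuples into a header and 0/1 rows.

    The header lists every permission seen in any app, followed by a final
    "class" column holding the benign/malicious label.
    """
    vocabulary = sorted({perm for _, perms, _ in apps for perm in perms})
    header = vocabulary + ["class"]
    rows = [
        [1 if perm in perms else 0 for perm in vocabulary] + [label]
        for _, perms, label in apps
    ]
    return header, rows


# Hypothetical, already-extracted permissions for two apps.
apps = [
    ("sms_trojan.apk",
     {"android.permission.SEND_SMS", "android.permission.READ_CONTACTS"},
     "malicious"),
    ("flashlight.apk", {"android.permission.CAMERA"}, "benign"),
]

header, rows = build_permission_rows(apps)

# Written to an in-memory buffer here; a real pipeline would write a file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Sorting the permission vocabulary keeps the column order stable between runs, which matters when the same CSV layout is reused at prediction time.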
C. Training the model
Support Vector Machine (SVM) and artificial neural network (ANN) models are used. These models are trained on the labeled examples to learn the required weights and biases. The model is built using binary cross-entropy as the loss, and only epochs that yield better results than the best previous epoch are kept. The model has been trained over 100 epochs. If there is no improvement in accuracy over a set number of epochs, training stops early.
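The loss and the stopping rule described above can be sketched without a deep learning framework. The functions below are illustrative assumptions, not the training code used in this work: binary_cross_entropy computes the mean loss for a batch of predicted probabilities, and run_with_early_stopping replays a list of per-epoch validation accuracies to show which epoch would be kept and when training would stop. The patience value is also an assumption.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def run_with_early_stopping(val_accuracies, patience=5, max_epochs=100):
    """Replay validation accuracies; return (best_epoch, best_acc, stopped_at).

    An epoch is kept as the new best only if it strictly improves accuracy.
    Training stops after `patience` consecutive epochs without improvement.
    """
    best_acc, best_epoch, stopped_at, stale = float("-inf"), 0, 0, 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        stopped_at = epoch
        if acc > best_acc:
            best_acc, best_epoch, stale = acc, epoch, 0  # checkpoint this epoch
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_acc, stopped_at


loss = binary_cross_entropy([1, 0], [0.9, 0.1])
history = [0.70, 0.80, 0.85, 0.84, 0.85, 0.83, 0.90]  # made-up accuracies
result = run_with_early_stopping(history, patience=3)
print(round(loss, 4), result)
```

In this replay, training stops at epoch 6 and the checkpoint from epoch 3 is kept, so the later 0.90 is never reached; a larger patience trades training time for the chance to pass such plateaus.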
D. Evaluation of the model against the validation set and predicting the accuracy and loss
The model's performance is measured against the validation and test datasets. The F1 score is computed. Finally, the accuracy and loss are calculated. The model is then saved for front-end use with Flask.
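For reference, the F1 score and accuracy can be computed from the confusion counts as in the sketch below. The function name evaluate and the example labels are assumptions; 1 denotes a malicious app and 0 a benign one.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return precision, recall, F1, and accuracy for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Made-up labels for eight apps.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 are all 0.75.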
E. Fine-tuning to find the best possible accuracy.
In this section, we attempt to modify the model in a few ways that help improve its overall accuracy. For the SVM, we use SVC with the grid search technique, which searches for the best parameters for the 'rbf' kernel in order to fine-tune the model. Re-running the model with the tuned parameters is expected to improve its accuracy.
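A minimal sketch of this tuning step with scikit-learn's GridSearchCV and SVC is shown below. The synthetic data, the candidate values for C and gamma, and the 4-fold cross-validation are assumptions for illustration; the actual grid and dataset used in this work are not specified here.

```python
from itertools import product

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the permission CSV: every combination of four
# binary permission flags, repeated four times (64 rows in total).
X = [list(bits) for bits in product([0, 1], repeat=4)] * 4
# Illustrative labelling rule: "malicious" (1) when the first two flags are set.
y = [1 if row[0] and row[1] else 0 for row in X]

# Assumed candidate values; every (C, gamma) pair is tried.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=4)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))

# The refitted best model can then be used for prediction.
prediction = search.best_estimator_.predict([[1, 1, 0, 0]])
```

Grid search is exhaustive, so its cost grows with the product of the candidate list lengths; here 12 parameter combinations are each evaluated on 4 folds.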
F. Implementation of a front-end application to determine whether a particular APK contains malware or not.
After being saved, the model is loaded into a front-end application. A user-friendly, interactive website is built with web technologies such as HTML, CSS, and Flask so that the user can submit the APK file for which the prediction is to be made. The prediction is then shown to the user.
5. CONCLUSIONS
On the Android platform, researchers have looked into the use of ML for automated malware detection. Machine learning requires building an intricate, multi-stage workflow, which has made it challenging for new researchers to understand the state of the art in this area. Smartphone usage has risen sharply over time, and malware writers have exploited the significant increase in Android users to target and harm them. This paper reviewed current frameworks and algorithms for machine learning-based malware detection. The best result reported was 100% accuracy with a 0% false alarm rate; the other identification rates were 83% and 90%. Identifying malware with high accuracy and a high detection rate requires a good machine learning algorithm, and combining a few effective methods and algorithms can raise efficiency, accuracy, and detection rates.
REFERENCES
[1]. Abikoye, Oluwakemi & Gyunka, Benjamin & Oluwatobi, Akande. (2020). Android Malware Detection through Machine Learning Techniques: A Review. International Journal of Online and Biomedical Engineering (iJOE). 16. 14. 10.3991/ijoe.v16i02.11549.
[2]. Koushki, Masoud & Abualhaol, Ibrahim & Durai Raju, Anandharaju & Zhou, Yang & Giagone, Ronnie & Shengqiang, Huang. (2022). On building machine learning pipelines for Android malware detection: a procedural survey of practices, challenges and opportunities. Cybersecurity. 5. 16. 10.1186/s42400-022-00119-8.
[3]. Ashawa, Moses & Morris, Sarah. (2019). Analysis of Android Malware Detection Techniques: A Systematic Review. International Journal of Cyber-Security and Digital Forensics. 8. 177-187. 10.17781/P002605.
[4]. Joshi, Prof. (2022). Detection of Android Malware using Machine Learning and Deep Learning Review. International Journal of Recent Technology and Engineering (IJRTE). 11. 134-139. 10.35940/ijrte.A6963.0511122.
[5]. Ali, Huda & Oh, Tae & Fokoue, Ernest & Stackpole, Bill. (2016). Android Malware Detection Using Category-Based Machine Learning Classifiers. 54-59. 10.1145/2978192.2978218.
[6]. Senanayake, Janaka & Kalutarage, Harsha & Al-Kadri, M. Omar. (2021). Android Mobile Malware Detection Using Machine Learning: A Systematic Review. Electronics. 10. 1606. 10.3390/electronics10131606.
[7]. Shatnawi, Ahmed & Jaradat, Aya & Bani Yaseen, Tuqa & Taqieddin, Eyad & Al-Ayyoub, Mahmoud & Mustafa, Dheya. (2022). An Android Malware Detection Leveraging Machine Learning. Wireless Communications and Mobile Computing. 2022. 1-12. 10.1155/2022/1830201.
[8]. Liu, Kaijun & Xu, Shengwei & Xu, Guoai & Zhang, Miao & Sun, Dawei & Liu, Haifeng. (2020). A Review of Android Malware Detection Approaches Based on Machine Learning. IEEE Access. PP. 1-1. 10.1109/ACCESS.2020.3006143.
[9]. Wu, Qing & Zhu, Xueling & Liu, Bo. (2021). A survey of Android malware static detection technology based on machine learning. Mobile Information Systems. 2021. 1-18.
[10]. Martín, Nacho & Hernández, José & Santos, Sergio. (2019). Machine-Learning based analysis and classification of Android malware signatures. Future Generation Computer Systems. 97. 10.1016/j.future.2019.03.006.