A SUPERVISED MACHINE LEARNING APPROACH USING K-NEAREST NEIGHBOR ALGORITHM TO DETECT FAKE REVIEWS ON AMAZON

Javalekar5

Department of Information Technology, Zeal College of Engineering and Research, Pune-41, Maharashtra, India.

Abstract - Online data plays a crucial role in the enterprise marketing process. Fake product reviews, as one type of such data, seriously affect the reliability of both the decision making and the data analysis of an enterprise. Some users spread fake news through online platforms, and these users are often unverified. Both customers and vendors are seriously affected by fake reviews: the customer cannot rely on reviews to buy a genuine product, and the vendor loses sales because of the negative impact of such reviews. Detecting fake reviews has therefore become a necessity. This paper describes the use of supervised machine learning for detecting fake reviews. We offer a method that distinguishes between real and fake product reviews and lets users know whether or not a review is trustworthy. The methodology was developed in response to a shortcoming of traditional fake review detection methods, which categorised reviews as true or false based on either categorical datasets or sentiment polarity ratings. Our approach helps close this gap by taking into account both polarity ratings and classifiers for fake review identification. As part of this work, a survey of previously published articles was completed. Our system achieved an accuracy of 88% using the machine learning technique called Support Vector Machine [2].

Key Words: Enterprise Marketing, KNN, Fake Reviews, Decision Making, Data Analysis, Unverified, Supervised Machine Learning.

1. INTRODUCTION

In today's world, online shopping is one of the most important aspects of daily living. Many everyday consumers use online reviews to choose which product to buy. On e-commerce platforms, customer reviews play a significant role in determining a company's revenue. Nowadays, it is simple to trick and control a customer by publishing a fictitious review on a specific product. The Competition and Markets Authority (CMA) of the UK claims that fabricated or inaccurate reviews may have an impact on £23 billion in consumer spending in the nation every year. On Amazon, 61% of the reviews on devices are fake. One in seven reviews on Tripadvisor may be false. Numerous false reviews on online review sites like Tripadvisor, Yelp, etc. either increase or decrease the popularity of a hotel or product. Many website visitors are unable to quickly recognise fake reviews. As a result, the buyer is duped and their perception of real items is manipulated. We thus decided to develop a user-friendly fake review detection system that stops people from being duped by fake reviews and helps them tell fake reviews from factual ones.

2. LITERATURE REVIEW

The fake feature framework [1], which is made up of two types of reviews, is used to characterise and organise the features of false reviews. Using Logistic Regression, the techniques used to analyse user-centric features in Amazon electronic product reviews produced an F-score of 86%.

One system was built on opinion mining, which makes use of sentiment analysis to identify false reviews. The authors created a working model that collected annotations for individual reviews in the dataset. Sentiment analysis was implemented using VADER, which determines whether a passage of text is positive, negative, or neutral. The majority of sentiment analysis methods fall into one of two categories: polarity-based (where texts are classified as positive or negative) or valence-based (where the strength of the sentiment is taken into account). For instance, in a polarity-based approach, the terms "good" and "great" would be viewed equally, while in a valence-based approach, "excellent" would be viewed as more positive than "good."
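The "good" versus "excellent" example can be illustrated with a minimal sketch. The tiny lexicon and its scores are hypothetical values chosen for illustration; they are not taken from VADER or from the system described above.

```python
# Hypothetical mini-lexicon: word -> sentiment strength (illustrative values only).
LEXICON = {"good": 1.0, "great": 2.0, "excellent": 3.0, "bad": -1.0, "terrible": -3.0}


def valence_score(text: str) -> float:
    """Strength-aware view: sum the strength of every known word."""
    return sum(LEXICON.get(word, 0.0) for word in text.lower().split())


def polarity_label(text: str) -> str:
    """Sign-only view: keep just the direction of the overall score."""
    score = valence_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in ("good phone", "excellent phone", "terrible battery"):
        print(review, "|", polarity_label(review), "|", valence_score(review))
```

Under these assumed scores, "good phone" and "excellent phone" receive the same label, while their strength scores differ.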

According to another study [3], there are two main types of phoney reviews: textual (dependent on the reviews' substance) and behavioural (completely dependent on the reviewer's writing style, emotional expressions, and frequency of writing). The difference between bogus and authentic reviews was determined using a variety of machine learning algorithms.

Identification was carried out by considering both "important elements of review" and "reviewer behaviour." Without behavioural characteristics, Logistic Regression provided an accuracy of 86% with bi-grams, KNN provided an accuracy of 73% with tri-grams, and SVM provided an accuracy of 88.1% with bi-grams.
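For readers unfamiliar with the terms, bi-grams and tri-grams are sequences of two and three consecutive words. The following is a minimal sketch of how such features can be extracted; the whitespace tokenisation is a simplifying assumption and is not taken from the cited study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n):
    """Count n-grams in a lower-cased, whitespace-tokenised text."""
    return Counter(ngrams(text.lower().split(), n))


if __name__ == "__main__":
    print(ngram_counts("very good very good phone", 2))
    print(ngram_counts("very good very good phone", 3))
```

The resulting counts can serve as input features for classifiers such as Logistic Regression, KNN, or SVM.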

Another system was created to identify fake hotel reviews on Yelp, Tripadvisor, and other websites [4]. The system's architecture includes a web crawler that compiles all the review data and stores it in a MySQL database. Four separate techniques were used to find the fake reviews: text mining-based categorisation, spell checking, reviewer behaviour analysis, and hotel environment analysis. After all factors have been evaluated, the overall probability of fraudulent reviews for a certain hotel is computed using a grading algorithm employing the individual probabilities. When putting the text mining-based false detection method into practice, the standard pre-processing recommendations were followed. The percentage of fake reviews was about 14%. Prior studies have used and confirmed this data source for determining the veracity of the hotels.

Volume: 10 Issue: 02 | Feb 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 8.226 | ISO 9001:2008 Certified Journal | Page819
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

One of the studies focused on sentiment analysis [5] and a machine learning approach to finding fake reviews. This system used the ML algorithms Naïve Bayes, KNN, SVM, and Decision Tree (J48). SVM (81.75%) [5] outperforms the other algorithms both with and without stop words.

The assumption was made that the classification of fake reviews is either True or False [6]. When recognising fake reviews, it is important to consider the reviewer's credibility, the dependability of the product, and the reviewer's honesty. As a result, Naive Bayes delivered 98% accuracy whereas Random Forest produced 99% accuracy.

Another study [7] centred on two methods for classifying fake reviews: a content-based method, which considers POS tag frequency counts as a feature, and a behaviour feature-based method, which considers unfair ratings as a feature. The unsupervised machine learning algorithm used for this purpose is Expectation Maximisation, which gave an accuracy of 81.34%, and the supervised machine learning algorithms used are SVM and Naive Bayes, which gave an accuracy of 86.32%.

One of the studies [8] suggested combining classification algorithms with LDA, which yielded higher accuracy results. The traditional SVM, Logistic Regression and Multi-layer Perceptron models gave accuracies of 65.7%, 80.5% and 80.3% respectively. When combined with LDA, the SVM, Logistic Regression and Multi-layer Perceptron models gave accuracy of 67.9% and 81.3% respectively.

Three techniques are used in the system [9] to classify false reviews. The first is the review-centric approach, which considers the content of the review, the use of capital letters, and numerals. The second is the reviewer-centric approach, which considers the profile image, URL length, IP address, etc. The third is the product-centric approach, which considers the rank and price of the product as features. The algorithms used to detect the fake reviews were supervised, unsupervised and semi-supervised.

One of the systems we studied focuses on annotating the sentiments of a review [10] using VADER. The gathered reviews are cleaned and opinion mined, after which the sentiment analysis step takes place. The results of the sentiment analysis are then appended to the dataset and classified using vector calculation.

Research [11] primarily focused on reviews that were produced with the intention of seeming authentic while misleading the reader. The dataset underwent sentiment analysis in the system [12]. Two classification models were reported: a two-way classification that categorises reviews as positive or negative, and a three-way classification that categorises data as positive, negative, or neutral, with sentiment analysis in between.

Using R, an open source statistical programming language and software environment, additional experimental work was conducted. It is especially helpful for data processing, data analysis, calculations, and the graphical display of results.

Another method for identifying phoney reviews is to use reviewer behaviour and history analysis, as explained in [13]. The focus of the study is on exploiting Jaccard similarity to distinguish between human users and bots. If product reviews are overly favourable or unfavourable and are structured similarly, they may be deemed manipulated.
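Jaccard similarity measures the overlap of two sets as the size of their intersection divided by the size of their union. The sketch below applies it to the word sets of reviews; the whitespace tokenisation and the 0.8 threshold are assumptions made for illustration and are not taken from [13].

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity of the word sets of two texts: |A & B| / |A | B|."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty texts
    return len(set_a & set_b) / len(set_a | set_b)


def flag_similar_pairs(reviews, threshold=0.8):
    """Return index pairs of reviews whose similarity meets the threshold."""
    pairs = []
    for i in range(len(reviews)):
        for j in range(i + 1, len(reviews)):
            if jaccard_similarity(reviews[i], reviews[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    sample = [
        "best phone ever buy now",
        "best phone ever buy now",
        "battery died quickly",
    ]
    print(flag_similar_pairs(sample))
```

Pairs of near-identical reviews flagged in this way would be candidates for closer inspection rather than proof of manipulation.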

3. METHODOLOGY

Our recommended methodology, which is based on the study we did and on observations of how earlier systems operated, is as follows.

As the first stage, the reviews from the user-provided website link will be scraped. Selenium is used to execute the web scraping. After the reviews are saved in JSON format, the necessary information will be retrieved, such as whether the review is favourable or negative, only the necessary content from the review, etc. At this point, the ML model will be prepared, and the extracted data will be input into it. The training dataset will then be used by the algorithm to determine whether the reviews are fraudulent or legitimate. The user will be able to recognise this output because it will be visualised on the website.
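The extraction stage can be sketched as follows. The JSON layout, the field names (`reviewer`, `rating`, `text`), and the use of the star rating to decide whether a review is favourable are all assumptions made for illustration; the actual scraped format is not specified here.

```python
import json

# Hypothetical scraped output; the field names are assumptions for illustration.
RAW_JSON = """
[
  {"reviewer": "A. Buyer", "rating": 5, "text": "  Works great, totally worth it!  "},
  {"reviewer": "B. Buyer", "rating": 1, "text": "Stopped working after a week."},
  {"reviewer": "C. Buyer", "rating": 3, "text": "It is okay."}
]
"""


def extract_fields(raw_json: str):
    """Keep only the review text and a coarse label derived from the star rating."""
    records = []
    for item in json.loads(raw_json):
        rating = item["rating"]
        if rating >= 4:      # assumed cut-off for a favourable review
            label = "favourable"
        elif rating <= 2:    # assumed cut-off for a negative review
            label = "negative"
        else:
            label = "neutral"
        records.append({"text": item["text"].strip(), "rating_label": label})
    return records


if __name__ == "__main__":
    for record in extract_fields(RAW_JSON):
        print(record)
```

Only the fields needed by the model are kept, which keeps the later stages independent of how the page was scraped.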

The steps are summarised as follows:

1. Scraping of the reviews from Amazon

2. Extraction of necessary information

3. Transferring the data as input to the models

4. Fitting the algorithms to the training set

5. Predicting the test data results

6. Visualising the test dataset results
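Steps 3 to 5 can be sketched with a small k-nearest neighbour classifier written in plain Python. This is an illustration under stated assumptions, not the system's implementation: the bag-of-words features, the cosine similarity measure, the value k=3, and the toy training reviews are all invented for the example. In practice a library class such as scikit-learn's `KNeighborsClassifier` could fill the same role.

```python
from collections import Counter
import math


def vectorise(text):
    """Bag-of-words term counts for a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train_texts, train_labels, query, k=3):
    """Label a query by majority vote among its k most similar training reviews."""
    query_vec = vectorise(query)
    scored = sorted(
        ((cosine_similarity(vectorise(t), query_vec), label)
         for t, label in zip(train_texts, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented purely for illustration.
TRAIN_TEXTS = [
    "best product ever buy now amazing amazing",
    "amazing best ever must buy now",
    "buy now best deal ever amazing",
    "battery lasts two days and the screen is sharp",
    "the screen scratched after a month but battery is fine",
    "delivery was late and the battery drains quickly",
]
TRAIN_LABELS = ["fake", "fake", "fake", "genuine", "genuine", "genuine"]

if __name__ == "__main__":
    print(knn_predict(TRAIN_TEXTS, TRAIN_LABELS, "amazing product best ever buy now"))
    print(knn_predict(TRAIN_TEXTS, TRAIN_LABELS, "the battery is fine and the screen is sharp"))
```

KNN has no separate training phase: the labelled reviews are simply stored, and each new review is compared against them at prediction time.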


4. PROPOSED SYSTEM ARCHITECTURE

In order to determine the optimum model to be utilised to obtain the highest accuracy and quickest speed, the approach is explained in detail and is carried out in five major parts.

Using web scraping, data will be collected from online websites.

4.1 Data Collection

Consumer review data collection: Raw review data was collected from Kaggle's Amazon review datasets. This was done to increase the diversity of the review data. A dataset of 40,000 reviews was collected.

4.2 Data Preprocessing

The data is processed and refined by removing irrelevant and redundant information as well as noisy and unreliable data from the review dataset.

Step 1: Sentence tokenization. The entire review is given as input and it is tokenized.

Step 2: Removal of punctuation marks. Punctuation marks used at the start and end of the reviews are removed along with additional white spaces.

Step 3: Word tokenization. Each individual review is tokenized into words and stored in a list for easier retrieval.

Step 4: Removal of stop words. Stop words are removed, and affixes are stripped so that the remaining words are reduced to their stems.
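The preprocessing steps can be sketched in plain Python as follows. The stop-word list is a small hypothetical sample, punctuation is removed everywhere rather than only at the start and end (a simplification), and stemming is omitted. NLTK, which the project uses, provides fuller tokenizers and stop-word lists for the same purpose.

```python
import re
import string

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "of", "to", "this", "i"}


def sentence_tokenize(review):
    """Step 1: split a review into sentences on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", review.strip()) if s]


def strip_punctuation(sentence):
    """Step 2: drop punctuation and collapse extra white space."""
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def word_tokenize(sentence):
    """Step 3: split a cleaned sentence into lower-cased words."""
    return sentence.lower().split()


def remove_stop_words(words):
    """Step 4: discard words that appear in the stop-word list."""
    return [w for w in words if w not in STOP_WORDS]


def preprocess(review):
    """Run the four steps and return one flat list of tokens."""
    tokens = []
    for sentence in sentence_tokenize(review):
        tokens.extend(remove_stop_words(word_tokenize(strip_punctuation(sentence))))
    return tokens


if __name__ == "__main__":
    print(preprocess("This is great!  I love the battery."))
```

Keeping each step as its own function makes it straightforward to swap in a library tokenizer or add a stemmer later.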

4.3 Sentiment Analysis

Sentiment analysis classifies the reviews according to their emotion factor, that is, whether their sentiment is positive, negative or neutral. It includes predicting whether a review is positive or negative according to the words used in the text, the emojis used, the rating given with the review, and so on. Related research shows that fake reviews have stronger positive or negative emotions than true reviews. The reason is that fake reviews are used to affect people's opinions, so it is more significant for them to convey opinions than to plainly describe the facts. The subjective vs objective ratio matters: advertisers post fake reviews with more subjective information, conveying emotions such as how happy the product made them rather than how the product is or what it does. Positive sentiment vs negative sentiment: the sentiment of the review is analysed, which in turn helps in deciding whether it is a fake or genuine review.

4.4 Feature extraction/Engineering

Feature extraction mainly involves reducing the amount of resources required to describe a large dataset, and selecting the appropriate features for predicting the results.
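One simple way to reduce a large vocabulary to a small set of features is to keep only the most frequent terms and represent each review as counts over them. The sketch below shows this idea; frequency-based selection and the `max_features` parameter are assumptions chosen for illustration, not the feature selection method used by this system.

```python
from collections import Counter


def build_vocabulary(tokenised_reviews, max_features=5):
    """Select the max_features most frequent terms across all reviews."""
    counts = Counter(token for review in tokenised_reviews for token in review)
    # Sort by descending frequency, then alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:max_features]]


def to_feature_vector(tokens, vocabulary):
    """Represent one review as term counts over the selected vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    reviews = [["great", "phone", "great"], ["bad", "phone"], ["great", "battery"]]
    vocab = build_vocabulary(reviews, max_features=2)
    print(vocab)
    print([to_feature_vector(r, vocab) for r in reviews])
```

Every review is mapped to a vector of the same length, which is the form a classifier expects.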

4.5 Fake Review Detection

Classification assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Each data item in the review file is assigned a weight, depending upon which it is classified into the respective class, Fake or Genuine.

5. EXPERIMENTAL RESULTS

The trained model is used for prediction of the fake reviews and genuine reviews. Different libraries are available in Python that help in machine learning and classification projects. Several of those libraries have improved the performance of this project. For the user interface, we have used web technologies to build a website.

Fig -1: Data Flow Diagram
Fig -2: Proposed System Architecture

6. IMPORTANT LIBRARIES

First, NumPy provides a collection of high-level mathematical functions that support multi-dimensional matrices and arrays. It is used for faster computations over the weights (gradients) in neural networks. Second, scikit-learn is a machine learning library for Python which features different algorithms and machine learning function packages. NLTK, the Natural Language Toolkit, is helpful in word processing and tokenization. The project makes use of the Anaconda environment, an open source distribution for Python which simplifies package management and deployment. It is well suited to large scale data processing.

7. CONCLUSION

Using a supervised machine learning algorithm, we can easily differentiate between fake and real reviews. The fake review detection system is designed for filtering fake reviews. In this research work, SVM classification provided better classification accuracy than the other algorithms on the testing dataset. The approach also helps find the most truthful reviews, enabling the purchaser to make decisions about the product. It will benefit the customer as well as the company, since they will get genuine customer reviews about the product. Customers will be able to buy the best product based on the classification of fake reviews.

REFERENCES

[1] Rodrigo Barbado, Oscar Araque, Carlos A. Iglesias, “A framework for fake review detection in online consumer electronics retailers”, Volume 56, Issue 4, July 2020, Pages 1234-1244, ISSN 0306-4573, https://doi.org/10.1016/j.ipm.2020.03.002

[2] N. Kousika, D. S, D. C, D. B M and A. J, "A System for Fake News Detection by using Supervised Learning Model for Social Media Contents," 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), 2021, pp. 1042-1047, doi: 10.1109/ICICCS51141.2021.9432096.

[3] Ammar Mohammed, Atef Ibrahim, Usman Tariq, Ahmed Mohamed Elmogy, “Fake Reviews Detection using Supervised Machine Learning”, January 2021, International Journal of Advanced Computer Science and Applications 12(1), DOI: 10.14569/IJACSA.2021.0120169

[4] Möhring M., Keller B., Schmidt R., Gutmann M., Dacko S., “HOTFRED: A Flexible Hotel Fake Review Detection System”. In: Wörndl W., Koo C., Steinmetz J.L. (eds) Information and Communication Technologies in Tourism. Springer, Cham. https://doi.org/10.1007/978-3-030-65785-7_29.

[5] E. Elmurngi and A. Gherbi, "An empirical study on detecting fake reviews using machine learning techniques," Seventh International Conference on Innovative Computing Technology (INTECH), pp. 107-114, doi: 10.1109/INTECH.2017.8102442.

[6] Neha S., Anala A., “Fake Review Detection using Classification”, June, International Journal of Computer Applications 180(50):16-21, DOI: 10.5120/ijca2018917316

[7] R. Hassan and M. R. Islam, "Detection of fake online reviews using semi-supervised and supervised learning," International Conference on Electrical, Computer and Communication Engineering (ECCE), pp. 1-5, doi: 10.1109/ECACE.2019.8679186.

[8] S. Jia, X. Zhang, X. Wang and Y. Liu, "Fake reviews detection based on LDA," 4th International Conference on Information Management (ICIM), 2018, pp. 280-283, doi: 10.1109/INFOMAN.2018.8392850.

[9] N. A. Patel and R. Patel, "A Survey on Fake Review Detection using Machine Learning Techniques," 4th International Conference on Computing Communication and Automation (ICCCA), 2018, pp. 1-6, doi: 10.1109/CCAA.2018.8777594.

Fig -3: Product needs to be given by user (prototype)
Fig -4: Web scraping of given product (prototype)

[10] Dhairya Patel, Aishwerya Kapoor, Sameet Sonawane, “Fake Review Detection using Opinion Mining”, https://www.irjet.net/archives/V5/i12/IRJET-V5I12154.pdf

[11] Ott, Myle & Choi, Yejin & Cardie, Claire & Hancock, Jeffrey. (2011) “Finding Deceptive Opinion Spam by Any Stretch of the Imagination”.

[12] Dipak R. Kawade et al., “Sentiment Analysis Machine Learning Approach”, International Journal of Engineering and Technology (IJET), DOI: 10.21817/ijet/2017/v9i3/1709030151, Vol 9 No 3 Jun-Jul 2017

[13] Sadman, Nafiz & Gupta, Kishor Datta & Sen, Sajib & Poudyal, Subash & Haque, Mohd. (2020). Detect Review Manipulation by Leveraging Reviewer Historical Stylometrics in Amazon, Yelp, Facebook and Google Reviews. 10.1145/3387263.3387272.

BIOGRAPHIES

1Dr. T. Praveen Blessington, Professor, IT Department, Zeal College of Engineering and Research, Pune. He has 17 years of experience in teaching and research. His areas of interest are VLSI Design, IoT, Machine Learning and Blockchain technology. He has published 25 research papers so far.

2Gaurav Pawar

Pursuing Bachelor of Engineering in Information Technology from Zeal College of Engineering and Research, Pune-41.

3Shrinivas Pawar

Pursuing Bachelor of Engineering in Information Technology from Zeal College of Engineering and Research, Pune-41.

4Onkar Davkare

Pursuing Bachelor of Engineering in Information Technology from Zeal College of Engineering and Research, Pune-41.

5Rutuja Javalekar

Pursuing Bachelor of Engineering in Information Technology from Zeal College of Engineering and Research, Pune-41.
