UTILIZING TWITTER TO PERFORM AUTONOMOUS SENTIMENT ANALYSIS


Abstract - Applications in many domains make Sentiment Analysis an exciting area for study. The use of online polls and surveys to get feedback from the public regarding goods, current events, and societal or political issues is on the rise. Both the public and stakeholders benefit from hearing the thoughts and feelings of the general public when important choices must be made. Opinion mining is the practice of gleaning insights from online sources, including web search engines, blogs, micro-blogs, Twitter, and social networks, to produce meaningful conclusions. Twitter's user base provides a wealth of material from which to gain insight into the public's perspective. The massive volume of tweets as unstructured text makes it challenging to manually delineate the information. Consequently, extracting and condensing the tweets from corpora calls for expert computational methodologies, which in turn necessitates familiarity with terms that convey emotion. Sentiment analysis of unstructured text may be accomplished using a wide variety of computational methodologies, models, and algorithms. The vast majority are based on machine learning methods, typically with the Bag-of-Words (BoW) representation. In this research, we used a lexicon-based strategy to automatically identify sentiment for tweets gathered from the Twitter public domain. To further investigate the efficacy of alternative feature combinations, we have used three distinct machine learning algorithms for the task of tweet sentiment identification: Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machines (SVM). Our results suggest that both NB with Laplace smoothing and SVM are successful in categorizing the tweets. The features used for NB are unigrams and Part-of-Speech (POS) tags, while unigrams are used for SVM.

Key Words: Bag-of-Words, Lexicon, Machine Learning Algorithms, Laplace Smoothing, Part-of-Speech.

1. INTRODUCTION

It has been found via two separate polls of over 2000 American adults that 81% of Internet users (or 60% of Americans) have done product research online at least once and that 20% of Internet users (15% of Americans) do so on a typical day. Consumption of goods and services is not the only factor behind people's online information-seeking and opinion-sharing activities. The need for access to current political information is another critical factor to consider. At the moment, individuals may utilize email for political campaigns by sharing information and discussing candidates and issues online. Users trust internet advice and suggestions since these deal mostly with opinion. Despite the generally pleasant experiences of American Internet users during online product research, Horrigan [1] found that 58% of users reported experiencing missing, difficult-to-discover, confusing, or overwhelming online information. Therefore, there is a significant need for improved information-access technologies to aid shoppers and researchers. Web 2.0 sites like blogs, message boards, and other kinds of social media have made it easier than ever for customers to voice their thoughts and views on the brands they use. In recent years, businesses have begun to acknowledge the power that user reviews have in shaping the perceptions of others and the standing of certain brands. Companies are beginning to watch social media to react to customer feedback and adjust their marketing, brand positioning, product development, and other strategies appropriately.

1.1. Opinion Mining and Sentiment Analysis

Extracting views from text is called "Opinion Mining" (OM). Opinion mining is a new field at the intersection of information retrieval, text mining, and computational linguistics that seeks to detect the opinions expressed in natural language texts, as described by Pang et al. [3]. Opinion mining is a subfield of Knowledge Discovery in Databases (KDD) that employs Natural Language Processing (NLP) and statistical machine learning methods to identify and distinguish between opinionated and factual content. Tasks in opinion mining include locating opinions, labeling them as favorable, negative, or neutral, determining where those opinions originated, and summarising them. The primary goal of opinion mining is to automatically extract a summary of opinion about an entity from a large body of unstructured text.

Opinion Mining and Sentiment Analysis (SA) are two names for the same thing: the study of how people feel about something. An individual's thoughts, feelings, and impressions about a matter, as expressed in the form of an opinion, are deeply personal and confidential. Individuals, groups, and societies may benefit greatly from the advice and counsel of others throughout the decision-making process, as concluded by the work of Liu et al. [2]. To act swiftly and wisely, humans demand information that is both precise and brief. While making a choice, people often seek advice from friends, family, and experts who have developed an opinion or point of view based on their own experiences, observations, conceptions, and beliefs (which may be positive or negative).

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056 Volume: 10 Issue: 02 | Feb 2023 www.irjet.net p-ISSN: 2395-0072 © 2023, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 95
Akanksha¹, Mr. Sambhav Agarwal²
¹M.Tech, Computer Science and Engineering, SR Institute of Management & Technology, Lucknow, India
²Associate Professor, Computer Science and Engineering, SR Institute of Management & Technology, Lucknow

2. SENTIMENT TARGET IDENTIFICATION

Identifying sentiment (opinion) targets is a crucial part of SA work. The target might be anything from the subject of a statement to its object. Everyone involved in making and selling a product has to evaluate it thoroughly in light of public and buyer feedback. Automatically identifying and extracting the aspects mentioned in reviews is a key step in conducting a review comparison. Opinion mining and summarization thus rely heavily on product feature mining [10]. Sentiment analysis is a difficult field of study because, to correctly identify opinion targets in a phrase or document, a system has to discern evaluative expressions as well as qualities that are not overtly present and must be inferred from the semantics of the terms. Previous studies on sentiment target identification have shown that several Natural Language Processing (NLP) steps, including preprocessing, Part-of-Speech tagging, noise reduction, feature selection, and classification, are all necessary stages in the extraction process.

3. METHODOLOGY

Collecting research data is more complex than it may seem, since the data must support important and relevant inferences. Three types of data have been gathered: test data, subjective training data, and objective (neutral) training data. The Twitter API is described first.

3.1. Twitter API

Developers may access Tweets, DMs, media, and other Twitter data using the Twitter API, which provides a collection of programming interfaces. Through the API, programmers may create products that communicate with the Twitter service and carry out actions like publishing Tweets, getting user information, and viewing trending topics, among other things. Different endpoints, authentication mechanisms, and usage constraints apply to the API's several flavors, which include REST (Representational State Transfer), streaming, and advertising. A Twitter developer account and API credentials (API keys and access tokens) are prerequisites for interacting with the API.

3.2. Twython

Twython is a Python library for accessing the Twitter API. It provides a simple and convenient way for Python developers to interact with the Twitter platform and perform tasks such as posting Tweets, retrieving user information, and accessing timelines. Twython abstracts many of the complexities of the Twitter API and provides a simple, Pythonic interface for accessing the API's resources. To use Twython, you will need to obtain API keys and access tokens from a Twitter developer account, and then use these credentials to initialize a Twython client object, which you can use to make API requests. The library supports both the REST and Streaming APIs and includes functionality for OAuth 1.0a and OAuth 2.0 authentication.
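As a rough illustration of the workflow just described, the sketch below builds a Twython client and pulls tweet text out of a search response. It is not the collection script used in this work: the helper names (make_client, tweet_texts, fetch_texts) are hypothetical, the credentials are placeholders, and the response shape (a dictionary with a "statuses" list whose items carry a "text" field) is assumed to follow the v1.1 search endpoint.

```python
from typing import Any, Dict, List


def make_client(app_key: str, app_secret: str,
                oauth_token: str, oauth_token_secret: str):
    """Build a Twython client from OAuth 1.0a credentials (placeholders here)."""
    # Imported lazily so the pure helpers below work without the library installed.
    from twython import Twython
    return Twython(app_key, app_secret, oauth_token, oauth_token_secret)


def tweet_texts(search_response: Dict[str, Any]) -> List[str]:
    """Extract tweet text from a search response.

    Assumes the v1.1 search shape: {"statuses": [{"text": ...}, ...]}.
    """
    return [status["text"] for status in search_response.get("statuses", [])]


def fetch_texts(client: Any, query: str, count: int = 50) -> List[str]:
    """Run one search request through the client and return the tweet texts."""
    response = client.search(q=query, lang="en", count=count)
    return tweet_texts(response)
```

A call such as fetch_texts(make_client(...), "#happy") would return up to 50 tweet strings, subject to the access level and rate limits of the developer account.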

3.3. Data Preprocessing in Twitter

Data preprocessing in Twitter involves cleaning and transforming Twitter data into a format that is suitable for further analysis or modeling. This may include tasks such as:

1. Data Collection: Collecting raw data from the Twitter API, such as tweets, user profiles, and trends.

2. Data Cleaning: Removing irrelevant information, correcting errors, handling missing values, and removing duplicates from the collected data.

3. Text Processing: Processing textual data from tweets, such as removing stop words, stemming, and converting text to lowercase.

4. Sentiment Analysis: Classifying tweets into positive, negative, or neutral sentiment categories.

5. Data Transformation: Converting the data into a format that is suitable for analysis, such as converting textual data into numerical representations.

6. Data Reduction: Reducing the dimensionality of the data, such as aggregating data by user or period.

These steps ensure that the data is in a clean, consistent, and usable format, and help improve the accuracy and reliability of any subsequent analysis or modeling.
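The cleaning and text-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the stop-word list is a tiny hypothetical sample, stemming is omitted, and the function names are chosen for the example only.

```python
import re
from typing import Iterable, List

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "it", "this"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z']+")


def clean_tweet(text: str) -> List[str]:
    """Lowercase a tweet, strip URLs and @mentions, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the symbol
    tokens = TOKEN_RE.findall(text)
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(tweets: Iterable[str]) -> List[List[str]]:
    """Clean every tweet, skipping empty results and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in tweets:
        tokens = clean_tweet(raw)
        key = tuple(tokens)
        if not tokens or key in seen:
            continue
        seen.add(key)
        cleaned.append(tokens)
    return cleaned
```

Duplicates are detected after cleaning, so two tweets that differ only in case, punctuation, or links collapse into one entry.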

3.4. Lexicon-Based Approach

The lexicon-based approach is a method used in sentiment analysis and opinion mining to classify the sentiment of a piece of text, such as a tweet, into positive, negative, or neutral categories. The approach involves using a predefined lexicon, or list of words, that are associated with specific sentiments.

In a lexicon-based approach, the sentiment of a piece of text is determined by finding the words in the text that match words in the lexicon and then aggregating the sentiment scores associated with these words. The resulting sentiment score is then used to classify the text as positive, negative, or neutral.
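The matching-and-aggregation step can be sketched as follows. The lexicon below is a hypothetical toy example with integer polarities (in the style of word lists such as AFINN, but not taken from any real lexicon), and the zero threshold for the neutral class is an assumption made for the illustration.

```python
from typing import Dict, List

# Hypothetical toy lexicon: word -> integer polarity (not from a real resource).
TOY_LEXICON: Dict[str, int] = {
    "good": 2, "great": 3, "love": 3,
    "bad": -2, "terrible": -3, "hate": -3,
}


def lexicon_score(tokens: List[str], lexicon: Dict[str, int]) -> int:
    """Sum the polarity of every token that appears in the lexicon."""
    return sum(lexicon[t] for t in tokens if t in lexicon)


def classify(tokens: List[str], lexicon: Dict[str, int] = TOY_LEXICON) -> str:
    """Map the aggregate score to positive, negative, or neutral."""
    score = lexicon_score(tokens, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Tokens that are absent from the lexicon contribute nothing, which is exactly the coverage limitation discussed for this approach.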

There are many different lexicons available for use in sentiment analysis, each with its strengths and weaknesses. Some popular lexicons include SentiWordNet, the Harvard IV dictionary, and the AFINN lexicon.

The lexicon-based approach is simple to implement and has been widely used in sentiment analysis. However, it has some limitations, such as being limited to the words in the lexicon and not taking into account the context in which words are used. To overcome these limitations, other approaches such as machine learning and deep learning models have been developed.

3.5. SentiWordNet

SentiWordNet is a lexicon for sentiment analysis and opinion mining. It is a lexical resource for the English language, built on top of WordNet, that provides sentiment scores for words and phrases.

SentiWordNet assigns sentiment scores along three dimensions: positivity, negativity, and objectivity. Each entry in the lexicon (a WordNet synset, that is, a group of synonymous word senses) is associated with three scores representing its positivity, negativity, and objectivity, and the three scores sum to 1. The scores are based on the collective sentiment of words that are semantically similar to the word being scored.

SentiWordNet can be used as a resource in sentiment analysis and opinion mining to classify the sentiment of a piece of text into positive, negative, or neutral categories. To do this, the sentiment scores of the words in the text are aggregated to determine the overall sentiment of the text.
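One way to aggregate such score triples is sketched below. The score table is a hypothetical stand-in with one entry per word (the real resource scores individual word senses), objectivity is derived as 1 minus the other two scores, and the averaging rule and margin are assumptions made for the illustration rather than the procedure used in this work.

```python
from typing import Dict, List, Tuple

# Hypothetical (positivity, negativity) scores per word; values are made up.
TOY_SCORES: Dict[str, Tuple[float, float]] = {
    "happy": (0.875, 0.0),
    "awful": (0.0, 0.875),
    "fine": (0.5, 0.125),
    "table": (0.0, 0.0),
}


def objectivity(pos: float, neg: float) -> float:
    """Objectivity is whatever remains once positivity and negativity are removed."""
    return 1.0 - pos - neg


def text_polarity(tokens: List[str],
                  scores: Dict[str, Tuple[float, float]] = TOY_SCORES,
                  margin: float = 0.05) -> str:
    """Average (positivity - negativity) over matched tokens and threshold it."""
    pos_total = neg_total = 0.0
    matched = 0
    for token in tokens:
        if token in scores:
            pos, neg = scores[token]
            pos_total += pos
            neg_total += neg
            matched += 1
    if matched == 0:
        return "neutral"
    diff = (pos_total - neg_total) / matched
    if diff > margin:
        return "positive"
    if diff < -margin:
        return "negative"
    return "neutral"
```

Fully objective words such as "table" in the toy table pull the average toward zero, so a text dominated by them is labeled neutral.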

SentiWordNet has been widely used in sentiment analysis and has been shown to perform well in comparison to other lexicons and machine learning models. It is a valuable resource for researchers and practitioners in the field of sentiment analysis.

4. RESULTS AND ANALYSIS

4.1. Naive Bayes

Naive Bayes is a simple probabilistic classifier based on Bayes' Theorem. It is a popular algorithm in the field of machine learning and is widely used for tasks such as text classification, sentiment analysis, and spam filtering.

The basic idea behind Naive Bayes is to use Bayes' Theorem to calculate the probability of a class (e.g., positive, negative, or neutral sentiment) given a set of features (e.g., words in a text). The algorithm assumes that the features are conditionally independent given the class, meaning that the presence of one feature does not affect the presence of another feature. This is the "naive" part of the algorithm, hence its name.

There are several variants of the Naive Bayes algorithm, including Multinomial Naive Bayes, Bernoulli Naive Bayes, and Gaussian Naive Bayes. Each variant is suited for different types of data and different classification tasks.

Naive Bayes is a fast and effective algorithm for text classification and sentiment analysis. It is simple to implement and requires little data preparation. However, its performance can be limited by the "naive" assumption of independence between features, which is not always accurate in practice. Despite this, Naive Bayes remains a popular and widely used algorithm in the field of text classification and sentiment analysis.
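To make the calculation concrete, the following is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with Laplace (add-alpha) smoothing. It illustrates the technique only and is not the implementation used for the experiments; the class name, the default alpha = 1, and the choice to ignore words never seen in training are assumptions of the sketch.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


class MultinomialNB:
    """Multinomial Naive Bayes over token lists with add-alpha smoothing."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha  # alpha = 1 corresponds to Laplace smoothing

    def fit(self, docs: List[List[str]], labels: List[str]) -> "MultinomialNB":
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def log_posteriors(self, tokens: List[str]) -> Dict[str, float]:
        """Return log P(class) + sum of log P(word | class) for each class."""
        vocab_size = len(self.vocab)
        result = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            logp = math.log(n_docs / self.total_docs)
            for token in tokens:
                if token not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][token]
                logp += math.log((count + self.alpha) /
                                 (total_words + self.alpha * vocab_size))
            result[label] = logp
        return result

    def predict(self, tokens: List[str]) -> str:
        scores = self.log_posteriors(tokens)
        return max(scores, key=scores.get)
```

Without the smoothing term, a word that never occurs with a class would give that class a probability of zero regardless of the other words; adding alpha to every count keeps each conditional probability strictly positive.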

4.2. For Twitter Dataset

We investigate a wide range of features that have a significant impact on sentiment analysis. We have made use of n-gram features such as unigrams (n = 1) and bigrams (n = 2), which are often used in text classification tasks, including sentiment analysis. In the course of our research, we experimented with boolean features using both unigrams and bigrams. Each n-gram feature has an associated boolean value, which is set to true if and only if the corresponding n-gram appears in the tweet [12]. The features that we have employed are outlined in Table 1, along with the accuracy results obtained from each classifier. We also compare this dataset with the one that Pang, Lee, et al. used for their research on movie reviews. According to Table 1, with unigram features the NB classifier with Laplace smoothing achieved higher classification accuracy on tweets than on movie reviews; with the MaxEnt classifier, however, accuracy was higher on movie reviews than on tweets.
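The boolean n-gram features described above can be sketched as follows. The function names and the way the feature vocabulary is passed in are assumptions made for the illustration.

```python
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> List[NGram]:
    """Return all contiguous n-grams of the token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def boolean_features(tokens: List[str],
                     vocabulary: Sequence[NGram]) -> Dict[NGram, bool]:
    """Set each vocabulary n-gram to True iff it appears in the tweet."""
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return {gram: gram in present for gram in vocabulary}
```

For the tweet tokens ["not", "good", "phone"], the bigram ("not", "good") is present even though the unigram ("good",) alone would suggest a positive tweet, which is one reason bigrams are tried alongside unigrams.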

Table 1: Accuracy of tweets using different features

Table 2: F1 score of MNB classifier

The effectiveness of POS features for sentiment analysis has also been evaluated. As a general rule, adjectives are regarded as useful components for sentiment analysis since they serve as reliable indicators of a subject's feelings. Taking into account solely adjectives provides results that are comparable to those produced by employing unigrams and bigrams, as can be seen in row (5) of the results table (Table 1). Row (4) of the table demonstrates that when unigrams and POS are used together as features, all three classifiers generate superior results. Row (1) of the table demonstrates that using SVM with unigrams as features yields the best result out of all the features that were taken into consideration. The detailed results of the MNB classifier may be seen in Table 2, which displays the F1 score. The Receiver Operating Characteristic (ROC) curve of the MNB classifier is shown in Figure 1. This curve is for tweets that have been manually annotated.
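For reference, a per-class F1 score of the kind reported for the MNB classifier can be derived from a confusion matrix as sketched below. The matrix layout (rows are true classes, columns are predicted classes) and the function name are assumptions of the sketch, and any numbers passed to it in an example are illustrative rather than results from this work.

```python
from typing import Dict, List


def per_class_f1(confusion: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Compute F1 per class.

    confusion[i][j] is the number of items whose true class is labels[i]
    and whose predicted class is labels[j].
    """
    scores = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(len(labels))) - tp
        fn = sum(confusion[k]) - tp
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores
```

Because F1 combines precision and recall, it is less forgiving than accuracy when one sentiment class is much rarer than the others.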

Figure 1: ROC curve of MNB classifier for tweets

4.3. Emotion Dataset

Hashtags are often used as a means for people to communicate their thoughts and feelings. Therefore, a satisfactory amount of feelings and sentiments may be gleaned from these hashtagged phrases. These hashtags have been included in our machine-learning algorithm to provide it with more data. Figure 2 depicts a snapshot of the confusion matrix for our emotion dataset's unigram features. Additionally, the F1 score of each class for the unigram feature is shown in this figure. Figure 3 shows the ROC curve that was generated by our classifier.

Figure 2: Snapshot of emotion dataset

Table 3: Accuracy of emotion dataset using different features

Table 4: F1 score of MNB classifier for unigram feature

Figure 3: ROC curve of MNB classifier for an emotion data set

One finding of our experiment is that constructing a dataset by automatically collecting tweets via hashtags shows a clear advantage over a dataset generated by manually annotating tweets. This is because authors are accurate about their own feelings, whereas conventional annotation requires annotators to infer the writers' feelings from the text, which cannot always be done accurately.
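Hashtag-based labeling of this kind can be sketched as follows. The hashtag-to-emotion mapping is hypothetical, as is the rule of discarding tweets that carry no label or conflicting labels; neither is taken from the dataset construction used in this work.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical mapping from hashtags to emotion classes (illustrative only).
HASHTAG_TO_EMOTION: Dict[str, str] = {
    "#happy": "joy", "#joy": "joy",
    "#sad": "sadness", "#angry": "anger", "#scared": "fear",
}

HASHTAG_RE = re.compile(r"#\w+")


def label_by_hashtag(tweet: str,
                     mapping: Dict[str, str] = HASHTAG_TO_EMOTION
                     ) -> Optional[Tuple[str, str]]:
    """Return (text without label hashtags, emotion), or None if unusable."""
    tags = [t.lower() for t in HASHTAG_RE.findall(tweet)]
    emotions = {mapping[t] for t in tags if t in mapping}
    if len(emotions) != 1:
        return None  # no emotion hashtag, or conflicting ones
    # Strip the labelling hashtags so the classifier cannot just read the label.
    text = HASHTAG_RE.sub(
        lambda m: "" if m.group(0).lower() in mapping else m.group(0), tweet)
    return " ".join(text.split()), emotions.pop()
```

Removing the labelling hashtag from the text matters: if it stayed in, a unigram classifier could reach high accuracy simply by memorizing the hashtag.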

5. CONCLUSION

As part of our study, we looked at the difficulties of Sentiment Analysis and the many approaches used in this area. Identifying sentiment in social media data is notoriously challenging due to the data's richness and subtlety. To determine which features are most useful for Sentiment Analysis, we experimented with tweets collected from the public domain. We have used both machine learning and lexicon-based algorithms for SA. The goal of our project was to make the most efficient use of the SentiWordNet lexicon to develop a Twitter Sentiment Analysis platform. Using the SentiWordNet lexicon, we obtained an accuracy of 75.20 percent on our dataset, although we observed that this number varied significantly from one domain to the next. Although the current lexicon contains a huge number of terms with their emotion scores, it lacks specific words that are common in a given domain, so it is preferable to construct a lexicon from the test corpus and use it for classification. Our model, which uses the Google search engine to determine a term's score via pointwise mutual information, outperforms the SentiWordNet lexicon on our dataset and can deal with one of the difficulties of Sentiment Analysis: the unexpected shift from positive to negative sentiment.
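A common way to score a term with pointwise mutual information from search-engine hit counts is to compare how often the term co-occurs with a positive seed word versus a negative one. The sketch below assumes that formulation; the seed-word choice, the smoothing constant, and the function name are assumptions of the example, since the exact scoring used by our model is not spelled out here, and the hit counts are supplied by the caller rather than fetched from any search API.

```python
import math


def semantic_orientation(hits_near_pos: float, hits_near_neg: float,
                         hits_pos: float, hits_neg: float,
                         smoothing: float = 0.01) -> float:
    """Score a term as PMI(term, positive seed) - PMI(term, negative seed).

    With hit counts, the difference of the two PMI values simplifies to
    log2( hits(term, pos) * hits(neg) / (hits(term, neg) * hits(pos)) ).
    hits_pos and hits_neg (the seed words' own counts) must be positive.
    """
    numerator = (hits_near_pos + smoothing) * hits_neg
    denominator = (hits_near_neg + smoothing) * hits_pos
    return math.log2(numerator / denominator)
```

A positive score suggests the term leans toward the positive seed and a negative score toward the negative seed; the small smoothing constant avoids division by zero when a term never co-occurs with one of the seeds.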

REFERENCES

[1] C. Alm, D. Roth, and R. Sproat, “Emotions from text: machine learning for text-based emotion prediction,” in Proceedings of HLT and EMNLP. ACL, 2005, pp. 579–586.

[2] S. Aman and S. Szpakowicz, “Using Roget’s thesaurus for fine-grained emotion recognition,” in Proceedings of IJCNLP, 2008, pp. 296–302.

[3] P. Chesley, B. Vincent, L. Xu, and R. K. Srihari, “Using verbs and adjectives to automatically classify blog sentiment,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, 2006, pp. 27–29.

[4] M. D. Choudhury, S. Counts, and M. Gamon, “Not all moods are created equal! Exploring human emotional states in social media,” in Proceedings of ICWSM, 2012.

[5] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, “Liblinear: A library for large linear classification,” The Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.

[6] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith, “Part-of-speech tagging for Twitter: annotation, features, and experiments,” in Proceedings of HLT: short papers, ser. HLT ’11. Stroudsburg, PA, USA: ACL, 2011, pp. 42–47.

[7] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten, “The weka data mining software: an update,” ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.

[8] G. Mishne, “Experiments with mood classification in blog posts,” in Proceedings of ACM SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access.

[9] S. Mohammad, “#Emotional tweets,” in Proceedings of the Sixth International Workshop on Semantic Evaluation. ACL, 7-8 June 2012, pp. 246–255.

[10] A. Neviarouskaya, H. Prendinger, and M. Ishizuka, “Affect analysis model: A novel rule-based approach to affect sensing from text,” Natural Language Engineering, vol. 17, no. 1, pp. 95–135, 2011.

[11] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” in Proceedings of EMNLP. ACL, 2002, pp. 79–86.

[12] P. Shaver, J. Schwartz, D. Kirson, and C. O’Connor, “Emotion knowledge: Further exploration of a prototype approach.” Journal of Personality and Social Psychology, vol. 52, no. 6, pp. 1061–1086, 1987.

[13] C. Strapparava and R. Mihalcea, “Learning to identify emotions in text,” in Proceedings of the 2008 ACM Symposium on Applied Computing. ACM, 2008, pp. 1556–1560.

[14] C. Strapparava and A. Valitutti, “Wordnet-affect: an affective extension of wordnet,” in Proceedings of LREC, vol. 4. Citeseer, 2004, pp. 1083–1086.

[15] C. Strapparava and R. Mihalcea, “Semeval-2007 task 14: affective text,” in Proceedings of the 4th International Workshop on Semantic Evaluations, ser. SemEval ’07, 2007, pp. 70–74.

[16] R. Tokuhisa, K. Inui, and Y. Matsumoto, “Emotion classification using massive examples extracted from the web,” in Proceedings of COLING. ACL, 2008, pp. 881–888.

[17] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing contextual polarity in phrase-level sentiment analysis,” in Proceedings of HLT and EMNLP. ACL, 2005, pp. 347–354.

[18] I. Witten, E. Frank, and M. Hall, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2011.

[19] C. Yang, K. Lin, and H. Chen, “Emotion classification using web blog corpora,” in IEEE/WIC/ACM International Conference on Web Intelligence. IEEE, 2007, pp. 275–278.
