Evaluation Metrics for Sentiment Analysis: A Comprehensive Review and Future Directions

Palak Kaushal1, Anita Ganpati2

1 Palak Kaushal, Himachal Pradesh University, Shimla, India

2 Anita Ganpati, Himachal Pradesh University, Shimla, India

Abstract - Evaluation metrics are crucial for assessing the performance of sentiment analysis models and ensuring their reliability across applications. This paper examines classification, regression, ranking, and explainability metrics. Each measure has advantages and disadvantages that affect how well it suits sentiment classification and prediction tasks. The measures are compared to give insight into their effectiveness in different sentiment analysis contexts. Future studies should concentrate on fairness-driven and context-aware assessment methods to improve the reliability and interpretability of classification models.

Key Words: Sentiment Analysis, Evaluation Metrics, Classification Metrics, Regression Metrics, Ranking Metrics.

1. INTRODUCTION

Sentiment Analysis, a branch of Natural Language Processing (NLP) [1], aims to interpret and assess emotions, opinions, and attitudes conveyed in textual data [2]. Sentiment analysis is becoming a crucial tool in many fields, such as business intelligence, consumer feedback analysis, and social media monitoring, due to the explosive expansion of digital material on social media, e-commerce platforms, and online reviews. Businesses utilise sentiment analysis to gain insight into public opinion, make better decisions, and improve user experience. Assessing the usefulness and dependability of sentiment classification models is a crucial component of sentiment analysis. How well a model appears to perform on a task is determined by the assessment measures used. Different evaluation criteria are needed depending on whether the task requires classification, regression, or ranking.

Moreover, explainability and fairness measures have become more important for ensuring transparency and unbiased judgements because deep learning and black-box models are used more often in sentiment analysis. This study divides sentiment analysis assessment metrics into four primary categories: classification metrics, regression metrics, ranking metrics, and explainability/fairness measures. By offering an organised summary of these measures, it intends to assist researchers in choosing assessment techniques suited to the nature of their sentiment analysis tasks.

2. LITERATURE REVIEW

Sentiment analysis model assessment has been studied in depth, and several measures have been proposed to evaluate performance on tasks involving classification, regression, and ranking. This section reviews the important research that has influenced the creation of sentiment analysis assessment metrics. The most popular sentiment analysis task is sentiment classification, which assigns text to predetermined sentiment categories. Using accuracy as the core assessment criterion, [2] used conventional learning classifiers such as Naive Bayes, Support Vector Machines (SVM), and Maximum Entropy. Precision, recall, and F1-score have since been adopted as more informative metrics because of criticism of accuracy's shortcomings on imbalanced datasets [3].

Deep learning-based sentiment classification has further emphasized the need for robust evaluation metrics. Long Short-Term Memory (LSTM) networks, introduced by [4], have shown strong performance in capturing contextual dependencies in sentiment classification. The paper [5] evaluates sentiment classification using accuracy, precision, recall, F1-score, and ROC-AUC. The ensemble model outperformed individual classifiers, achieving high F1-score and AUC, reducing misclassification, and improving sentiment detection in Arabic social media text. The paper [6] evaluates sentiment analysis models using accuracy, precision, recall, and F1-score to compare their effectiveness in recognizing emotional content. The results highlight that deep learning models outperform traditional approaches, achieving higher precision and recall, making them more suitable for sentiment classification tasks.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 12 Issue: 04 | Apr 2025 www.irjet.net p-ISSN: 2395-0072

3. SENTIMENT ANALYSIS LEVELS

3.1 Document-Level Sentiment Analysis

Determines the overall sentiment of an entire document (e.g., a product review, blog post) [7].

Use Case: Classifying movie reviews as positive or negative.

Limitation: Fails to detect multiple sentiments in longer texts.

3.2 Sentence-Level Sentiment Analysis

Analyzes sentiment expressed in individual sentences [8].

Use Case: Twitter sentiment classification, headline analysis.

Challenge: Detecting sarcasm or implicit sentiment in short texts.

3.3 Aspect-Level (or Feature-Level) Sentiment Analysis

Identifies sentiment toward specific aspects/features of a product or service within text [9].

Use Case: In a review like “The camera is amazing, but the battery life is poor,” aspect-level analysis can tag "camera" as positive and "battery" as negative.

Strength: Provides granular insights for businesses.

3.4 Phrase-Level Sentiment Analysis

Assigns sentiment polarity to smaller syntactic units such as phrases [10].

Use Case: “Not very good” → negative sentiment at phrase level, though individual words may suggest otherwise.

Challenge: Requires parsing and understanding modifiers and negation.

Table 1: Sentiment analysis levels with their granularity and use cases.

Level | Granularity | Use Case
Document-Level [7] | Entire document | Product reviews
Sentence-Level [8] | Individual sentence | Tweets, headlines
Aspect-Level [9] | Specific feature | Product aspect feedback
Phrase-Level [10] | Word/phrase | Negation handling

4. SENTIMENT ANALYSIS TECHNIQUES

4.1 Lexicon-Based Techniques

Use predefined dictionaries of words in which each word is associated with a sentiment score (positive, negative, neutral) [11].

Types:

o Dictionary-based: Manually curated (e.g., SentiWordNet, NRC).

o Corpus-based: Scores derived from large corpora using statistical or co-occurrence methods.

Strengths: Language-agnostic, interpretable.

Limitations: Struggles with sarcasm, negation, and domain-specific terms.
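The dictionary-based idea and its difficulty with negation can be illustrated with a minimal sketch. The word list, scores, negator set, and function names below are hypothetical and chosen only for illustration; they are not taken from SentiWordNet, NRC, or any method in this review.

```python
# Hypothetical toy lexicon for illustration; real resources such as
# SentiWordNet or NRC are far larger and use their own score scales.
TOY_LEXICON = {"amazing": 1.0, "good": 0.5, "poor": -1.0, "bad": -0.5}
NEGATORS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum word scores, flipping the sign of the word right after a negator."""
    tokens = [tok.strip(".,!?;:\"'").lower() for tok in text.split()]
    score = 0.0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        value = TOY_LEXICON.get(tok, 0.0)
        score += -value if negate else value
        negate = False  # the negator only affects the next token
    return score


def polarity(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity(lexicon_score("not good")))       # negative
print(polarity(lexicon_score("Not very good")))  # positive: "very" absorbs the negation
```

In this sketch the naive one-token negation rule handles "not good" but mislabels "Not very good", which is the kind of modifier and negation problem that phrase-level analysis has to address.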

4.2 Machine Learning-Based Techniques

Use traditional supervised learning algorithms to train sentiment classifiers on labeled data [2].

Common algorithms: Naive Bayes, SVM, Logistic Regression, Decision Trees.

Steps: Feature extraction (e.g., Bag-of-Words, TF-IDF) → Model training → Prediction.

Strengths: Adaptable to specific datasets.

Limitations: Requires large labeled datasets, less interpretable.
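The three steps listed above (feature extraction, model training, prediction) can be sketched with a small multinomial Naive Bayes classifier over bag-of-words counts. This is a simplified illustration written in plain Python under assumed toy training data; the function names and the four example sentences are hypothetical, and a real study would use a library implementation and a proper corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Bag-of-words feature extraction: lowercase whitespace tokens."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, lab in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[lab].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for lab, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)
        total_words = sum(word_counts[lab].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            logp += math.log((word_counts[lab][tok] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = lab, logp
    return best_label


# Hypothetical toy training set.
train_docs = [
    "great movie loved it",
    "wonderful acting great plot",
    "terrible movie hated it",
    "boring plot awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(train_docs, train_labels)

print(predict(model, "great plot"))          # pos
print(predict(model, "awful boring movie"))  # neg
```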

4.3 Deep Learning-Based Techniques

Automatically learn complex patterns in text using neural networks [12].

Popular architectures:

• CNN: Captures local word patterns and phrases.

• RNN/LSTM/GRU: Captures sequential dependencies in text.

• Attention Mechanisms: Focus on important words.

Strengths: Outperforms traditional models, captures context.

Limitations: Computationally expensive, needs large data.

4.4 Transformer-Based Techniques

Use pre-trained language models such as BERT, RoBERTa, and DistilBERT fine-tuned for sentiment classification [13].

Advantage: Understand bidirectional context and semantic relationships.

Examples: BERT fine-tuned on IMDb, SST-2, Twitter sentiment datasets.

Strengths: State-of-the-art accuracy, minimal feature engineering.

Limitations: Requires high computational resources.

4.5 Hybrid Approaches

Combine lexicon-based and machine/deep learning methods to leverage the strengths of both [14].

Use Case: Lexicon helps with interpretability, ML/DL enhances accuracy.

Example: Use lexicon scores as features in an ML classifier.

Table 2: Sentiment analysis techniques with their description and key methods.

Technique | Description | Key Methods
Lexicon-Based [11] | Uses predefined word lists | SentiWordNet, NRC
Machine Learning [15] | Supervised learning with features | Naive Bayes, SVM
Deep Learning [12] | Neural networks for sequential/semantic learning | CNN, LSTM, GRU
Transformer-Based [13] | Contextual language models | BERT, RoBERTa
Hybrid [14] | Combines lexicon + ML/DL | Ensemble, lexicon features

5. PERFORMANCE EVALUATION METRICS

5.1 Classification Metrics [16]

Accuracy

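Accuracy is the most widely used metric in sentiment analysis and measures the proportion of correctly classified instances [17]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.

Accuracy is useful for balanced datasets, but it is unreliable for imbalanced sentiment datasets, where a model can score highly by always predicting the majority class.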
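A small sketch makes the imbalance problem concrete. The 90/10 class split and the label names are assumptions made for illustration, not data from any study cited here.

```python
def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical imbalanced test set: 90 positive and 10 negative reviews.
y_true = ["pos"] * 90 + ["neg"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["pos"] * 100

print(accuracy(y_true, y_pred))  # 0.9
```

The classifier never identifies a single negative review, yet its accuracy is 0.9, which is why accuracy alone is a weak guide on skewed sentiment data.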

Precision, Recall, and F1-Score

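Precision measures how many of the instances predicted as positive are actually positive. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall (or Sensitivity) measures how many of the actual positive instances are correctly classified:

$$\text{Recall} = \frac{TP}{TP + FN}$$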

F1-Score balances precision and recall, providing a single performance measure:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1-score is particularly important for datasets with class imbalance and is widely used in sentiment classification benchmarks [18].
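The three measures can be computed directly from counts of true positives, false positives, and false negatives. The sketch below treats one chosen label as the positive class; the function name and the ten example labels are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels: 4 negative and 6 positive reviews.
y_true = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos", "pos", "pos"]
y_pred = ["neg", "neg", "pos", "pos", "neg", "pos", "pos", "pos", "pos", "pos"]

# Score the minority class: TP = 2, FP = 1, FN = 2.
p, r, f1 = precision_recall_f1(y_true, y_pred, positive="neg")
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Returning 0.0 when a denominator is zero is a convention assumed here; other conventions (for example, reporting the value as undefined) are also in use.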

ROC-AUC

ROC-AUC summarises the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across different classification thresholds [19]. It is useful for comparing model performance across different decision thresholds but may not be well suited to multi-class sentiment classification.
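For binary labels, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that pairwise form rather than tracing the curve threshold by threshold; the scores are hypothetical model outputs.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical binary labels (1 = positive sentiment) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(round(roc_auc(y_true, scores), 3))  # 0.889
```

The pairwise loop is quadratic in the number of examples, so it is meant for illustration; sorting-based implementations are preferable on large test sets.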

5.2 Regression Metrics for Sentiment Scoring[20]

Some tasks assign sentiment scores rather than discrete classes. In such cases, regression metrics are used.

Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)

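MSE measures the mean squared difference between observed and predicted sentiment scores:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the observed score, $\hat{y}_i$ is the predicted score, and $n$ is the number of samples.

RMSE, the square root of MSE, gives an interpretable error in the same units as the original sentiment ratings [21].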

Mean Absolute Error (MAE)

MAE is the mean absolute difference between predicted and observed sentiment ratings [22]. MAE is more robust to outliers than MSE, but it does not penalise large errors as heavily.
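The different behaviour of the squared and absolute error measures with respect to outliers can be seen in a short sketch. The rating values are hypothetical and chosen so that two sets of predictions share the same MAE.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of MSE, in the same units as the ratings."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed sentiment ratings.
y_true = [1.0, 2.0, 3.0, 4.0]
# Predictions that are each off by one point.
spread = [2.0, 3.0, 4.0, 5.0]
# Predictions that are exact except for one four-point outlier.
outlier = [1.0, 2.0, 3.0, 8.0]

print(mae(y_true, spread), mse(y_true, spread), rmse(y_true, spread))     # 1.0 1.0 1.0
print(mae(y_true, outlier), mse(y_true, outlier), rmse(y_true, outlier))  # 1.0 4.0 2.0
```

Both prediction sets have an MAE of 1.0, but the single large error quadruples the MSE, which shows how the squared measures weight outliers more heavily.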

5.3 Ranking Metrics for Sentiment Ordering

Ranking metrics evaluate algorithms that predict sentiment rankings rather than classes.

Spearman’s Rank Correlation Coefficient

Spearman's correlation assesses how consistently predicted sentiment rankings agree with observed rankings [23]. It is helpful when sentiment needs to be ranked rather than classified.
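Spearman's coefficient can be computed as the Pearson correlation of the two rank vectors, with tied values given their average rank. The sketch below follows that definition; the gold ranks and predicted scores are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold sentiment ranks and predicted scores for five reviews.
gold = [1, 2, 3, 4, 5]
predicted = [0.1, 0.4, 0.3, 0.8, 0.9]

print(round(spearman(gold, predicted), 3))  # 0.9
```

Only the second and third reviews are swapped in the predicted ordering, so the coefficient stays close to 1.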

Kendall’s Tau

Kendall's Tau is a rank statistic that evaluates how well predicted sentiment rankings match ground-truth rankings, and it is suited to aspect-based sentiment analysis [24].
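Kendall's Tau compares every pair of items and counts whether the two rankings order the pair the same way (concordant) or the opposite way (discordant). The sketch below implements the simple tau-a variant, which does not correct for ties; the example values are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / number of pairs; tied pairs count as neither."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical gold sentiment ranks and predicted scores for five reviews.
gold = [1, 2, 3, 4, 5]
predicted = [0.1, 0.4, 0.3, 0.8, 0.9]

print(kendall_tau_a(gold, predicted))  # 0.8
```

Of the ten pairs, nine are concordant and one is discordant, giving (9 - 1) / 10 = 0.8.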

5.4 Explainability and Fairness Metrics

SHapley Additive Explanations (SHAP)

SHAP values explain the contribution of each feature to a sentiment classification, which aids the interpretation of model decisions [25]. This increases the model's transparency and trustworthiness.
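SHAP values are based on Shapley values from cooperative game theory. For a very small feature set, Shapley values can be computed exactly by enumerating all feature subsets, as in the sketch below. This is not the SHAP library, which relies on approximations and background data; the toy scoring function and all names here are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(features)
    contributions = {}
    for feature in features:
        others = [f for f in features if f != feature]
        phi = 0.0
        for size in range(len(others) + 1):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                marginal = value_fn(present | {feature}) - value_fn(present)
                phi += weight * marginal
        contributions[feature] = phi
    return contributions


def toy_sentiment(words):
    """Hypothetical model: 'good' is positive unless 'not' is also present."""
    if "good" not in words:
        return 0.0
    return -1.0 if "not" in words else 1.0


values = shapley_values(["not", "good"], toy_sentiment)
print(values)  # {'not': -1.0, 'good': 0.0}
```

The two contributions sum to -1.0, the difference between the model output for the full phrase "not good" and for the empty input, and the negative score is attributed to the word "not".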

Fairness Metrics

Fairness is particularly important when assessing views expressed by individuals. Independent effect and chance variance are two examples of bias-detection measures that evaluate how evenly algorithms treat all sentiment groups [26].

6. COMPARATIVE ANALYSIS OF EVALUATION METRICS FOR SENTIMENT ANALYSIS

Fig. 1: Comparative Analysis of Evaluation Metrics for Sentiment Analysis

The comparison shows that no single metric works best for sentiment analysis. MSE/RMSE are frequently used for sentiment rating estimation, but they are sensitive to outliers, whereas F1-score is better suited to imbalanced classification problems.

Fairness measures and SHAP values improve model interpretability and bias detection, while ranking metrics such as Spearman's correlation are helpful for aspect-based sentiment assessment. A hybrid evaluation strategy that combines several measures gives a more reliable evaluation. To increase the accuracy and robustness of sentiment analysis, future studies need to focus on context-aware, fairness-driven, and domain-specific evaluation methods.

7. CONCLUSION

Sentiment analysis models must be evaluated using the right measures to ensure dependability across domains. This paper discussed the key assessment criteria, with an emphasis on their advantages and disadvantages. Multilingual sentiment, sarcasm, and socioeconomic bias remain major challenges. To increase the effectiveness and fairness of sentiment models, future studies should concentrate on context-aware, bias-mitigating, and hybrid assessment techniques.

8. REFERENCES

[1] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing.

[2] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques.” [Online]. Available: http://reviews.imdb.com/Reviews/

[3] F. Sebastiani, “Machine Learning in Automated Text Categorization,” 2001. [Online]. Available: http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html

[4] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput, vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.

[5] N. Hicham, S. Karim, and N. Habbat, “Customer sentiment analysis for Arabic social media using a novel ensemble machine learning approach,” International Journal of Electrical and Computer Engineering, vol. 13, no. 4, pp. 4504–4515, Aug. 2023, doi: 10.11591/ijece.v13i4.pp4504-4515.

[6] N. S. I. P. and M. P. K. Kyritsis, “A Comparative Performance Evaluation of Algorithms for the Analysis and Recognition of Emotional Content,” Artificial Intelligence, IntechOpen, Jan. 2024.

[7] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” 2008.

[8] B. Liu, “Sentiment Analysis and Opinion Mining,” Morgan & Claypool Publishers, 2012.

[9] M. Pontiki, H. Papageorgiou, D. Galanis, I. Androutsopoulos, J. Pavlopoulos, and S. Manandhar, “SemEval-2014 Task 4: Aspect Based Sentiment Analysis,” 2014. [Online]. Available: http://alt.qcri.

[10] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis,” 2005. [Online]. Available: http://www.cs.pitt.edu/

[11] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, “Lexicon-Based Methods for Sentiment Analysis,” 2011.

[12] D. Tang, B. Qin, and T. Liu, “Document Modeling with Gated Recurrent Neural Network for Sentiment Classification,” Association for Computational Linguistics, 2015. [Online]. Available: http://ir.hit.edu.cn/

[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2019. [Online]. Available: https://github.com/tensorflow/tensor2tensor

[14] O. Araque, I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias, “Enhancing deep learning sentiment analysis with ensemble techniques in social applications,” Expert Syst Appl, vol. 77, pp. 236–246, Jul. 2017, doi: 10.1016/j.eswa.2017.02.002.

[15] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques,” 2002. [Online]. Available: http://reviews.imdb.com/Reviews/

[16] Raghav Aggarwal, “https://www.searchunify.com/sudo-technical-blogs/how-to-measure-the-efficacy-of-your-sentimentanalysis-model/.”

[17] “https://www.linkedin.com/advice/1/how-can-you-evaluate-sentiment-analysis-model-ygfec.”

[18] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Inf Process Manag, vol. 45, no. 4, pp. 427–437, Jul. 2009, doi: 10.1016/J.IPM.2009.03.002.

[19] “Evaluation_From_Precision_Recall_and_F-Factor_to_R(2)”.

[20] “Know The Best Evaluation Metrics for Your Regression Model!”

[21] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature,” Geosci Model Dev, vol. 7, no. 3, pp. 1247–1250, Jun. 2014, doi: 10.5194/gmd-7-1247-2014.

[22] W. Wang and Y. Lu, “Analysis of the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) in Assessing Rounding Model,” in IOP Conference Series: Materials Science and Engineering, Institute of Physics Publishing, Apr. 2018. doi: 10.1088/1757-899X/324/1/012049.

[23] A. C. Leon, “Descriptive and Inferential Statistics,” Comprehensive Clinical Psychology, pp. 243–285, 1998, doi: 10.1016/B0080-4270(73)00264-9.

[24] M. G. Kendall, “A New Measure of Rank Correlation,” Biometrika, vol. 30, no. 1–2, pp. 81–93, Jun. 1938.

[25] S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions.” [Online]. Available: https://github.com/slundberg/shap

[26] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2021.