
International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 12 Issue: 04 | Apr 2025 www.irjet.net p-ISSN: 2395-0072
Palak Kaushal1, Anita Ganpati2
1Himachal Pradesh University, Shimla, India
2Himachal Pradesh University, Shimla, India
Abstract - Evaluation metrics are crucial for assessing the performance of sentiment analysis models and for guaranteeing their reliability in various applications. This research thoroughly examines classification, regression, ranking, and explainability metrics. Every measure has advantages and disadvantages that affect how well it serves sentiment classification and forecasting tasks. These measures are compared, providing insight into their effectiveness in various sentiment analysis contexts. Future studies should concentrate on fairness-driven and context-aware assessment methods to improve the reliability and interpretability of classification models.
Key Words: Sentiment Analysis, Evaluation Metrics, Classification Metrics, Regression Metrics, Ranking Metrics.
Sentiment Analysis, a branch of Natural Language Processing (NLP) [1], aims to interpret and assess emotions, opinions, and attitudes conveyed in textual data [2]. Sentiment analysis is becoming a crucial tool in many fields, such as business intelligence, consumer feedback analysis, and social media monitoring, due to the explosive expansion of digital material on social media, e-commerce platforms, and online reviews. Businesses utilise sentiment analysis to gain insight into public opinion, make better decisions, and improve user experience. Assessing the usefulness and dependability of sentiment classification models is a crucial component of sentiment analysis. A model's performance on numerous tasks is determined by the assessment measures it uses. Different evaluation criteria are needed depending on whether the task requires classification, regression, or ranking.
Moreover, explainability and fairness measures have become more crucial for ensuring transparency and objective judgements because advanced learning and black-box models are used more often in sentiment analysis. This study divides sentiment analysis assessment metrics into four primary categories: classification metrics, regression metrics, ranking metrics, and explainability/fairness measurements. By offering an organised summary of these measures, this study intends to assist researchers in choosing suitable assessment techniques for the nature of their sentiment analysis tasks.
Sentiment analysis model assessment has been deeply studied, and several measures have been put forward to evaluate performance on tasks involving classification, regression, and ranking. This section reviews the important research that has influenced the creation of sentiment analysis assessment metrics. The most common sentiment analysis task is sentiment classification, which assigns text to predetermined sentiment categories. Using accuracy as the core assessment criterion, [2] used conventional learning classifiers such as Naive Bayes, Support Vector Machines (SVM), and Maximum Entropy. Precision, recall, and F1-score have since been adopted as more informative metrics because of criticism of accuracy's shortcomings on unbalanced datasets [3].
Deep learning-based sentiment classification has further emphasized the need for robust evaluation metrics. Long Short-Term Memory (LSTM) networks, introduced by [4], established strong performance in capturing contextual dependencies in sentiment classification. The paper [5] evaluates sentiment classification using accuracy, precision, recall, F1-score, and ROC-AUC; the ensemble model outperformed individual classifiers, achieving a high F1-score and AUC, reducing misclassification, and improving sentiment detection in Arabic social media text. The paper [6] evaluates sentiment analysis models using accuracy, precision, recall, and F1-score to compare their effectiveness in recognizing emotional content. The results highlight that deep learning models beat traditional approaches, achieving higher precision and recall, making them more suitable for sentiment classification tasks.
3.1 Document-Level Sentiment Analysis
Determines the overall sentiment of an entire document (e.g., a product review or blog post) [7].
Use Case: Classifying movie reviews as positive or negative.
Limitation: Fails to detect multiple sentiments in longer texts.
3.2 Sentence-Level Sentiment Analysis
Analyzes sentiment expressed in individual sentences [8].
Use Case: Twitter sentiment classification, headline analysis.
Challenge: Detecting sarcasm or implicit sentiment in short texts.
3.3 Aspect-Level (or Feature-Level) Sentiment Analysis
Identifies sentiment toward specific aspects/features of a product or service within text [9].
Use Case: In a review like “The camera is amazing, but the battery life is poor,” aspect-level analysis can tag "camera" as positive and "battery" as negative.
Strength: Provides granular insights for businesses.
3.4 Phrase-Level Sentiment Analysis
Assigns sentiment polarity to smaller syntactic units like phrases [10].
Use Case: “Not very good” → negative sentiment at phrase level, though individual words may suggest otherwise.
Challenge: Requires parsing and understanding modifiers and negation.
Table 1: Levels of sentiment analysis with granularity and use cases.

Level                  | Granularity         | Use Case
Document-Level [7]     | Entire document     | Product reviews
Sentence-Level [8]     | Individual sentence | Tweets, headlines
Aspect-Level [9]       | Specific feature    | Product aspect feedback
Phrase-Level [10]      | Word/phrase         | Negation handling

4.1 Lexicon-Based Techniques
Use predefined dictionaries of words where each word is associated with a sentiment score (positive, negative, neutral) [11].
Types:
o Dictionary-based: Manually curated (e.g., SentiWordNet, NRC).
o Corpus-based: Scores derived from large corpora using statistical or co-occurrence methods.
Strengths: Language-agnostic, interpretable.
Limitations: Struggles with sarcasm, negation, and domain-specific terms.
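The dictionary-lookup idea above can be sketched in a few lines; the toy lexicon and negation list below are hypothetical stand-ins for real resources such as SentiWordNet or NRC:

```python
# Minimal lexicon-based sentiment scorer (illustrative sketch;
# TOY_LEXICON is an invented dictionary, not a real resource).
TOY_LEXICON = {"amazing": 1.0, "good": 0.5, "poor": -0.8, "terrible": -1.0}
NEGATORS = {"not", "never", "no"}

def lexicon_score(text: str) -> float:
    """Sum sentiment scores of known words, flipping after a negator."""
    score, negate = 0.0, False
    for tok in text.lower().split():
        if tok in NEGATORS:
            negate = True
            continue
        if tok in TOY_LEXICON:
            score += -TOY_LEXICON[tok] if negate else TOY_LEXICON[tok]
            negate = False
    return score

print(lexicon_score("the camera is amazing"))  # 1.0 (positive)
print(lexicon_score("not very good"))          # -0.5 (negation flips "good")
```

Even this toy version shows why lexicon methods struggle: sarcasm and domain-specific words fall outside the dictionary entirely.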
4.2 Machine Learning-Based Techniques
Use traditional supervised learning algorithms to train sentiment classifiers on labeled data [2].
Common algorithms: Naive Bayes, SVM, Logistic Regression, Decision Trees.
Steps: Feature extraction (e.g., Bag-of-Words, TF-IDF) → Model training → Prediction.
Strengths: Adaptable to specific datasets.
Limitations: Requires large labeled datasets, less interpretable.
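The feature extraction → training → prediction pipeline can be sketched with a from-scratch multinomial Naive Bayes over bag-of-words counts; the four training documents are invented for illustration, and a real pipeline would use a library such as scikit-learn:

```python
import math
from collections import Counter, defaultdict

# Tiny multinomial Naive Bayes sentiment classifier with add-one
# (Laplace) smoothing, operating on bag-of-words counts.
class NaiveBayes:
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n = sum(self.label_counts.values())
        for label, count in self.label_counts.items():
            lp = math.log(count / n)  # log prior P(label)
            total = sum(self.word_counts[label].values())
            for w in doc.lower().split():
                # add-one smoothed likelihood P(word | label)
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes().fit(
    ["great movie loved it", "awful boring film",
     "loved the acting", "boring and awful plot"],
    ["pos", "neg", "pos", "neg"])
print(nb.predict("loved this great film"))  # "pos"
```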
4.3 Deep Learning-Based Techniques
Automatically learn complex patterns in text using neural networks [12].
Popular architectures:
CNN: Captures local word patterns and phrases.
RNN/LSTM/GRU: Captures sequential dependencies in text.
Attention Mechanisms: Focus on important words.
Strengths: Outperforms traditional models, captures context.
Limitations: Computationally expensive, needs large data.
4.4 Transformer-Based Techniques
Use pre-trained language models like BERT, RoBERTa, and DistilBERT fine-tuned for sentiment classification [13].
Advantage: Understand bidirectional context and semantic relationships.
Examples: BERT fine-tuned on IMDb, SST-2, and Twitter sentiment datasets.
Strengths: State-of-the-art accuracy, minimal feature engineering.
Limitations: Requires high computational resources.
4.5 Hybrid Approaches
Combine lexicon-based and machine/deep learning methods to leverage the strengths of both [14].
Use Case: Lexicon helps with interpretability; ML/DL enhances accuracy.
Example: Use lexicon scores as features in an ML classifier.
Table 2: Techniques with descriptions and key methods.

Technique              | Description                                      | Key Methods
Lexicon-Based [11]     | Uses predefined word lists                       | SentiWordNet, NRC
Machine Learning [15]  | Supervised learning with features                | Naive Bayes, SVM
Deep Learning [12]     | Neural networks for sequential/semantic learning | CNN, LSTM, GRU
Transformer-Based [13] | Contextual language models                       | BERT, RoBERTa
Hybrid [14]            | Combines lexicon + ML/DL                         | Ensemble, lexicon features

5.1 Classification Metrics [16]

Accuracy
Accuracy is the most widely used metric in sentiment analysis and measures the proportion of correctly classified instances [17]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is useful for balanced datasets, but it is unreliable for imbalanced sentiment datasets.
Precision, Recall, and F1-Score
Precision measures how many predicted positive instances are actually positive:

Precision = TP / (TP + FP)

Recall (or Sensitivity) evaluates how many actual positive instances are correctly classified:

Recall = TP / (TP + FN)
F1-Score balances precision and recall, providing a single performance measure:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

F1-score is particularly important for datasets with class imbalances and is widely used in sentiment classification benchmarks [18].
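The three formulas can be computed together from confusion-matrix counts; the counts below (TP = 80, FP = 20, FN = 40) are assumed values for illustration:

```python
# Precision, recall, and F1 from binary confusion-matrix counts.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = prf1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# precision = 80/100 = 0.800, recall = 80/120 ≈ 0.667, f1 ≈ 0.727
```

Note how F1 sits between precision and recall but closer to the smaller of the two, which is why it is preferred on imbalanced data.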
ROC-AUC
ROC-AUC appraises the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across different classification thresholds [19]. It is useful for comparing model performance across different decision boundaries but may not be well-suited for multi-class sentiment classification.
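ROC-AUC also has a rank interpretation: it equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. A pure-Python sketch of that interpretation, with illustrative labels and scores (libraries such as scikit-learn compute the same value efficiently):

```python
# ROC-AUC via its rank interpretation: the fraction of
# (positive, negative) pairs the model orders correctly,
# counting ties as half-correct.
def roc_auc(labels, scores):
    pairs, wins = 0, 0.0
    for li, si in zip(labels, scores):
        if li != 1:
            continue
        for lj, sj in zip(labels, scores):
            if lj != 0:
                continue
            pairs += 1
            if si > sj:
                wins += 1.0
            elif si == sj:
                wins += 0.5
    return wins / pairs

print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```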
5.2 Regression Metrics for Sentiment Scoring [20]
Some tasks assign sentiment scores rather than discrete classes. In such cases, regression metrics are used.
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
MSE measures the mean squared difference between observed and predicted sentiment scores:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

RMSE, the square root of MSE, provides a measure of error in the same units as the original sentiment ratings [21].
Mean Absolute Error (MAE)
MAE computes the average absolute difference between estimated and observed sentiment ratings [22]. MAE is less sensitive to outliers than MSE, but it does not penalize large errors as heavily.
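The three regression metrics can be computed side by side; the true and predicted sentiment ratings below are assumed values for illustration:

```python
import math

# MSE, RMSE, and MAE for sentiment score prediction.
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    return mse, rmse, mae

mse, rmse, mae = regression_metrics([4.0, 2.0, 5.0], [3.5, 2.5, 4.0])
print(mse, rmse, mae)  # errors 0.5, -0.5, 1.0 -> MSE 0.5, MAE ≈ 0.667
```

The single 1.0 error dominates MSE (squaring) more than MAE, which illustrates the outlier-sensitivity contrast described above.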
5.3 Ranking Metrics for Sentiment Ordering
Ranking metrics evaluate algorithms that predict sentiment rankings rather than class labels.
Spearman’s Rank Correlation Coefficient
Spearman's correlation assesses the degree to which predicted sentiment rankings remain consistent with observed rankings [23]. It is helpful when sentiment needs to be ranked rather than classified.
Kendall’s Tau
Kendall's Tau is a ranking statistic that evaluates how well predicted sentiment rankings match ground-truth rankings, making it well suited for aspect-based sentiment analysis [24].
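Both rank statistics can be sketched in pure Python for the tie-free case (library implementations such as SciPy's `spearmanr` and `kendalltau` handle ties):

```python
# Spearman's rho: 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
# rank difference of item i between the two orderings (no ties assumed).
def spearman_rho(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Kendall's tau: (concordant - discordant pairs) / total pairs.
def kendall_tau(x, y):
    n, concordant, discordant = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(spearman_rho([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (identical rankings)
print(kendall_tau([1, 2, 3, 4], [2, 1, 3, 4]))   # one swapped pair lowers tau
```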
5.4 Explainability and Fairness Metrics
SHapley Additive Explanations (SHAP)
SHAP values explain the contribution of each feature to sentiment categorisation, which aids in the interpretation of model decisions [25]. This raises the model's trustworthiness and transparency.
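The idea behind SHAP can be illustrated with an exact, brute-force Shapley computation over a small hypothetical scoring function; the feature names and weights below are invented, and the real `shap` library approximates this efficiently for large models:

```python
from itertools import combinations
from math import factorial

# Exact Shapley values by enumerating all feature coalitions:
# each feature's value is its weighted average marginal contribution.
def shapley_values(features, value_fn):
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[f] += weight * (value_fn(set(subset) | {f})
                                    - value_fn(set(subset)))
    return phi

# Hypothetical additive sentiment score: each present word adds its weight.
WEIGHTS = {"amazing": 0.7, "camera": 0.1, "poor": -0.6}
score = lambda present: sum(WEIGHTS[f] for f in present)

print(shapley_values(list(WEIGHTS), score))
# For an additive function, each feature's Shapley value equals its weight.
```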
Fairness Metrics
Fairness is particularly important when assessing opinions from different groups of individuals. Bias-detection measures such as disparate impact and equalized odds evaluate how evenly algorithms treat all sentiment groups [26].
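As one concrete bias-detection measure, a disparate-impact ratio compares the positive-prediction rate between two groups; the predictions and group labels below are invented for illustration (a ratio near 1.0 suggests similar treatment across groups):

```python
# Disparate-impact ratio: positive-prediction rate of one group
# divided by that of another group.
def disparate_impact(preds, groups, group_a, group_b):
    def positive_rate(g):
        vals = [p for p, grp in zip(preds, groups) if grp == g]
        return sum(vals) / len(vals)
    return positive_rate(group_a) / positive_rate(group_b)

preds  = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = predicted positive sentiment
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(disparate_impact(preds, groups, "b", "a"))  # 0.25 / 0.75 ≈ 0.333
```

A common rule of thumb treats ratios below roughly 0.8 as a signal of potential bias worth investigating.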
4. COMPARATIVE ANALYSIS OF EVALUATION METRICS FOR SENTIMENT ANALYSIS
Fig. 1: Comparative Analysis of Evaluation Metrics for Sentiment Analysis
The comparative analysis demonstrates that no single metric works best for analysing sentiment. MSE/RMSE are frequently used for sentiment score estimation but are sensitive to outliers, whereas F1-score is the best choice for imbalanced classification problems.
While fairness measures and SHAP values improve model interpretability and bias detection, ranking metrics such as Spearman's correlation are helpful for aspect-based sentiment assessment. A hybrid evaluation strategy that incorporates many measures guarantees a more accurate evaluation. To increase the accuracy and resilience of sentiment analysis, future studies need to focus on context-aware, fairness-driven, and domain-specific evaluation methods.
Sentiment analysis models must be evaluated using the right measures to guarantee dependability across various domains. This research discussed key assessment criteria, with an emphasis on their advantages and disadvantages. Multilingual sentiment, sarcasm, and demographic disparity are still major issues. To increase the effectiveness and fairness of sentiment models, future studies should concentrate on context-aware, bias-mitigating, and hybrid assessment techniques.
[1] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[2] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques,” 2002. [Online]. Available: http://reviews.imdb.com/Reviews/
[3] F. Sebastiani, “Machine Learning in Automated Text Categorization,” 2001. [Online]. Available: http://liinwww.ira.uka.de/bibliography/Ai/automated.text.categorization.html
[4] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997, doi: 10.1162/neco.1997.9.8.1735.
[5] N. Hicham, S. Karim, and N. Habbat, “Customer sentiment analysis for Arabic social media using a novel ensemble machine learning approach,” International Journal of Electrical and Computer Engineering, vol. 13, no. 4, pp. 4504–4515, Aug. 2023, doi: 10.11591/ijece.v13i4.pp4504-4515.
[6] N. S. I. P. and M. P. K. Kyritsis, “A Comparative Performance Evaluation of Algorithms for the Analysis and Recognition of Emotional Content,” Artificial Intelligence, IntechOpen, Jan. 2024.
[7] B. Pang and L. Lee, “Opinion mining and sentiment analysis,” 2008.
[8] B. Liu, Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers, 2012.
[9] M. Pontiki, H. Papageorgiou, D. Galanis, I. Androutsopoulos, J. Pavlopoulos, and S. Manandhar, “SemEval-2014 Task 4: Aspect Based Sentiment Analysis,” 2014. [Online]. Available: http://alt.qcri.
[10] T. Wilson, J. Wiebe, and P. Hoffmann, “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis,” 2005. [Online]. Available: http://www.cs.pitt.edu/
[11] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede, “Lexicon-Based Methods for Sentiment Analysis,” 2011.
[12] D. Tang, B. Qin, and T. Liu, “Document Modeling with Gated Recurrent Neural Network for Sentiment Classification,” Association for Computational Linguistics, 2015. [Online]. Available: http://ir.hit.edu.cn/
[13] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” 2019. [Online]. Available: https://github.com/tensorflow/tensor2tensor
[14] O. Araque, I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias, “Enhancing deep learning sentiment analysis with ensemble techniques in social applications,” Expert Syst. Appl., vol. 77, pp. 236–246, Jul. 2017, doi: 10.1016/j.eswa.2017.02.002.
[15] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques,” 2002. [Online]. Available: http://reviews.imdb.com/Reviews/
[16] R. Aggarwal, “How to Measure the Efficacy of Your Sentiment Analysis Model.” [Online]. Available: https://www.searchunify.com/sudo-technical-blogs/how-to-measure-the-efficacy-of-your-sentimentanalysis-model/
[17] “How Can You Evaluate a Sentiment Analysis Model?” [Online]. Available: https://www.linkedin.com/advice/1/how-can-you-evaluate-sentiment-analysis-model-ygfec
[18] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Inf. Process. Manag., vol. 45, no. 4, pp. 427–437, Jul. 2009, doi: 10.1016/j.ipm.2009.03.002.
[19] D. M. W. Powers, “Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness and Correlation.”
[20] “Know The Best Evaluation Metrics for Your Regression Model!”
[21] T. Chai and R. R. Draxler, “Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature,” Geosci. Model Dev., vol. 7, no. 3, pp. 1247–1250, Jun. 2014, doi: 10.5194/gmd-7-1247-2014.
[22] W. Wang and Y. Lu, “Analysis of the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) in Assessing Rounding Model,” in IOP Conference Series: Materials Science and Engineering, Institute of Physics Publishing, Apr. 2018, doi: 10.1088/1757-899X/324/1/012049.
2025, IRJET | Impact Factor value: 8.315 | ISO 9001:2008 Certified Journal | Page881
[23] A. C. Leon, “Descriptive and Inferential Statistics,” Comprehensive Clinical Psychology, pp. 243–285, 1998, doi: 10.1016/B0080-4270(73)00264-9.
[24] M. G. Kendall, “A New Measure of Rank Correlation,” Biometrika, vol. 30, no. 1–2, pp. 81–93, Jun. 1938.
[25] S. M. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions.” [Online]. Available: https://github.com/slundberg/shap
[26] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning: Limitations and Opportunities. MIT Press, 2021.