A Review of Stock Price Prediction Using Machine Learning Techniques

Tan Chun Fui¹, Tan Lay Hong², Ajay Kumar Singh³

¹ Senior Lecturer, Faculty of Information Science Technology, Multimedia University, Jalan Ayer Keroh Lama, Melaka, Malaysia

² Senior Lecturer, Universiti Teknikal Malaysia Melaka (UTeM), Fakulti Pengurusan Teknologi Dan Teknousahawanan (FPTT), Centre of Technopreneurship Development (CTeD), 75450 Ayer Keroh, Melaka, Malaysia

³ Professor, Electronics and Communication Engineering, NIIT University, Alwar, Rajasthan, India.

Abstract - This paper reviews existing literature on predicting stock prices using machine learning techniques, emphasizing the growing importance of accurate stock price predictions for making informed financial decisions. The primary goal of this study is to provide practical guidelines for beginners entering the field of machine learning for stock price prediction, rather than only presenting machine learning theory or comparing the advantages and disadvantages of algorithms. The review encompasses various academic journals. For the selection of research papers, the study focuses on publications from 2010 to 2023, a period marked by significant advancements in machine learning. The 24 research papers were chosen because they implement different machine learning methods for stock price prediction and report results, data processing steps, or algorithms. By examining the various machine learning methods employed in stock price prediction and their implementation details, this review aims to distill actionable insights for newcomers. It summarizes key findings and extracts practical guidance, providing novice practitioners with a structured entry point into machine learning for stock price prediction. Additionally, the paper acknowledges the limitations of current research and suggests potential areas for future exploration, offering an informative resource for those venturing into stock price prediction using machine learning.

Key Words: Stock price predictions · Machine learning techniques · Real-time market data · Financial indicators · Predictive models.

1. INTRODUCTION

The stock market is vital for our economy, influencing how businesses are run and helping people manage their money. It can encourage companies to make better long-term decisions, but it also carries significant risk [1]. Market volatility poses challenges for investors and companies, prompting researchers to develop predictive methods to aid wise investment decisions and minimize losses. This is an important and active area of research aimed at minimizing financial risk [1].

The motivation behind this study lies in the recognition of the growing importance of precise stock price forecasts and the increasing role of machine learning in this endeavor. With rapid advancements in machine learning technology, there emerges an opportunity not only to understand the theoretical underpinnings of these techniques but also to provide practical guidance for aspiring practitioners. While previous research has contributed valuable insights into machine learning for stock price prediction, there remains a distinct need to distill this knowledge into a cohesive set of guidelines that can serve as a structured entry point for newcomers in the field. Therefore, the main objective of this paper is to provide practical guidelines for beginners entering the field of machine learning for stock price prediction, rather than only presenting machine learning theory or comparing the advantages and disadvantages of algorithms.

This research paper aims to bridge the gap between theoretical knowledge and practical implementation in the domain of stock price prediction using machine learning techniques. The primary objective is to conduct a review of 24 research papers published between 2010 and 2023, focusing on their implementation of diverse machine learning methods, their data processing techniques, and their presentation of results and algorithms. These papers have been meticulously selected to provide a comprehensive understanding of the field.

The subsequent sections of this research paper will comprise a thorough exploration of the relevant literature, covering fundamental concepts in stock price markets, the intricacies of machine learning, evaluation metrics, and data processing techniques. Following the literature review, the methodology section will explain the approach taken to conduct this study, including detailed explanations of paper selection criteria and the process for extracting valuable information from each selected research paper.

The discussion section will present the findings from the reviewed research papers, synthesising their results, insights, and methodologies. Finally, the conclusion section will summarize the key takeaways, highlight the limitations of existing research, and propose potential avenues for future exploration.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 12 Issue: 04 | Apr 2025 www.irjet.net p-ISSN: 2395-0072

2. LITERATURE REVIEW

This section explores stock-related terms, common methods for analysing stock prices based on historical data, machine learning techniques for predicting stock prices, and techniques for assessing the accuracy of machine learning methods.

A. Stock

The stock market is a place where shares of publicly held companies are bought and sold, involving both formal exchanges and over-the-counter marketplaces. It facilitates interaction between buyers and sellers, enables price discovery, and serves as an indicator of the economy's health [2].

The stock market performs several vital functions, such as ensuring transparency in prices, maintaining liquidity, and enabling fair dealings. It caters to various types of participants, including investors, traders, market makers, speculators, and hedgers [2].

The stock market holds great significance in a free-market economy. It allows companies to raise capital by offering shares to investors. Investors, in turn, get the opportunity to participate in a company's financial success, earning profits through capital gains and dividends. The stock market also plays a crucial role in channeling savings and investments into productive ventures, contributing to the overall economic growth of the country. Stockbrokers, portfolio managers, and investment bankers are essential in helping investors navigate the stock market by facilitating stock transactions and representing companies in various financial activities [2].

B. Stock Price Analysis

Stock analysis is a method used by investors and traders to make smart decisions about buying and selling stocks. It involves looking at past and present data to determine the real value of a stock and gain an advantage in the markets. Investors use things like financial statements, stock price movements, market indicators, and industry trends to help them make these decisions.

However, stock analysis has some limitations. First, it relies on historical information, and the future can be unpredictable, which makes projections uncertain. Also, some companies may not share all the important information, and analysts might have biases that affect their analysis. Plus, stock analysis is complex and time-consuming, requiring constant monitoring of changing factors [3].

There are two main methods for performing stock analysis: fundamental analysis and technical analysis. Fundamental analysis involves looking at financial statements, economic reports, company assets, and market share to see how healthy a company is and how much it might grow. On the other hand, technical analysis focuses on past and present price movements to predict future trends. This method uses charts and technical indicators to help with the predictions.

There is no one-size-fits-all answer to the question of which stock analysis technique is best. Different techniques work for different investors and situations. Some people like to use a mix of fundamental, technical, and quantitative analysis to make the best decisions [3].

Another question people often have is how to know if a stock's price will go up. It's tough to predict exactly what a stock will do. But investors can analyze information about the stock, like its fair value and how people feel about it in the market, to make smarter choices [3].

A good way to begin stock analysis is to gather public information about a company, such as its financial statements, news articles, and how it compares to other companies in the same industry. This can give insight into how the company is doing.

To research stocks before buying, it's essential to collect a lot of information. Some documents to research include government filings, news, what people are saying on social media, and, of course, the company's financial statements. It can also be helpful to check what other analysts are saying about the stock. This thorough research process will help form a well-informed investment decision.

C. Fundamental Analysis

Fundamental analysis is a comprehensive approach used to evaluate investments by examining publicly available financial data in order to determine whether a stock or security is fairly valued by the market. This analysis is conducted from a macro to micro perspective, beginning with an assessment of the overall state of the economy, followed by an evaluation of the strength of the specific industry, and culminating in a detailed examination of the financial performance of the company issuing the stock. The main objective of fundamental analysis is to determine a reasonable market value for the stock based on its underlying financial data and growth potential.

Investors employ fundamental analysis for several reasons. Firstly, it aids in the identification of stocks that may be undervalued or overvalued by the market, offering potential buying or selling opportunities. Secondly, it provides valuable insights into the financial health of a company and its growth prospects, enabling informed investment decisions. Moreover, fundamental analysis facilitates a comparison of a company's performance with that of similar companies in the industry [4].

When conducting fundamental analysis, analysts utilize a combination of quantitative and qualitative data. Quantitative fundamentals involve numerical data such as revenue, earnings, profit margins, and various financial ratios. On the other hand, qualitative fundamentals encompass aspects such as the company's business model, competitive advantage, quality of management, corporate governance policies, and prevailing industry conditions.

To perform fundamental analysis, analysts rely heavily on financial statements like income statements, balance sheets, and cash flow statements. These financial documents provide crucial information about a company's financial performance over a specific period. Additionally, analysts may consult government agency reports on industries and the economy, as well as market reports, which serve as valuable tools in their analysis [4].

Below is a list of example tools used in fundamental analysis:

• Financial statements: Income statements, balance sheets, and cash flow statements.

• Financial ratios: Key metrics derived from financial statements, such as price-to-earnings ratio (P/E), return on equity (ROE), and debt-to-equity ratio.

• Government agency reports: Economic indicators like consumer price index, gross domestic product growth, and interest rates.

• Industry analysis: Reports and metrics specific to the industry in which the company operates.

• Corporate governance assessment: An evaluation of a company's policies and practices with a focus on transparency and shareholder interests.

• Company reports and press releases: Valuable insights into the company's activities, goals, and overall performance.

D. Stock Technical Analysis

Technical analysis involves using various tools and charting techniques to evaluate investments by analyzing statistical trends, such as price movement and trading volume. This method helps traders identify short-term trading opportunities and assess a security's strength or weakness in comparison to the broader market.

Professional analysts often combine technical analysis with other research methods, while retail traders may rely solely on price charts and statistics. Technical analysis is applicable to any security with historical trading data, including stocks, futures, commodities, fixed-income securities, and currencies [5].

Some examples of technical analysis indicators and their uses include:

• Price trends: Identifying upward, downward, or sideways movements in prices.

• Chart patterns: Recognizing formations like head and shoulders, double tops, or triangles.

• Volume and momentum indicators: Analyzing trading volume and price momentum to validate trends.

• Oscillators: Indicating overbought or oversold conditions in the market.

• Moving averages: Smoothing price data to reveal trends over specific periods.

• Support and resistance levels: Identifying price levels where securities tend to rebound or stall.

Fundamental analysis differs from technical analysis as it focuses on evaluating a company's financial statements, economic conditions, and management to determine the intrinsic value of a stock. In contrast, technical analysis primarily analyzes price and volume data, assuming that all known fundamentals are already reflected in the stock's price.

However, technical analysis has its limitations. Critics argue that it may not always provide actionable information, a view consistent with the weak and semi-strong forms of the Efficient Market Hypothesis (EMH), under which past price information is already reflected in the current price. Historical price patterns may not accurately predict future movements, and technical analysis signals alone do not determine the long-term price trajectory of an asset.

In conclusion, while technical analysis is a valuable tool for traders and analysts, it should be used in conjunction with other research methods to make well-informed investment decisions.

E. Short-Term Investor

Short-term or day traders are individuals who take advantage of quick price movements in financial assets, such as stocks. They aim to profit from short-term swings in the market, typically holding positions for a brief period, often within the same trading day.

Day traders need to act swiftly and decisively. They closely monitor the market, execute trades promptly, and stay mindful of potential risks. Since their trades are short-term, they rely heavily on technical analysis using various indicators to make timely decisions [6].

Here are some common technical indicators used by day traders:

• Moving Averages: Day traders often use Moving Averages (MA) to guide their trading decisions. MAs help identify trends by smoothing out price fluctuations over a specific period. Combining different types of MAs, like simple, exponential, weighted, or smoothed, can offer valuable insights into potential entry and exit points [6].

• Relative Strength Index (RSI): RSI is a momentum oscillator used to measure the speed and change of price movements. It ranges from 0 to 100, with readings above 70 indicating overbought conditions and readings below 30 indicating oversold conditions. Day traders use RSI to spot possible buying or selling opportunities based on overbought and oversold levels [6].

• Stochastics: This momentum oscillator assesses the closing price's location relative to the high-low range over a set number of periods. It helps day traders determine overbought and oversold conditions, similar to RSI. Readings above 80 suggest overbought, while readings below 20 suggest oversold conditions [6].

• Average Directional Movement Index (ADX): ADX consists of plus and minus directional indicators. It helps determine whether a trend is forming, which is crucial for identifying potential trading opportunities during breakouts [6].

• Bollinger Bands: Bollinger Bands consist of bands placed above and below the moving average. These bands expand and contract with changing market volatility. A move outside the bands is significant and can signal potential trading opportunities [6].
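Three of the indicators listed above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the 14-period default, and the two-standard-deviation band width are assumptions made here, and the RSI below uses simple averages of gains and losses instead of the smoothed averages found in many charting packages.

```python
import numpy as np


def rsi(closes, period=14):
    """RSI over the last `period` price changes, using simple averages."""
    changes = np.diff(np.asarray(closes, dtype=float))[-period:]
    avg_gain = np.clip(changes, 0.0, None).mean()
    avg_loss = np.clip(-changes, 0.0, None).mean()
    if avg_loss == 0.0:
        return 100.0  # no losses in the window (a flat window also lands here)
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)


def stochastic_k(closes, highs, lows, period=14):
    """%K: position of the latest close within the recent high-low range."""
    highest = np.max(np.asarray(highs, dtype=float)[-period:])
    lowest = np.min(np.asarray(lows, dtype=float)[-period:])
    if highest == lowest:
        return 50.0  # flat range; the midpoint is an arbitrary choice in this sketch
    return 100.0 * (closes[-1] - lowest) / (highest - lowest)


def bollinger_bands(closes, window=20, num_std=2.0):
    """Return (lower, middle, upper) bands for the most recent window."""
    recent = np.asarray(closes, dtype=float)[-window:]
    middle = recent.mean()
    spread = num_std * recent.std()
    return middle - spread, middle, middle + spread
```

With these helpers, an RSI above 70 or a %K above 80 would be read as overbought in the sense described above, and a close outside the returned lower or upper band would be the "move outside the bands" that the last bullet refers to.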

However, it's important to recognize that these indicators have limitations. They cannot predict future price movements with certainty, and solely relying on them may lead to missed information or false signals. Day traders must also consider external factors like news events and market sentiment, which can influence short-term price movements. Additionally, high-frequency trading can increase transaction costs and may not be suitable for all traders due to the need for constant monitoring and quick decision-making. As with any trading approach, there are inherent risks, so traders should exercise caution and proper risk management.

F. Long-Term Investor

A long-term investor is someone who holds onto their investments, like stocks, for many years. They focus on the potential growth and performance of their investments over time rather than short-term market fluctuations.

When choosing investments, long-term investors look for strong companies with good growth potential. They consider the company's financial health, business model, management team, competitive advantage, and industry trends before making a decision. Short-term price movements are less important to them as they focus on the company's long-term prospects and value.

Three simple technical indicators are commonly used by long-term investors:

• Bollinger Bands: These are trend lines drawn above and below a 20-day average of a security's price. When the price touches the bottom line, it is considered oversold, and when it touches the top line, it is considered overbought. Long-term investors use the 20-day average to decide when to buy (when the price goes below) or sell (when it goes above). It helps them visualize the "buy low, sell high" principle, and they find an oversold strong company more attractive for long-term investment [7].

• 200-Day Simple Moving Average: This is a crucial indicator for long-term investors as it represents a strong support level for a security price. If the price falls below the 200-day moving average, it may indicate potential risks with the company's financial health or undervaluation. This indicator helps long-term investors assess a company's overall strength and make informed decisions [7].

• Relative Strength Index (RSI): RSI measures recent price changes and shows if a security is oversold (RSI below 30) or overbought (RSI above 70). Long-term investors use RSI along with Bollinger Bands to plan their trades better. Buying shares of a strong company when both indicators show it is oversold can be a good long-term investment opportunity [7].

These indicators provide valuable information to long-term investors. Bollinger Bands help identify entry and exit points based on price volatility, while the 200-day moving average acts as a crucial support level. RSI helps evaluate if a security is oversold or overbought, aiding in well-timed investment decisions. Combining these indicators offers a more comprehensive view of a company's potential for long-term investors.

However, it's essential to know that technical indicators have limitations. They cannot predict future price movements with certainty. Long-term investors should use these indicators along with fundamental analysis for well-rounded decisions. Relying solely on indicators may lead to overlooking critical information or misinterpreting market signals. External factors like economic events or sudden news can also influence the stock market, making it challenging to rely solely on indicators for long-term investing.

G. Machine Learning

Machine learning refers to computers learning by studying and analyzing data and then predicting outcomes. This is achieved by using data and algorithms to imitate the way humans learn [8].

Machine learning algorithms make predictions from input data, which may be labelled or unlabelled, and produce estimates about patterns in that data [8]. Some common machine learning algorithms include:

1. Neural networks: simulate the way the human brain works. They are good at recognizing patterns and play an important role in applications such as language translation and image recognition [8].

2. Linear regression: predicts numerical values based on a linear relationship [8].

3. Decision trees: can be used both to predict values and to classify data into categories [8].

H. Regression

Regression is a predictive modeling technique that predicts continuous outcomes based on relationships between features and outcomes. It is widely used in supervised machine learning for various purposes, including forecasting trends and outcomes. Representative labeled training data is crucial for accurate predictions. Common uses include predicting house prices, stock prices, and analyzing datasets for insights [9].

I. Linear Regression

Linear regression is a popular and straightforward machine learning technique used for making predictions. It establishes a straight-line relationship between a target variable (y) and one or more input variables (x).

The main goal of linear regression is to find the best-fitting line, which minimizes the difference between predicted and actual values. This is achieved using a cost function, usually Mean Squared Error (MSE), to measure how well the model performs. The model's coefficients (the intercept a0 and the slope a1) are adjusted through Gradient Descent to minimize the cost function.

To assess the model's accuracy, R-squared is employed, which indicates how well the line fits the data points and the strength of the relationship between the variables. To ensure reliable results, linear regression relies on certain assumptions, such as a linear relationship between variables, minimal multicollinearity (high correlation between input variables), constant error variance (homoscedasticity), normal distribution of error terms, and no autocorrelation in error terms. Meeting these assumptions is vital for creating an effective linear regression model [10].
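The fitting procedure and the R-squared measure described above can be illustrated with a short NumPy sketch. The function names, the learning rate, the iteration count, and the toy data are all assumptions chosen for illustration; in practice a library routine would normally be used instead of hand-written gradient descent.

```python
import numpy as np


def fit_line_gradient_descent(x, y, learning_rate=0.05, n_iters=5000):
    """Fit y = a0 + a1 * x by gradient descent on the mean squared error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a0, a1 = 0.0, 0.0
    for _ in range(n_iters):
        error = (a0 + a1 * x) - y
        # Partial derivatives of MSE = mean(error ** 2) with respect to a0 and a1.
        grad_a0 = 2.0 * error.mean()
        grad_a1 = 2.0 * (error * x).mean()
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up toy data lying close to the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
a0, a1 = fit_line_gradient_descent(x, y)
predictions = a0 + a1 * x
print(a0, a1, mean_squared_error(y, predictions), r_squared(y, predictions))
```

The learning rate matters here: if it is set too high for the scale of the inputs, the updates diverge instead of settling on the best-fitting line, which is one reason input features are often rescaled before gradient descent is applied.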

Linear regression has both advantages and disadvantages. On the positive side, it is a simple and computationally efficient model for expressing the relationship between predictor variables and the predicted variable. The output of linear regression is interpretable, allowing us to understand the relative influence of predictors on the target variable when predictors are independent [11].

However, there are limitations to consider. Linear regression is often overly simplistic and may struggle to capture complex real-world relationships. It assumes a linear relationship between predictor and predicted variables, which may not always hold true. Outliers can significantly impact the model, leading to less reliable results. Additionally, linear regression assumes independence among predictor variables, which can be challenging to meet in practice. High multicollinearity among predictors can result in unreliable model weights and make it difficult to determine feature importance accurately [11].

J. Polynomial Regression

Polynomial Regression is a type of linear regression used when the relationship between variables is not a straight line but shows a curved pattern. It models this curved relationship by using higher-order polynomial terms of the independent variable.

We use polynomial regression when a straight-line model fits the data poorly because of its curved nature. When a linear model is applied to curved data, the scatter plot of residuals shows patterns of positive and negative residuals, indicating that a non-linear model might be better. Note that polynomial regression violates the assumption of independence among the independent variables, since the higher-order terms are derived from the same variable [12].

In Python, polynomial regression can be implemented using libraries like NumPy, Pandas, Matplotlib, and Scikit-learn. By adding higher-order terms of the independent variable in the feature space, we construct a polynomial model suited to non-linear data patterns.

Polynomial regression finds application in various real-world cases where data is non-linear, like modeling growth rates, disease progression, and distribution patterns [12].

To fit the polynomial regression model, we use Scikit-learn's PolynomialFeatures class to transform the input data into polynomial features, followed by LinearRegression to fit the model [12].
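A minimal sketch of this two-step procedure is shown below. The synthetic quadratic data, the choice of degree 2, and the variable names are assumptions made for illustration; `make_pipeline` is used only as a convenient way to chain the two Scikit-learn steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data: y = 0.5 * x^2 - x + 2 (noise-free, for illustration only).
x = np.linspace(-3, 3, 25).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 - x.ravel() + 2.0

# Step 1 expands x into [1, x, x^2]; step 2 fits a linear model on those features.
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)

# A plain straight-line fit on the same data, for comparison.
linear_model = LinearRegression().fit(x, y)

poly_r2 = poly_model.score(x, y)      # R-squared of the polynomial fit
linear_r2 = linear_model.score(x, y)  # R-squared of the straight-line fit
print(poly_r2, linear_r2)
```

On curved data like this, the polynomial pipeline should score a noticeably higher R-squared than the straight-line model. On real, noisy data the degree would need to be chosen with held-out data, since a high degree fits the training set well but generalizes poorly.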

However, caution is necessary to avoid overfitting, where the model becomes too complex and doesn't perform well on new data. Regularization techniques like Lasso and Ridge regression can help penalize model complexity and prevent overfitting.

K. Multiple Regression

Multiple linear regression is a statistical method used to predict the outcome of a dependent variable by considering two or more independent variables [13]. It helps analysts understand how each independent variable impacts the overall variance of the model. Multiple regression can accommodate both linear and non-linear relationships.

In multiple linear regression, the formula involves the dependent variable (yi), the intercept (β0), the regression coefficients (β1, β2, ..., βp) representing the effect of each independent variable, and a random error term (ϵ). The main objective is to establish a relationship between the dependent variable and the independent variables.
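Written out, the model described above takes the following standard form. The symbols x_{i1}, ..., x_{ip} are notation assumed here (the text does not name them) for the values of the p independent variables in observation i:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon
```

Each coefficient \beta_j is then read as the expected change in y_i for a one-unit change in x_{ij} while the other independent variables are held fixed.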

There are several assumptions that must be met for accurate results in multiple linear regression:

• Linear Relationship: It assumes a linear relationship between the dependent and independent variables, which can be checked using scatter plots.

• No Multicollinearity: The independent variables should not be highly correlated, as this can create difficulties in identifying the specific variable affecting the dependent variable.

• Homoscedasticity: The residuals should have a constant variance across the linear model.

• Independence of Observations: Each observation should be independent of others, and the residuals' values should not be correlated.

• Multivariate Normality: The residuals should follow a normal distribution, which can be checked using histograms or normal probability plots [14].

Meeting these assumptions improves the reliability of the multiple linear regression model and helps in making accurate predictions.

L. Bias-Variance Tradeoff in Regression: Lasso, Ridge, and Elastic Net

Lasso, Ridge, and Elastic Net are three distinct techniques used in machine learning to address the bias-variance tradeoff. Bias represents the model's underlying assumptions that simplify the target function, while variance pertains to the model's sensitivity to small fluctuations in the data.

A model with high bias tends to make more assumptions, leading to underfitting, while high variance causes overfitting by capturing noise and outliers from the training data.

To address this tradeoff, we have three regression techniques:

Ridge Regression: Ridge Regression adds a penalty term equal to the sum of the squared coefficients to the cost function. By controlling this penalty through a parameter lambda, the model can shrink the magnitude of the coefficients toward zero, although not exactly to zero. This results in higher bias but lower variance, making it suitable for decreasing model complexity without reducing the number of variables [15].

Lasso Regression: Lasso Regression adds a penalty term equal to the sum of the absolute values of the coefficients to the cost function. As the penalty grows, it encourages the model to shrink certain coefficients to exactly zero. Lasso is therefore particularly useful for feature selection, as it can set some coefficients to zero, effectively disregarding less important features. However, it may encounter challenges with collinear variables [15].

Elastic Net: Elastic Net combines the regularization of both Lasso and Ridge. It proves useful when Lasso introduces a slight bias, making the model too reliant on specific variables. By utilizing Elastic Net, we can harness the benefits of both Lasso and Ridge while mitigating their respective limitations [15].

Each of these regression techniques helps strike a balance between bias and variance, thereby enabling the construction of more robust and accurate regression models.
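The contrast between the three penalties can be seen on a small synthetic example using Scikit-learn. The data, the random seed, and the penalty strengths are assumptions chosen for illustration; note that Scikit-learn names the penalty parameter `alpha` where the text above uses lambda.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features influence the target; the other three are irrelevant.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)                    # squared (L2) penalty
lasso = Lasso(alpha=0.1).fit(X, y)                    # absolute-value (L1) penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

for name, model in [("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
    print(name, np.round(model.coef_, 3))
```

In a setup like this, Ridge is expected to keep small non-zero weights on the irrelevant features, while Lasso drives them to exactly zero, which is the feature-selection behaviour described above.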

M. ARIMA

ARIMA, which stands for Autoregressive Integrated Moving Average, is a statistical model used for time series forecasting. Time series data is a series of data points collected at successive time intervals, such as stock prices over days, months, or years.

The ARIMA model consists of three main components:

Autoregressive (AR) Component (p): This component uses past values of the time series to predict future values. It assumes that the future value of the time series is linearly dependent on its past values. The "p" in ARIMA(p, d, q) represents the number of lagged observations used in the model.

Integrated (I) Component (d): This component involves differencing the time series to make it stationary. A stationary time series has constant statistical properties over time, making it easier to predict. The "d" in ARIMA(p, d, q) represents the number of times the differencing is performed.

Moving Average (MA) Component (q): This component uses past forecast errors in a regression-like model to predict future values. It assumes that the future value of the time series is related to the past forecast errors. The "q" in ARIMA(p, d, q) represents the number of lagged forecast errors used in the model.

When combined, these three components form the ARIMA(p, d, q) model, which is capable of handling different types of time series data and making accurate predictions based on historical patterns.
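To make the roles of "p" and "d" concrete, the sketch below implements only a restricted case, ARIMA(p, 1, 0): the series is differenced once, an AR(p) model is fitted to the differences by ordinary least squares, and the forecasts are summed back to price levels. The moving-average part (q) is deliberately left out because estimating it requires iterative methods, so this is not a full ARIMA implementation. Function names and the toy series are assumptions.

```python
import numpy as np


def fit_ar_least_squares(series, p):
    """Estimate [intercept, phi_1, ..., phi_p] of an AR(p) model by least squares."""
    series = np.asarray(series, dtype=float)
    # Row for time t holds [1, y[t-1], ..., y[t-p]] and is used to predict y[t].
    rows = [np.concatenate(([1.0], series[t - p:t][::-1])) for t in range(p, len(series))]
    design = np.array(rows)
    target = series[p:]
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs


def forecast_differenced_ar(prices, p=1, steps=3):
    """ARIMA(p, 1, 0) sketch: difference once, fit AR(p), undo the differencing."""
    prices = np.asarray(prices, dtype=float)
    diffs = np.diff(prices)  # d = 1: work with day-to-day changes
    coeffs = fit_ar_least_squares(diffs, p)
    history = list(diffs)
    level = prices[-1]
    forecasts = []
    for _ in range(steps):
        lags = history[-p:][::-1]  # most recent change first
        next_diff = coeffs[0] + float(np.dot(coeffs[1:], lags))
        history.append(next_diff)
        level = level + next_diff  # integrate the predicted change back to a price
        forecasts.append(level)
    return np.array(forecasts)


# Made-up price series with a steady upward drift.
prices = 100.0 + 2.0 * np.arange(20)
print(forecast_differenced_ar(prices, p=1, steps=3))
```

For real work, a statistics library that estimates all three components jointly would be the usual choice; the point here is only to show how differencing and lagged values enter the forecast.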

ARIMA models find extensive applications across industries for demand forecasting, stock price prediction, economic analysis, and more. They are effective for short-term predictions and can handle non-stationary time series data. However, ARIMA models also have limitations. They may struggle to predict turning points in the data, and determining the appropriate values of "p," "d," and "q" often involves some trial and error or expert judgment. Additionally, ARIMA models may not perform well for long-term forecasts or time series data with seasonal patterns [16].

N. KNN

K-Nearest Neighbor (KNN) is a straightforward and widely used machine learning algorithm that operates on the principle of Supervised Learning. It is particularly useful for classification tasks, as it categorizes a new data point based on its similarity to the available categories. One significant advantage of KNN is its simplicity in implementation. It requires minimal parameter tuning and can be quickly applied to various datasets.

Another advantage of KNN is its robustness to noisy training data when "K" is sufficiently large. Since KNN relies on the proximity of several data points to make predictions, isolated noisy data points are less likely to influence the overall classification. This characteristic makes KNN suitable for dealing with datasets containing outliers or noise.

Moreover, KNN can be effective when the training data is large. Since it stores all available data, it doesn't require a lengthy training process and can quickly adapt to new data points without retraining the model. This feature makes KNN efficient in scenarios where the dataset is continuously growing or being updated.

However, there are certain considerations to keep in mind when using the KNN algorithm. One crucial factor is selecting the value of "K." The appropriate choice of "K" is essential to achieving optimal performance. Setting "K" too low, such as "K=1" or "K=2," might lead to overfitting, making the model sensitive to noise and outliers in the data. On the other hand, large "K" values could lead to underfitting, where the model may lose important patterns and result in inaccurate predictions.

Another disadvantage of KNN is the high computation cost, especially with large datasets. To classify a new data point, KNN needs to calculate the Euclidean distance to all training samples and select the "K" nearest neighbors. This process becomes computationally expensive as the size of the dataset increases, potentially making the algorithm inefficient for real-time or resource-constrained applications [17].
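The classification step described above (compute the Euclidean distance to every training sample, keep the "K" nearest, take a majority vote) can be sketched as follows. The function name, the two-feature toy data, and the "up"/"down" labels are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    train_X = np.asarray(train_X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training sample.
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up training set: two features per sample and a price-direction label.
train_X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
train_y = ["down", "down", "down", "up", "up", "up"]
print(knn_predict(train_X, train_y, [1.5, 1.5], k=3))
print(knn_predict(train_X, train_y, [8.5, 8.5], k=3))
```

The distance computation touches every training row for every query, which is the computational cost the paragraph above refers to; spatial index structures can reduce it, but they are outside the scope of this sketch.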

O. Moving Average

The Moving Average (MA) method is a widely used time series forecasting technique that plays a crucial role in smoothing out fluctuations and identifying long-term trends while reducing the impact of short-term variations. Its applications span diverse fields, such as stock price prediction, economic forecasting, and the analysis of pandemics like COVID-19 [18].

The method works by using a sliding window of fixed width (w) that moves with a specified stride over the time series data. Within this window, the average of data points is computed, and the original data points are replaced with their respective average values, resulting in a new series with reduced fluctuations and noise.

Several types of moving averages exist. The Simple Moving Average (SMA) calculates the standard mean of the values within the sliding window, while the Weighted Moving Average (WMA) assigns weights to each data point, giving more importance to recent values. The Exponential Moving Average (EMA) is a special case of WMA that applies exponentially decreasing weights to older values, thus prioritizing recent trends.
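The three variants can be written compactly in NumPy. In this sketch the window moves with a stride of 1, the WMA uses linearly increasing weights, and the EMA is controlled by a smoothing factor `alpha`; these specifics, like the function names, are assumptions made here and are not prescribed by the description above.

```python
import numpy as np


def simple_moving_average(values, window):
    """Plain mean of each sliding window (stride 1)."""
    values = np.asarray(values, dtype=float)
    return np.array([values[i - window + 1:i + 1].mean()
                     for i in range(window - 1, len(values))])


def weighted_moving_average(values, window):
    """Weighted mean of each window with linear weights 1..window (newest weighs most)."""
    values = np.asarray(values, dtype=float)
    weights = np.arange(1, window + 1, dtype=float)
    weights /= weights.sum()
    return np.array([np.dot(values[i - window + 1:i + 1], weights)
                     for i in range(window - 1, len(values))])


def exponential_moving_average(values, alpha=0.5):
    """Recursive EMA: each step blends the new value with the previous average."""
    values = np.asarray(values, dtype=float)
    ema = [values[0]]
    for value in values[1:]:
        ema.append(alpha * value + (1.0 - alpha) * ema[-1])
    return np.array(ema)


prices = [1.0, 2.0, 3.0, 4.0, 5.0]
print(simple_moving_average(prices, window=3))
print(weighted_moving_average(prices, window=3))
print(exponential_moving_average(prices, alpha=0.5))
```

In the EMA recursion, a value that is k steps old ends up with a weight proportional to (1 - alpha)^k, which is the exponentially decreasing weighting mentioned above.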

One significant advantage of the Moving Average method is its speed and computational efficiency, making it suitable for handling large datasets and real-time forecasting applications. Additionally, it is easy to update the model with new data points without complicating the prediction process. Moreover, MA models are interpretable and explainable, enabling stakeholders to understand the model's functioning and customize it to suit specific business needs [18].

However, there are certain considerations and limitations when using the Moving Average method. To provide accurate forecasts, a sufficient number of samples is needed to establish a reliable trend. The method may not effectively capture pattern-based long-term trends, limiting its ability to predict far into the future without retraining the model. Unlike other machine learning models, MA cannot identify relationships between variables and assign custom weights to features based on their importance [18].

P. RNN

A Recurrent Neural Network (RNN) is a specialized type of neural network designed to handle sequential data like time series and text data. Unlike traditional neural networks where inputs and outputs are treated independently, RNNs incorporate a hidden layer that enables them to retain information from previous steps in the sequence [19].

The architecture of an RNN is similar to other deep neural networks, consisting of input and output layers. However, the key distinction lies in how information is processed and flows from input to output. In an RNN, the same set of weights is used across all time steps, and the hidden state at each step is updated based on the current input and the previous hidden state [19].

The hidden state (h_t) at a given time step is computed using the formula: h_t = σ(U X_t + W h_(t-1) + B), where h_t represents the current hidden state, U is the weight matrix for the current input (X_t), W is the weight matrix for the previous hidden state (h_(t-1)), B is the bias term, and σ is the activation function.

The output (Y_t) at each time step is calculated using: Y_t = O(V h_t + C), where Y_t denotes the output at the current time step, V represents the weight matrix for the output layer, C is the bias term, and O is the output activation function.

The distinctive advantage of RNNs lies in their hidden state, which enables them to remember information from past inputs, making them adept at handling sequential data effectively. Moreover, the model's parameters (W, U, V, B, C) are shared across all time steps, reducing the complexity compared to other neural networks [19].
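The recurrence can be made concrete with a small forward pass. The sketch below is illustrative only: it assumes tanh as the hidden activation, an identity output activation, a zero initial hidden state, and hypothetical toy weights. It shows the same U, W, V, B, and C being reused at every time step.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, B, C):
    """Run a simple RNN over the sequence xs and return all outputs and the last hidden state."""
    h = [0.0] * len(B)  # initial hidden state, assumed to be zero
    outputs = []
    for x in xs:
        # h_t = tanh(U x_t + W h_(t-1) + B); tanh is an assumed choice of activation
        pre = [u + w + b for u, w, b in zip(matvec(U, x), matvec(W, h), B)]
        h = [math.tanh(p) for p in pre]
        # Y_t = V h_t + C; the output activation is taken as the identity here
        outputs.append([v + c for v, c in zip(matvec(V, h), C)])
    return outputs, h

# Hypothetical toy parameters: 1 input feature, 2 hidden units, 1 output.
U = [[0.5], [-0.3]]
W = [[0.1, 0.2], [0.0, 0.1]]
V = [[1.0, -1.0]]
B = [0.0, 0.0]
C = [0.0]

sequence = [[1.0], [2.0], [3.0]]  # e.g. three consecutive scaled prices
outputs, h_final = rnn_forward(sequence, U, W, V, B, C)
```

Because the parameters do not depend on the time step, the parameter count stays fixed regardless of the sequence length.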

RNNs are trained using Backpropagation Through Time (BPTT), a variation of the Backpropagation algorithm that computes and updates gradients over all previous time steps.

There are different types of RNNs based on the number of inputs and outputs in the network:

• One to One: Similar to a simple feedforward neural network, it has one input and one output.

• One to Many: It generates multiple outputs based on one input, such as image captioning.

• Many to One: It produces a single output based on multiple inputs, commonly used in sentiment analysis.

International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056

Volume: 12 Issue: 04 | Apr 2025 www.irjet.net p-ISSN: 2395-0072

• Many to Many: Both multiple inputs and multiple outputs are utilized, frequently used in language translation.

To overcome challenges like vanishing and exploding gradients, advanced versions of RNNs have been introduced. Notable variations include Bidirectional Neural Networks (BiNN) and Long Short-Term Memory (LSTM). BiNN allows information to flow in both directions, valuable for tasks where context is crucial, like natural language processing. LSTM incorporates gates to selectively read, write, and forget information, effectively handling long-term dependencies [19].

Q. Graph Neural Networks

A graph is a data structure that represents a set of objects (nodes) and the connections between them (edges). It serves as a powerful tool to model complex relationships and interactions among different entities [22].

A Graph Neural Network (GNN) is a specialized deep learning technique designed to process non-Euclidean structured data. Non-Euclidean data refers to data that lacks fixed size or dimensionality, making it challenging to analyze using traditional deep learning methods that work well with Euclidean data, such as images with fixed dimensions.

The key concept behind GNN is to work with graphs using deep learning principles. GNNs exploit the inherent graph structure to perform computations and make predictions based on the relationships between the nodes.

Mainstream Graph Neural Network models include:

• Graph Convolutional Network (GCN): GCN is built upon spectral methods, which are closely related to graph signal processing. It leverages the convolution theorem to transform signals between the spatial and spectral domains, enabling computations on the graph. Nonlinear activation functions are applied to the aggregated results, and multiple layers are stacked to form a neural network [22].

• Graph Recurrent Network (GRN): GRN transforms the graph data into a sequence and allows nodes to exchange information with neighboring nodes iteratively until reaching a stable state [22].

• Graph Attention Network (GAT): GAT is suitable for sequential tasks and excels in handling graphs with varying sizes. It focuses on the most crucial elements of input data and uses attention mechanisms to emphasize relevant information from neighboring nodes [22].
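The neighbor aggregation performed by a GCN layer can be sketched as follows. The text does not fix a particular variant, so this example assumes the commonly used first-order propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix, I adds self-loops, and D is the degree matrix of A + I. The three-node graph, node features, and weights are hypothetical.

```python
import math

def gcn_layer(A, H, W):
    """One graph convolution: normalize the adjacency, aggregate neighbors, apply ReLU."""
    n = len(A)
    # Add self-loops so each node keeps its own features.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    # Symmetric normalization: entry (i, j) is divided by sqrt(deg_i * deg_j).
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    n_in, n_out = len(W), len(W[0])
    # Aggregate neighbor features, then apply the learnable weights.
    agg = [[sum(norm[i][k] * H[k][f] for k in range(n)) for f in range(n_in)] for i in range(n)]
    return [
        [max(0.0, sum(agg[i][f] * W[f][o] for f in range(n_in))) for o in range(n_out)]
        for i in range(n)
    ]

# Hypothetical graph of three related stocks: 0 - 1 - 2 (a path).
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
H = [[1.0], [2.0], [3.0]]  # one feature per node
W = [[1.0]]                # 1 input feature -> 1 output feature

H_next = gcn_layer(A, H, W)
```

Stacking several such layers lets information propagate beyond immediate neighbors.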

R. LSTM

LSTM stands for Long Short-Term Memory, and it belongs to the category of recurrent neural networks (RNNs) [20]. The primary purpose of LSTM is to overcome the limitations of conventional RNNs, which struggle with learning long-term dependencies due to issues like vanishing or exploding gradients.

In contrast, LSTM networks are specifically designed to handle long-term dependencies and accurately represent sequences in chronological order. The distinguishing feature of LSTM is its internal cell design, comprising three logistic sigmoid gates and a tanh layer. These gates control the flow of information, enabling the network to decide what information to retain and what to discard.

The architecture of an LSTM includes a hidden layer with a gated unit or cell. Each LSTM cell takes three inputs: the current input, the previous hidden state, and the previous cell state. It produces two outputs: the hidden state and the cell state. The forget gate, one of the sigmoid layers, plays a crucial role in determining how much information from the previous cell state should be retained for the current step.
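A single LSTM step can be sketched as below, following the standard formulation with forget, input, and output gates plus a tanh candidate layer. The parameter layout (a dictionary of input weights, hidden weights, and bias per gate) and the all-zero toy parameters are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(Wx, Wh, b, x, h):
    """Compute Wx @ x + Wh @ h + b for one gate."""
    return [
        sum(w * xi for w, xi in zip(wx_row, x)) + sum(w * hi for w, hi in zip(wh_row, h)) + bi
        for wx_row, wh_row, bi in zip(Wx, Wh, b)
    ]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM cell step: three inputs (x, h_prev, c_prev), two outputs (h, c)."""
    f = [sigmoid(z) for z in affine(*p["f"], x, h_prev)]    # forget gate
    i = [sigmoid(z) for z in affine(*p["i"], x, h_prev)]    # input gate
    o = [sigmoid(z) for z in affine(*p["o"], x, h_prev)]    # output gate
    g = [math.tanh(z) for z in affine(*p["g"], x, h_prev)]  # candidate values (tanh layer)
    # The forget gate scales the old cell state; the input gate admits new candidates.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

# Hypothetical parameters: 1 input feature, 1 hidden unit, all weights and biases zero.
zero_gate = ([[0.0]], [[0.0]], [0.0])  # (input weights, hidden weights, bias)
p_zero = {"f": zero_gate, "i": zero_gate, "o": zero_gate, "g": zero_gate}

h, c = lstm_step([1.0], [0.0], [2.0], p_zero)
```

With all parameters at zero every gate outputs 0.5, so exactly half of the previous cell state is carried forward, which makes the role of the forget gate easy to see.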

LSTM networks have found applications in various fields, including text generation, image processing, speech and handwriting recognition, music generation, and language translation. LSTM models need to be trained on appropriate datasets before they are used in real-world applications [20].

However, LSTMs come with certain drawbacks. They can be computationally intensive and demand high memory bandwidth. Researchers are actively working on developing models capable of storing past data for even longer durations. LSTMs are also prone to overfitting, and dropout is difficult to apply effectively to address this issue [20].

S. Measure of the accuracy of machine learning

Accuracy is a widely used metric in Machine Learning for evaluating classification models. It measures the percentage of correct predictions made by a model out of the total number of predictions. Accuracy is calculated by dividing the number of correct predictions by the total number of predictions made [21].

In simpler cases, accuracy is easy to understand and implement, making it a popular choice for model evaluation. However, in real-life scenarios, machine learning problems are often more complex. Issues like imbalanced datasets, multiclass or multilabel classification, and differing objectives can make accuracy less suitable as the sole evaluation metric.

The Accuracy Paradox illustrates a common problem with accuracy when dealing with imbalanced datasets [21]. A high accuracy score may be misleading if the model performs poorly on minority classes. For instance, in medical diagnosis, misclassifying serious illnesses can have severe consequences, even if the overall accuracy seems high.
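The paradox is easy to reproduce on a small example. The sketch below uses a hypothetical dataset with 5 positive and 95 negative samples and a trivial model that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical imbalanced labels: 5 positives (minority class), 95 negatives.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a model that always predicts the majority class

acc = accuracy(y_true, y_pred)  # high, although no positive is ever detected
rec = recall(y_true, y_pred)    # zero: every minority sample is missed
```

The model scores 95% accuracy while detecting none of the minority samples, which is why accuracy alone can be misleading here.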

To address these limitations, alternative metrics such as precision, recall, F-score, and confusion matrix can be utilized. These metrics provide insights into the model's performance at a class level, helping to identify weaknesses in specific areas. In multiclass and multilabel problems, different accuracy formulas account for the complexities inherent in these scenarios.

For multilabel problems, where a sample can carry multiple labels that are not mutually exclusive, Hamming Score and Hamming Loss are relevant metrics. Subset Accuracy is another metric that requires all labels to match exactly for a given sample, making it suitable for strict classification tasks.
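The difference between a label-wise and a sample-wise multilabel metric can be sketched as follows. Hamming Loss is taken here as the fraction of individual label positions that are wrong; definitions of the Hamming Score vary between sources, so it is left out. The label matrices are hypothetical.

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label positions that are predicted incorrectly."""
    total = sum(len(row) for row in Y_true)
    wrong = sum(t != p for rt, rp in zip(Y_true, Y_pred) for t, p in zip(rt, rp))
    return wrong / total

def subset_accuracy(Y_true, Y_pred):
    """Fraction of samples whose entire label set is predicted exactly."""
    return sum(rt == rp for rt, rp in zip(Y_true, Y_pred)) / len(Y_true)

# Hypothetical data: 3 samples, 3 binary labels each.
Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]  # one wrong label in the first sample

h_loss = hamming_loss(Y_true, Y_pred)      # 1 wrong label out of 9
sub_acc = subset_accuracy(Y_true, Y_pred)  # 2 of 3 samples match exactly
```

A single wrong label costs one ninth under Hamming Loss but an entire sample under Subset Accuracy, which is what makes the latter the stricter metric.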

Ultimately, selecting appropriate metrics should align with the specific problem, business requirements, and workflow to effectively measure the model's performance.

T. Accuracy score

An accuracy score serves as a metric to assess the performance of a classification model in machine learning [23]. It denotes the proportion of correct predictions made by the model on a given dataset. The simplicity of the accuracy score's calculation and interpretation has led to its widespread usage, providing a single numerical value that reflects the model's ability to make accurate predictions.

Determining the accuracy score requires two essential components:

• Ground Truth Classes: These correspond to the actual class labels assigned to the data points within the dataset, representing the true values [23].

• Predictions Made by the Model: These are the class labels predicted by the model for the corresponding data points in the dataset [23].

The formula for computing accuracy is straightforward:

Accuracy = Number of correct predictions / Total number of predictions

Alternatively, a more formal representation involves using True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) values from the confusion matrix:

Accuracy = (TP + TN) / (TP + FP + TN + FN)

U. Logarithmic Loss

Logarithmic loss, also called log loss in some research papers, is a widely employed error metric in applied machine learning. Its purpose is to assess the accuracy of a model's predictions by comparing the predicted probabilities to the actual labels [24]. Log loss values are non-negative and have no upper bound, with zero signifying a perfect alignment between predictions and true labels. When dealing with multi-class problems, log loss typically exhibits higher tolerance levels compared to binary classification tasks.

This evaluation metric is particularly well-suited for binary classifiers, which are systems designed to distinguish between two outcomes, such as distinguishing spam from non-spam emails. In this context, lower log loss values indicate more accurate predictions, whereas higher log loss values suggest an elevated risk of misclassification.

To calculate log loss accurately, users must define the probabilities associated with each class before applying the log loss function. The formula entails computing the logarithm of the corrected probabilities (the probability the model assigns to each sample's true class) and subsequently determining the negative average of these logarithms [24].
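The calculation described above can be sketched for the binary case as follows. The clipping constant eps is an assumption added to avoid taking the logarithm of zero, and the labels and probabilities are hypothetical.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary log loss: negative mean log of the probability given to the true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)       # clip to keep log() finite (assumed safeguard)
        corrected = p if y == 1 else 1 - p  # probability assigned to the true class
        total += math.log(corrected)
    return -total / len(y_true)

# p_pred holds the predicted probability of class 1 for each sample.
good = log_loss([1, 0, 1], [0.9, 0.1, 0.8])  # confident and correct: small loss
bad = log_loss([1], [0.01])                  # confident and wrong: about 4.6
```

A single confidently wrong prediction is penalized far more heavily than several mildly uncertain correct ones.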

While log loss holds substantial importance for binary classifiers, it may not be the most appropriate metric for complex multiclass classification tasks. This is due to its label-dependent nature, rendering it less precise in such scenarios [24].

The efficacy of machine learning is also reliant on data processing methods, which have the potential to simplify the classification process and minimize log loss [24].

V. Confusion Matrix

The confusion matrix is a tool used in machine learning to evaluate the performance of classification models. It is a table with four different combinations of predicted and actual class labels [25].

To understand the confusion matrix, consider an analogy related to pregnancy:

• True Positive (TP):

Interpretation: The model predicted positive, and it is true.

Analogy: The model predicted that a woman is pregnant, and she is.

• True Negative (TN):

Interpretation: The model predicted negative, and it is true.

Analogy: The model predicted that a man is not pregnant, and he is not.

• False Positive (FP) - Type 1 Error:

Interpretation: The model predicted positive, but it is false.

Analogy: The model predicted that a man is pregnant, but he is not.

• False Negative (FN) - Type 2 Error:

Interpretation: The model predicted negative, but it is false.

Analogy: The model predicted that a woman is not pregnant, but she is.

In the confusion matrix, the predicted values are described as Positive or Negative, while True or False indicates whether the prediction matches the actual value.

The confusion matrix is valuable for computing various performance metrics, such as:

• Recall: It calculates how many of the positive class instances were predicted correctly.

• Precision: It calculates how many of the predicted positive class instances were actually positive.

• Specificity: Another term for True Negative Rate, measuring how well the model identifies negative samples.

• Accuracy: It measures the overall correctness of the model's predictions.

• AUC-ROC (Area Under the Receiver Operating Characteristic) curve: It provides an aggregate measure of model performance across all classification thresholds.

To calculate the confusion matrix for a 2-class classification problem, the actual class labels and the predicted class labels are needed so that they can be compared to determine the TP, TN, FP, and FN [25].
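The comparison of actual and predicted labels, and the metrics derived from the four counts, can be sketched as follows. The label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN by comparing each prediction with its actual label."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # predicted positive, actually negative (Type 1 error)
        elif t == positive:
            fn += 1  # predicted negative, actually positive (Type 2 error)
        else:
            tn += 1
    return tp, tn, fp, fn

# Hypothetical labels for ten samples (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

All four metrics are simple ratios of the same counts, so a single pass over the data is enough to report them together.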

W. AUC-ROC curve

The AUC-ROC (Area Under the Receiver Operating Characteristic) curve is a performance metric used in machine learning to evaluate classification models [26]. It represents the model's performance at different threshold values by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). TPR is the proportion of actual positive instances that are correctly predicted as positive, TP / (TP + FN), while FPR is the proportion of actual negative instances that are incorrectly predicted as positive, FP / (FP + TN) [26].

AUC, which stands for Area Under the ROC Curve, calculates the two-dimensional area under the entire ROC curve, ranging from (0,0) to (1,1). It measures the model's performance across different thresholds and provides an aggregate measure of its predictive power. A higher AUC value indicates better model performance, with a value close to 1 suggesting a good ability to distinguish between positive and negative instances.

AUC-ROC is useful in cases where the ranking of predictions matters more than their absolute values, making it scale-invariant. Additionally, it evaluates model performance without considering the specific classification threshold used, making it classification-threshold-invariant.

However, AUC-ROC is not recommended when calibrated probability outputs are required from the model or when there are significant imbalances in the costs of false negatives and false positives. In such cases, other evaluation metrics might be more appropriate.

Although AUC-ROC is primarily used for binary classification problems, it can be adapted for multi-class classification using the One vs. All approach [26]. This method involves constructing separate AUC-ROC curves for each class against the rest, enabling effective evaluation of the multi-class model's performance.
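The ranking interpretation of AUC can be sketched directly. The example below relies on the known equivalence between the area under the ROC curve and the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half; it does not trace the curve itself. The labels and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical model scores

auc = roc_auc(y_true, scores)
# Rescaling the scores keeps their order, so the AUC is unchanged (scale invariance).
auc_rescaled = roc_auc(y_true, [10 * s for s in scores])
```

Three of the four positive-negative pairs are ordered correctly here, and multiplying every score by ten leaves the result unchanged.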

X. Mean Absolute Error

Mean Absolute Error (MAE) is a statistical metric utilized to evaluate the accuracy of predictions in regression models. It computes the average magnitude of errors by measuring the absolute difference between predicted values and actual values. MAE assesses errors without regard to their direction, and because the errors are not squared it is more robust than squared-error metrics on datasets containing outliers [27].

The formula for Mean Absolute Error is as follows:

MAE = (1/n) Σ (i = 1 to n) |y_i - ŷ_i|

Where:

n is the number of observations in the dataset.

y_i is the true value of the target variable (the actual value).

ŷ_i is the value predicted by the regression model.

MAE is a linear score, giving equal weight to all errors, facilitating model comparison and interpretation. It is widely used in various disciplines like finance, engineering, and meteorology due to its simplicity and ability to provide valuable information about prediction errors.
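As a brief worked example of the formula, the sketch below computes MAE on four hypothetical actual and predicted prices.

```python
def mae(actual, predicted):
    """Mean Absolute Error: average of |y_i - y_hat_i| over all observations."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [100.0, 102.0, 101.0, 105.0]     # hypothetical closing prices
predicted = [101.0, 101.0, 103.0, 105.0]  # hypothetical model outputs

error = mae(actual, predicted)  # absolute errors are 1, 1, 2, 0
```

The result is expressed in the same unit as the prices, which is part of what makes MAE easy to interpret.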

Y. Mean Square Error

Mean Squared Error (MSE) is a statistical metric used to evaluate the performance of a regression model [27]. It calculates the average of the squared differences between the actual values and the model's predictions. To compute MSE, one subtracts the actual values from their corresponding predicted values, squares these differences, and computes their mean, thus obtaining a single numerical representation.

The formula for MSE is as below:

MSE = Σ [(Actual - Predicted)^2] / N,

where:

Σ denotes the sum over all data points

"Actual" is the actual value

"Predicted" is the predicted value

N is the total number of data points

MSE serves several purposes: it assesses forecast accuracy, handles positive and negative errors equally, and is sensitive to large deviations or outliers. A lower MSE indicates a more accurate model. In the example of ice cream demand forecasts discussed in [27], the calculated MSE was approximately 4.67.

Additionally, Root Mean Squared Error (RMSE), the square root of MSE, is often used for easier interpretation since it shares the same unit as the original data. In regression analysis, other metrics like Mean Absolute Error (MAE) and R-squared (R2) are also used for model evaluation, each having their own strengths and weaknesses. The choice of the appropriate metric depends on the specific dataset and the problem at hand [27].
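The sensitivity of MSE to large deviations can be seen in a short sketch. The price series below are hypothetical; the second set of predictions contains a single large miss.

```python
import math

def mse(actual, predicted):
    """Mean Squared Error: average of squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: same unit as the original data."""
    return math.sqrt(mse(actual, predicted))

actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 103.0, 105.0]     # errors of 1, 1, 2, 0
with_outlier = [100.0, 102.0, 101.0, 115.0]  # a single error of 10

mse_small = mse(actual, predicted)
rmse_small = rmse(actual, predicted)
mse_outlier = mse(actual, with_outlier)  # one large miss dominates the score
```

One error of 10 yields an MSE of 25, far above the 1.5 obtained from three small errors, because each error is squared before averaging.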

Z. F1 Score

The F1 score is a widely used evaluation metric in machine learning that combines precision and recall to measure a model's accuracy [28]. Unlike accuracy, which evaluates overall correct predictions, the F1 score focuses on class-wise performance, making it valuable for class-imbalanced datasets. Precision measures the proportion of true positive predictions among positive predictions, while recall measures the proportion of correctly identified positive class samples.

The F1 score is calculated as the harmonic mean of precision and recall, giving equal importance to both metrics. This makes it suitable for situations where maximizing both precision and recall simultaneously is essential. The F1 score ranges from 0 to 1 (0% to 100%), with higher values indicating better classifier performance.

To calculate the F1 score, a confusion matrix with True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) is required. The formula for the F1 score is: F1 Score = 2 * (Precision * Recall) / (Precision + Recall).

For multi-class datasets, there are different approaches to compute the F1 score:

• Macro-averaged F1 Score: Simple average of class-wise F1 scores for datasets with equal class samples.

• Micro-averaged F1 Score: Metric for multi-class data distributions using net TP, FP, and FN values.

• Sample-weighted F1 Score: Ideal for class-imbalanced datasets, calculating a weighted average based on class samples.

Additionally, there is the Fβ score, a generalized version of the F1 score where β is a user-defined weighting coefficient, allowing prioritization of precision or recall [28].

In Python, the F1 score can be calculated using the "f1_score" function from scikit-learn. The "classification_report" function provides a comprehensive list of metrics, including class-wise and average metrics. The F1 score is a valuable tool for evaluating classifier performance and is commonly used in classification tasks.
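The three averaging schemes can be sketched without external libraries as shown below. It uses the identity F1 = 2TP / (2TP + FP + FN), which follows from the harmonic mean of precision and recall. The helper names and labels are hypothetical; scikit-learn's "f1_score" exposes the corresponding options through its "average" parameter ("macro", "micro", "weighted").

```python
def class_counts(y_true, y_pred, cls):
    """TP, FP, FN for one class treated as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), equivalent to 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_report(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: class_counts(y_true, y_pred, c) for c in classes}
    per_class = {c: f1_from_counts(*counts[c]) for c in classes}
    macro = sum(per_class.values()) / len(classes)  # plain average over classes
    micro = f1_from_counts(*(sum(col) for col in zip(*counts.values())))  # pooled counts
    weighted = sum(per_class[c] * y_true.count(c) for c in classes) / len(y_true)
    return per_class, macro, micro, weighted

# Hypothetical 3-class labels (e.g. 0 = up, 1 = down, 2 = flat).
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

per_class, macro_f1, micro_f1, weighted_f1 = f1_report(y_true, y_pred)
```

On single-label multi-class data the micro-averaged F1 coincides with plain accuracy, while the macro and weighted averages differ whenever class sizes are unequal.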

AA. Data analysis and machine learning

Data processing is the essential task of converting data from one form into a more useful format. This process involves cleaning, transforming, and preparing data for analysis. Machine learning, math, and statistics are used to automate this process. The results can be shown in various forms like graphs, videos, and tables [29].

Data processing is crucial in machine learning because it makes data ready for building models. The main steps include collecting data, cleaning it up, analyzing it, making sense of the analysis, and storing it securely. Finally, the results are presented in an easy-to-understand way [29].

Good results in machine learning require high-quality and accurate data. Collecting data can be expensive and time-consuming. Organizations and researchers must decide carefully what data they need [29].

Data preparation involves getting data from different sources, analyzing it, and creating a new dataset for further work. Sometimes, data is turned into numbers for faster learning by models [29].

The data might not be easy for machines to read, so cleaning, filtering, and transforming the data to make it suitable for analysis is required [29].

Processing data involves using algorithms and machine learning techniques to follow instructions over a large amount of data with accuracy and efficiency [29].

The output stage provides meaningful results that are easy for users to understand. These results can be in the form of reports, graphs, or videos [29].

Data cleaning is a critical part of machine learning. It helps ensure the data is accurate and consistent. Removing errors and inconsistencies is important because they can affect model performance [30].

Data cleaning steps include inspecting the data's structure, checking for duplicates, and handling missing values. It is important to remove unnecessary or irrelevant observations [30].

Handling missing data is a common challenge. There are two main ways to deal with it: removing observations with missing values or imputing missing values from past data [30].
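The two options for missing data can be sketched as follows. Forward filling is used here as one possible way of imputing from past data, which is an assumption since the text does not name a specific technique; the records and field names are hypothetical.

```python
def drop_missing(rows):
    """Option 1: remove observations that contain any missing value."""
    return [r for r in rows if None not in r.values()]

def forward_fill(values):
    """Option 2: replace a missing value with the most recent past value."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None if nothing has been observed yet
        filled.append(v)
        last = v
    return filled

# Hypothetical daily records with gaps in the closing price.
rows = [
    {"day": 1, "close": 10.0},
    {"day": 2, "close": None},
    {"day": 3, "close": 10.4},
    {"day": 4, "close": None},
]

kept = drop_missing(rows)                        # keeps only days 1 and 3
closes = forward_fill([r["close"] for r in rows])
```

Dropping rows shortens the series, whereas forward filling keeps its length but repeats stale values, so the choice depends on how many gaps there are and how the model uses consecutive observations.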

Outliers, which are extreme values that differ significantly from the majority of data, need to be addressed as well. They can negatively impact analysis and model performance [31].

Data transformation means converting data into a format that is suitable for analysis. Techniques like normalization and scaling are used to transform data [31].

Normalization and scaling are important to ensure that features with different scales do not affect the model's performance. This step helps make sure all features are treated equally [31].
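Two common transformations can be sketched as below: min-max scaling to the range [0, 1] and standardization to zero mean and unit variance. The feature values are hypothetical, and the handling of constant features (returning zeros) is an assumption made to avoid division by zero.

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: assumed convention
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales.
prices = [100.0, 150.0, 200.0]
volumes = [1_000_000.0, 3_000_000.0, 2_000_000.0]

scaled_prices = min_max_scale(prices)
scaled_volumes = min_max_scale(volumes)
z_prices = standardize(prices)
```

After scaling, prices and volumes occupy the same range, so neither dominates distance-based or gradient-based models purely because of its unit. In practice the minimum, maximum, mean, and standard deviation are usually estimated on the training data only and then reused on the test data.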

3. RESEARCH METHODOLOGY

The methodology section outlines the approach used in this research to review existing literature on stock price prediction using machine learning techniques. The objective of this study is to distill valuable insights and generate relevant guidelines as a starting point for practitioners entering the field of machine learning for stock price prediction. The selection of 24 research papers from IEEE, published between 2010 and 2023, forms the basis of this methodology.

To ensure the quality and relevance of the selected research papers, a stringent set of criteria was employed during the paper selection process. The primary inclusion criteria were as follows:

• Research papers must be peer-reviewed and published in IEEE journals or conference proceedings.

• The papers must focus on stock price prediction using machine learning methodologies.

• Papers must include some or all of the practical implementations, data processing details, algorithms, and results related to stock price prediction as verification of the effectiveness of the research done by the authors.

The selection of 24 research papers was motivated by the need to compile a comprehensive and diverse set of literature that encompasses various approaches, data sources, data processing methods, and algorithms. This selection strategy ensures that the final guideline reflects a broad spectrum of machine learning techniques and practical applications.

The final step of the methodology involved synthesizing the information obtained from the analyzed research papers to generate a comprehensive guideline. This guideline serves as a valuable resource for individuals looking to initiate their journey in machine learning for stock price prediction. It includes practical recommendations, best practices, and potential pitfalls to help beginners navigate the complexities of this field.

4. RESULTS AND DISCUSSIONS

This section discusses several case studies from existing papers on using machine learning to predict stock prices, including their benefits, weaknesses, and accuracy.

1. This research [32], named ‘Stock Prices Prediction Using Machine Learning’, focuses on predicting stock prices using machine learning techniques, particularly Support Vector Regression (SVR) and Long Short-Term Memory (LSTM). The study involves collecting daily stock price data from five companies (Amazon, Google, Tesla, Netflix, Facebook) between 2015 and 2020. The data is organized, cleaned, and used to construct and test SVR and LSTM models with a 100-day training period. Model performance is evaluated using Root Mean Squared Error (RMSE). The results indicate that LSTM generally outperforms SVR in predicting stock prices, although SVR with the radial basis function (RBF) kernel performs best for Google. The research does not explicitly discuss the suitability of these methods for short-term or long-term investors, but due to the daily prediction focus and data range, they appear more relevant for short-term investment strategies.

2. This research [33], named ‘Stock Prediction and analysis Using Supervised Machine Learning Algorithms’, focuses on using supervised machine learning algorithms to predict stock prices, especially in the context of the pandemic and its impact on the Indian stock market. The authors tried different methods, namely Random Forest, Decision Tree, and Logistic Regression, to make these predictions.

Fig -1: Steps of model development in research [33]

The research proceeded as follows. First, the authors obtained a dataset from Kaggle containing stock price information such as opening prices, daily highs and lows, trading volume, and more. Then, they trained these algorithms to predict stock prices. Figure 1 shows the flowchart of the steps taken, which include understanding the business problem, data acquisition, data cleaning, exploratory data analysis, applying the machine learning algorithm, evaluating the model accuracy, and deploying the model. In the results, Logistic Regression achieved an accuracy score of 52%, while Decision Tree performed better with 83%. However, detailed performance metrics beyond accuracy were not provided. The study did not specify whether these methods are suitable for short-term or long-term stock investment strategies. Additionally, it did not clarify the time period covered by the predictions or the timing of the research itself. The research also did not conclusively determine the effectiveness of these methods in predicting stock prices, leaving several questions unanswered.

3. This research [34], named ‘Analysing the Trend of Stock Market and Evaluate the performance of Market Prediction using Machine Learning Approach’, focuses on predicting stock market values with different methods for different stocks. First, data is gathered from different sources, including a database called Quandl, to obtain historical stock information. Then, the historical stock data is processed to create a dataset suitable for analysis. Next, the data is divided into categories like High Open, High Close, and Average Movement, making it easier to work with. Different prediction methods, including Support Vector Machine, Random Forest, and Neural Network, are tested and compared for accuracy. The results show that the Neural Network using the Levenberg-Marquardt method is the most accurate, with a reported accuracy of 94.17%.

4. The research paper [35], named ‘Analysis of Stock Price Prediction using Machine Learning Algorithms’, aims to predict stock prices for Reliance Industries Limited (RIL) using machine learning and deep learning techniques. The study followed several steps: first, the authors collected stock price data from November 11th, 2020, to November 10th, 2021, from the National Stock Exchange of India. Next, they cleaned the data by removing any missing information and focused solely on the closing prices. To evaluate their predictions, they split the data into two parts: one for training the models (80%) and the other for testing their accuracy (20%). They employed three prediction models: Linear Regression, Auto-ARIMA, and LSTM (Long Short-Term Memory). Linear Regression was used to predict stock prices based on various features. Auto-ARIMA was used for time series forecasting, with the Root Mean Squared Error (RMSE) calculated to assess accuracy. LSTM, a deep learning model, aimed to capture long-term patterns in the data. The researchers found that the LSTM-based model, which used one-week historical data, was the most accurate for predicting RIL's closing prices over a 25-day period. While this research primarily focuses on short-term predictions, covering the mentioned date range, it does not provide explicit guidance for long-term investors.

5. The research [36], named ‘Prediction of Stock Prices using Machine Learning (Regression, Classification) Algorithms’, focuses on predicting stock prices using machine learning techniques, specifically employing regression and classification algorithms. The step-by-step process includes data collection from Yahoo Finance, where historical stock data for companies within the S&P 500 index was obtained. Data preprocessing was performed to extract the relevant features, namely momentum and volatility. The dataset was split into training and test sets for model evaluation. Various models, including Simple Linear, Polynomial, Support Vector Regression, Decision Tree Regression, and Random Forest Regression, were implemented for stock price prediction. Accuracy results were provided for these models, with Random Forest Regression achieving the highest accuracy of 99.57%. In the classification task, Logistic Regression achieved a mean accuracy of 68.622%, and confusion matrix values were presented. The detailed regression results are presented in Figure 2.

Fig -2: Result of accuracy of Regression algorithm. [36]

For classification, the results of prediction are presented in Figure 3, which shows that SVM achieves the highest accuracy.

Fig -3: Result of accuracy of classification algorithm. [36]

However, the author mentioned that accuracy does not represent the power of the algorithms, as it still depends on the data fed in. Also, the author did not provide justification regarding the timing and how it is balanced.

6. The research [37], named ‘Stock price prediction based on multifactorial linear models and machine learning approaches’, is focused on predicting the closing prices of nine different stocks using various prediction models and considering the impact of 18 different factors. The research process involves selecting nine stocks from different industries and collecting daily market data for these stocks from January 1, 2019, to December 31, 2021. The paper does not mention specific steps for data cleaning. It divides the data into a training set (80%) and a test set (20%) to train and evaluate the models. The 18 factors primarily comprise technical indicators such as KDJ, RSI, Bollinger Bands (Boll), and Moving Average Convergence Divergence (MACD), as well as price-related data (e.g., opening price, highest price, lowest price, etc.); however, the specific details or definitions of these factors are not provided in the paper. Four prediction models are used: Multiple Linear Regression (MLR), Exponentially Weighted Moving Averages (EWMA), Extreme Gradient Boosting (XGBoost), and Long Short-Term Memory Network (LSTM). Accuracy is assessed using Mean Square Error (MSE) and Coefficient of Determination (R-squared) for each model. The results indicate that MLR and EWMA models show good accuracy, while XGBoost and LSTM models perform less effectively, especially when data is limited.

7. The research [38], named ‘Short Term Stock Price Prediction Using Deep Learning’, focuses on predicting short-term stock price movements using deep learning algorithms for ten different stocks listed on the New York Stock Exchange. It involves the use of two different neural network models, namely Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM), applied to minute-by-minute stock price data collected over a one-year period. A data normalization technique, namely min-max scaling, was employed to ensure consistent data ranges. The study selected various technical financial indicators, including trend, oscillator, and momentum indicators, as features to capture different aspects of stock price movements. Both models underwent training and validation on the same dataset to prevent overfitting. The accuracy of the models was evaluated using Root Mean Squared Error (RMSE). The results of the comparison of MLP and LSTM are shown in Figure 4.

Fig -4: RMSE value of LSTM and MLP algorithm. [38]

The research concludes that MLP outperformed LSTM in accurately predicting short-term stock prices, as it had a lower RMSE value. However, the applicability of these models to long-term investment strategies was not explored, given the short-term nature of the dataset. While the research suggests the potential of neural networks in predicting short-term stock prices, it does not offer a universally effective prediction method for investors and traders. Further refinement and validation may be needed to adapt these models for practical trading applications.

8. The primary focus of this research paper [54] named ‘Stock Price Prediction using Machine Learning’ is stock price prediction using machine learning methods, specifically LSTM and Regression models. The dataset used in the study is sourced from www.nseindia.com, covering 50 stocks from January 1st, 2000, to July 31st, 2020. While the paper describes two models - a Regression-Based Model and an LSTM Network-Based Model - it does not explicitly detail any data cleaning steps.

The experimental results section presents the outcomes of using the LSTM model for stock price prediction, with a particular emphasis on different training epochs. Visual representations illustrate how the LSTM model's predictions compare to actual trends. However, specific accuracy values are not provided. The author concluded that increasing the number of epochs can increase the precision.

9. The research [39] named ‘Prediction of the Stock Adjusted Closing Price Based On Improved PSO-LSTM Neural Network’ primarily focuses on predicting stock prices, particularly the adjusted closing price, by introducing an improved model known as IPSO-LSTM. This model combines a Long Short-Term Memory (LSTM) neural network with an improved Particle Swarm Optimization (PSO) algorithm. The study involves several key steps, starting with data preprocessing, where historical stock data is divided into training and test sets and standardized to facilitate analysis. Subsequently, the model's parameters are initialized, including those for the IPSO-LSTM model. The LSTM network is trained using the training data, while the PSO algorithm optimizes crucial hyperparameters during this process. The model's accuracy is assessed using various evaluation metrics like RMSE, MAPE, MAE, and R^2, with IPSO-LSTM consistently outperforming other baseline models. As shown in figure 5, IPSO-LSTM has a lower error rate and a higher R-squared value.

Fig -5: The model's accuracy is assessed using various evaluation metrics

The prediction range on the dataset is 100 days. The research demonstrates the model's robustness when applied to different stock indexes, namely the Dow Jones Industrial Average Index (DJI) and the Nasdaq Composite Index (IXIC). The results are shown in figure 6, which demonstrates that the gap between actual and predicted values is very small.

Fig -6: IPSO-LSTM performance on DJI and IXIC. [39]

10. The research [40] named ‘Stock Price Prediction and Recommendation Approach Based on Machine Learning’ uses machine learning, specifically LightGBM, to predict and recommend stocks within the Taiwan stock market. The primary goal is to establish a systematic approach for selecting stocks and determining the best times to buy and sell them. The study draws a comparison between the performance of machine learning-driven stock recommendations and two well-known Taiwan ETFs, 0050 and 0056. The research process starts with data collection from the Taiwan Stock Exchange website, encompassing fundamental, technical, and chip-related data. Subsequently, the data undergoes preprocessing to eliminate any erroneous values. It is then segmented into training and test datasets for further analysis. The LightGBM model is trained using the prepared datasets. Following this, model settings are defined, and stocks are ranked based on the model's predictions. A specified number of the top-ranked stocks are chosen, and their performance is evaluated through backtesting. The results suggest that the machine learning-based approach outperforms Taiwan ETFs 0050 and 0056 in terms of annualized return and volatility.
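The ranking and backtest summary described above can be sketched in a few lines. This is a hedged illustration, not the method of [40]: the tickers and scores are made up, and the convention of 252 trading days per year is an assumption that the paper may not share.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year


def top_n(scores, n):
    """Return the n tickers with the highest predicted score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


def annualized_return(daily_returns):
    """Compound the daily returns, then rescale to a one-year horizon."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1 + r
    return growth ** (TRADING_DAYS / len(daily_returns)) - 1


def annualized_volatility(daily_returns):
    """Sample standard deviation of daily returns scaled by sqrt(days)."""
    return statistics.stdev(daily_returns) * math.sqrt(TRADING_DAYS)


# Hypothetical model scores for four placeholder tickers.
scores = {"AAA": 0.12, "BBB": 0.30, "CCC": -0.05, "DDD": 0.21}
picks = top_n(scores, 2)
print(picks)

# Hypothetical daily returns of the selected portfolio.
portfolio_returns = [0.01, -0.005, 0.007, 0.002, -0.003]
print(annualized_return(portfolio_returns))
print(annualized_volatility(portfolio_returns))
```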

11. The research [41] named ‘Stock Price Forecasting on Telecommunication Sector Companies in Indonesia Stock Exchange Using Machine Learning Algorithms’ focuses on predicting stock prices for five telecommunications companies in Indonesia using machine learning techniques, particularly Gaussian Process and SMOreg. The study's steps include collecting historical stock price data spanning from January 1, 2017, to December 31, 2019, from Yahoo Finance and converting it into a usable CSV format. While data cleaning is mentioned, specific details on the cleaning process are absent. The dataset is then divided into training and testing sets in a 70:30 ratio. Metrics like RMSE, MAPE, and MBE are used to test accuracy. The results concluded that SMOreg outperforms the Gaussian Process.

12. The research [42] named ‘Prediction of Trends in Stock Market using Moving Averages and Machine Learning’ focuses on using machine learning to improve trading signals generated by moving averages in the stock market. It aims to improve the accuracy and timeliness of these signals, particularly in relation to moving average crossovers. The data preprocessing steps of this research are illustrated in figure 7.

Figure 7 illustrates that the model considers only the closing price and the date. The moving average is then calculated and used as the training data, after which the crossover is checked. The results demonstrate that the proposed machine learning model can improve the accuracy and timing of trading signals based on moving average crossovers.
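The baseline that this study builds on, a moving average crossover signal, can be sketched as follows. The window lengths (2 and 3), the toy closing prices, and the "buy"/"sell" naming are assumptions for illustration; the machine learning model that [42] adds on top of this signal is not reproduced here.

```python
def sma(values, window):
    """Simple moving average; None until `window` points are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


def crossovers(closes, short=2, long=3):
    """Return (index, signal) pairs where the short SMA crosses the long SMA."""
    fast, slow = sma(closes, short), sma(closes, long)
    signals = []
    for i in range(1, len(closes)):
        if None in (fast[i - 1], slow[i - 1], fast[i], slow[i]):
            continue  # not enough history yet
        prev_diff = fast[i - 1] - slow[i - 1]
        diff = fast[i] - slow[i]
        if prev_diff <= 0 < diff:
            signals.append((i, "buy"))   # short average moves above long
        elif prev_diff >= 0 > diff:
            signals.append((i, "sell"))  # short average moves below long
    return signals


# Toy closing prices (assumed data).
closes = [10, 9, 8, 9, 11, 12, 10, 8]
signals = crossovers(closes)
print(signals)
```

Moving averages lag the price by construction, which is the timeliness problem the reviewed study sets out to reduce.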

13. The research [43] ‘A Novel Approach to Improve Accuracy in Stock Price Prediction using Gradient Boosting Machines Algorithm compared with Naive Bayes Algorithm’ focuses on assessing the accuracy of stock price predictions using two machine learning algorithms: Gradient Boosting Machines (GBM) and Naive Bayes. The study involves several key steps, starting with data collection from the National Stock Exchange. Subsequently, data preprocessing is performed to clean the dataset by removing null and missing values and converting text-based data into a suitable format. The dataset is then split into a 25% training set and a 75% test set for evaluation. Both the GBM and Naive Bayes algorithms are applied to predict stock prices, and the study employs statistical analysis using IBM SPSS and Google Colab. The accuracy of the prediction methods is a pivotal aspect of the research. GBM achieves an accuracy rate of 92.3%, whereas Naive Bayes lags slightly behind with an accuracy of 87.7%. The results clearly show that GBM outperforms Naive Bayes in terms of prediction accuracy.

Fig -7: Flowchart of proposed model. [42]

14. The research paper [44] named ‘Enhanced Extreme Learning Machine Algorithm with Deterministic Weight Modification for Investment Decision on Indian Stocks’ focuses on predicting stock prices in the Indian stock market and introduces a new machine learning algorithm called DELM, designed to improve prediction accuracy and convergence rates. The study uses benchmark stock market datasets, including Nifty 50, S&P BSE Sensex, State Bank of India (SBIN), and ICICI Bank (ICICI). Technical indicators are used to extract relevant information from financial data, but the paper doesn't specify data cleaning steps. The accuracy of prediction methods is evaluated using metrics like RMSE, MAE, and DS. The results indicate that DELM outperforms other algorithms in terms of prediction accuracy.
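Of the metrics listed, Directional Symmetry (DS) is the least familiar. A minimal sketch follows, assuming the commonly used definition: the percentage of time steps at which the predicted change has the same sign as the actual change. Whether a zero change counts as a match varies between sources; here it does, and the exact formula used in [44] is not stated in this review.

```python
def directional_symmetry(actual, predicted):
    """Percentage of steps where predicted and actual moves agree in sign."""
    hits = 0
    for t in range(1, len(actual)):
        actual_move = actual[t] - actual[t - 1]
        predicted_move = predicted[t] - predicted[t - 1]
        if actual_move * predicted_move >= 0:  # zero moves count as matches
            hits += 1
    return 100.0 * hits / (len(actual) - 1)


# Toy series (assumed data): the second predicted move has the wrong sign.
actual = [10.0, 11.0, 10.5, 12.0]
predicted = [10.0, 10.8, 10.9, 11.5]
ds = directional_symmetry(actual, predicted)
print(ds)
```

Unlike RMSE or MAE, DS ignores the size of the error and only rewards getting the direction right, so it complements rather than replaces the error-based metrics.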

15. The research [45] named ‘Prediction of Stock Price Direction with Trading Indicators using Machine Learning Techniques’ focuses on predicting stock price directions using machine learning and trading indicators. The step-by-step process involved data collection from Yahoo! Finance, followed by feature extraction, labeling stocks as "upwards," "downwards," or "neutral" based on certain criteria, and balancing the dataset to ensure an equal number of records for each class. Feature elimination was also performed to improve model efficiency. Six different classification models were applied, with the Random Forest model yielding the best performance. The whole process is illustrated in figure 8.

Fig -8: Flowchart of proposed model. [45]

The research concludes that all the machine learning models performed well, with Random Forest performing the best.
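The labeling and balancing steps can be sketched as below. The review does not give the labeling criteria or the balancing method used in [45], so the 1% threshold on the next-day change and the random undersampling shown here are assumptions for illustration only.

```python
import random
from collections import Counter


def label_move(today, tomorrow, threshold=0.01):
    """Label the next-day move; the 1% threshold is an assumed criterion."""
    change = (tomorrow - today) / today
    if change > threshold:
        return "upwards"
    if change < -threshold:
        return "downwards"
    return "neutral"


def balance(records, seed=0):
    """Undersample so each class keeps as many records as the rarest one."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append((features, label))
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Toy closing prices (assumed data).
closes = [100, 102, 101.9, 99, 99.5, 103, 104.5, 104.4]
records = [
    ((closes[i],), label_move(closes[i], closes[i + 1]))
    for i in range(len(closes) - 1)
]
balanced = balance(records)
counts = Counter(label for _, label in balanced)
print(counts)
```

Undersampling discards data from the larger classes; oversampling the rarer classes is an alternative when the dataset is small.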

16. The research [46], named ‘Stock Price Prediction Based On Lstm And Bert’ primarily revolves around predicting stock prices for three Chinese listed companies, namely Ping An Bank, ZTE, and MuYuan. The researchers adopt a step-by-step approach in their study. Initially, they gather essential stock market data, including factors like opening and closing prices, volume, and trading amounts, spanning from January 2, 2019, to September 24, 2021. Simultaneously, they collect a substantial amount of text data from various online sources, totaling 67,981 post titles for Ping An Bank, 398,198 for ZTE, and 109,956 for MuYuan. The dataset is then divided into two parts, with 85% earmarked for training and the remaining 15% set aside for testing the model. To refine the data for analysis, the researchers employ sentiment analysis using the BERT model. This involves categorizing the sentiment expressed in the collected text data into three categories: positive, neutral, or negative. The research hinges on the implementation of a specific machine learning model called LSTM (Long Short-Term Memory) to predict stock prices. The LSTM model consists of two layers, and various standard practices, such as dropout layers, the Adam optimizer, and the Mean Squared Error (MSE) loss function, are applied in its configuration. The model's performance is assessed using several key evaluation metrics, including the Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Accuracy. These metrics provide a comprehensive understanding of the model's predictive capabilities. The research concludes that the BERT-LSTM model, which includes sentiment analysis, improves on the performance of the plain LSTM model.

17. The research [47] named ‘A Hybrid Model for Stock Price Prediction using Machine Learning Techniques with CNN’ focuses on predicting stock prices using a hybrid model that combines LSTM and CNN techniques, aiming to benefit both short-term and long-term investors. The research follows several key steps, beginning with the collection of historical stock price data from the Yahoo Finance API, which includes various indicators like opening and closing prices, high and low values, and trading volumes. While the paper mentions data scaling and dataset splitting for training and testing, it does not explicitly detail data cleaning processes, such as addressing missing values or outliers. In terms of model construction, the research builds neural network structures for both LSTM and CNN. The paper concluded that the CNN-LSTM model achieved the highest R-squared (R2) value of 0.90, which is an indicator of its accuracy. This represents an improvement of 2.8% and 0.55% relative to the other two approaches.

18. The research [48] named ‘Preliminary Investigation in the use of Sentiment Analysis in Prediction of Stock Forecasting using Machine Learning’ focuses on improving stock price forecasting by combining sentiment analysis with machine learning models. It conducts three experiments, testing the impact of sentiment analysis scores, historical stock price data, and their combination on prediction accuracy. The results show that combining sentiment analysis with historical data enhances accuracy, with artificial neural network (ANN) classifiers performing best.

19. The research [1] named ‘An Ensemble Learning Model Integrating Short-term Trend and Long-term Trend Used in Stock Price Forecasting’ focuses on developing a stock price prediction model that enhances accuracy by combining short-term and long-term trends using the Support Vector Regression (SVR) model. The process begins with collecting historical stock price data, including opening and closing prices, high and low prices, and trading volume. After cleaning the data by removing invalid entries, a target variable is added to represent the closing price of the next day. The SVR model is then trained on this data for each company. An ensemble learning approach is introduced to combine the short-term and long-term SVR models for more accurate stock price predictions. The combination algorithm is shown in figure 9.

The model's performance is evaluated using Root Mean Squared Error (RMSE) and determination coefficient (R2) for various companies. The results indicate improvements in accuracy with the ensemble learning model. However, the paper does not specify whether the model is better suited for long-term or short-term investors, and it does not provide details on the prediction date range or data processing date range. Additionally, the research does not conclude on specific effective prediction methods for short-term or long-term stock price forecasting.
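Two steps of this pipeline lend themselves to a short sketch: building the next-day closing price target, and combining a short-term and a long-term prediction. The weighted average in `blend` is a hypothetical combination rule used only for illustration; the actual rule of [1] is given in its figure 9 and is not reproduced in this review. The toy rows and field names are likewise assumptions.

```python
def add_next_day_target(rows):
    """Pair each day's features with the following day's closing price."""
    # The last day has no following day, so it yields no sample.
    return [(today, tomorrow["close"]) for today, tomorrow in zip(rows, rows[1:])]


def blend(short_term_pred, long_term_pred, weight=0.5):
    """Hypothetical ensemble rule: weighted average of the two predictions."""
    return weight * short_term_pred + (1 - weight) * long_term_pred


# Toy daily rows in chronological order (assumed data and field names).
rows = [
    {"close": 10.0, "volume": 100},
    {"close": 11.0, "volume": 120},
    {"close": 10.5, "volume": 90},
]
samples = add_next_day_target(rows)
print(samples)
print(blend(10.0, 12.0, weight=0.25))
```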

20. The research [49] named ‘Analysis and Prediction of Stock Price Using Hybridization of SARIMA and XGBoost’ focuses on predicting publicly traded stock prices using machine learning techniques, specifically SARIMA and XGBoost, based on historical data from Yahoo Finance. The study proceeds through several steps: data collection from Yahoo Finance, data pre-processing involving the removal of NULL values to ensure data quality, time series decomposition to understand trend, seasonality, and noise in the data, modeling with SARIMA with some adjustments, testing the model using a separate dataset, and analyzing results for accuracy. All the steps are shown in figure 10.

Fig -10: Algorithm steps of SARIMA-XGBoost hybrid model. [49]

The SARIMA-XGBoost hybrid model achieves an accuracy rate of 89.48%, with a Mean Absolute Error (MAE) of 15.612 and Mean Absolute Percentage Error (MAPE) of 10.52%. However, the research does not explicitly specify whether this predictive model is better suited for short-term or long-term investors, nor does it provide information regarding the specific date range for stock price prediction.
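The reported accuracy (89.48%) and MAPE (10.52%) sum to 100%, which suggests, although this review cannot confirm it, that accuracy was computed as 100% minus MAPE. The sketch below shows MAE and MAPE under their standard definitions together with that assumed accuracy convention; the toy values are not from the paper.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the price."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Toy prices (assumed data).
actual = [100.0, 200.0, 400.0]
predicted = [90.0, 220.0, 400.0]

error_mae = mae(actual, predicted)
error_mape = mape(actual, predicted)
accuracy = 100.0 - error_mape  # assumed convention, see the note above

print(error_mae, error_mape, accuracy)
```

MAE depends on the price level of the stock, whereas MAPE is scale-free, which is why the two are usually reported together.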

21. The research [50] named ‘Stock Market Prediction Using Hidden Markov Model’ focuses on predicting stock market fluctuations using machine learning techniques like Neural Networks, Support Vector Machine (SVM), and Hidden Markov Model (HMM). It begins with an introduction emphasizing the importance of stock market prediction, followed by a literature review that discusses the various methods used in this field. The researchers collect and prepare specific stock datasets for ICICI, SBI, and IDBI, specifying training and testing periods. They implement the HMM for stock prediction and evaluate its accuracy using Mean Absolute Percentage Error (MAPE), presenting MAPE values for the selected stocks. While the research doesn't explicitly mention data cleaning, it highlights the HMM's better accuracy compared to traditional techniques.

Fig -9: Algorithm of ensemble learning model. [1]

22. The research [51], named ‘A Stock Prediction Method Based on Fake Information Identification and Machine Learning’ focuses on a stock prediction method that combines fake information identification with machine learning techniques. It involves a series of steps, beginning with data collection from Twitter comments and stock price data from the Tushare Data API. The gathered data is then pre-processed, which includes tasks like tokenization and stemming, while ensuring data cleanliness. Feature selection methods such as bag-of-words, POS tagging, and word2vec are employed to prepare the textual data for classification. Fake news classification is carried out using five different classifiers, with Logistic Regression being chosen as the best-performing model for distinguishing between true and false news based on the selected features. Additionally, the study explores the accuracy of two stock price prediction models, LSTM and GRU, with the latter demonstrating superior performance in terms of accuracy metrics. However, no detailed algorithms are provided in this research. This approach highlights the potential of integrating fake news detection with machine learning for stock prediction, showcasing GRU as an effective model for accurate stock price forecasting.
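Tokenization and the bag-of-words representation mentioned above can be sketched with the standard library alone. The regular expression, the lowercasing step, and the two sample sentences are assumptions; stemming, POS tagging, and word2vec need dedicated NLP libraries and are left out of this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return a sorted vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up example posts.
docs = ["Stock soars, stock splits!", "Shares fall"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each vector can then be passed to a classifier such as logistic regression; word order is discarded, which is the main limitation of the bag-of-words approach.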

23. The research [52] named ‘Recursive Stock Price Prediction With Machine Learning And Web Scrapping For Specified Time Period’ focuses on using Machine Learning, specifically the Random Forest Regression algorithm, to predict stock prices. It incorporates factors such as open, high, low, and close rates, trading volume, Price to Earning Ratio, Moving Average (MA), and Moving Average Convergence Divergence (MACD) to enhance prediction accuracy. Additionally, web scraping is utilized to gather current market data. The methodology includes data collection from the National Stock Exchange, data pre-processing for cleaning and preparation, model training with Random Forest Regression, and a recursive approach for forecasting long-term future stock prices.

Fig -11: Structure to set up the recursive model dataset. [52]

The implementation details of the recursive model are shown in figure 11. However, the paper focuses only on implementation and does not report the accuracy of long-term predictions.
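The recursive idea, feeding each prediction back in as input for the next step, can be sketched independently of the underlying model. Here a simple mean over the last `window` values stands in for the Random Forest regressor of [52]; that stand-in, the window size, and the toy history are assumptions for illustration.

```python
def recursive_forecast(history, predict_next, steps, window):
    """Forecast `steps` values, appending each prediction to the input series."""
    series = list(history)
    forecasts = []
    for _ in range(steps):
        next_value = predict_next(series[-window:])
        forecasts.append(next_value)
        series.append(next_value)  # the prediction becomes future input
    return forecasts


def mean_of_window(window_values):
    """Stand-in one-step model: the average of the recent values."""
    return sum(window_values) / len(window_values)


forecasts = recursive_forecast(
    [10.0, 12.0, 14.0], mean_of_window, steps=3, window=2
)
print(forecasts)
```

Because every step consumes earlier predictions, errors compound with the horizon, which is why the missing long-term accuracy figures noted above matter.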

24. The research [24] named ‘Application of Singular Spectrum Analysis and Kernel-based Extreme Learning Machine for Stock Price Prediction’ focuses on the application of Singular Spectrum Analysis (SSA) combined with Kernel-based Extreme Learning Machine (KELM) for stock price prediction. It aims to tackle the challenge of accurately predicting stock prices, with a particular focus on improving the speed of prediction. The study uses three different stock price datasets: the Stock Exchange of Thailand index (SET), the Standard & Poor's 500 return index (S&P 500), and the stock market return index of Japan (Nikkei 225). In terms of data preprocessing, the stock price data is first normalized and subjected to SSA to reduce noise and improve data quality. The SSA helps in detrending the data and creating lagged matrices, which are then decomposed into singular values. The reconstructed series are subsequently used for stock price prediction. For the prediction phase, the research employs the Kernel-based Extreme Learning Machine (KELM). Various parameters such as the kernel type, regularization parameter (C), and kernel parameter (σ) are selected through a grid search algorithm. All models are trained with the same parameter settings. The study evaluates the performance of different models, including SSA-KELM, SSA-LSSVM, SSA-SVM, KELM, LSSVM, and SVM, using metrics including root mean squared error (RMSE), mean absolute percentage error (MAPE), mean absolute deviation (MAD), directional symmetry (DS), and training time. The experimental results demonstrate that the SSA-based models outperform non-SSA models in terms of accuracy. Specifically, SSA-KELM exhibits the highest accuracy and the shortest training time among the SSA-based models, making it an efficient model for stock price prediction. This research indicates SSA-KELM’s capability as a swift and accurate tool for stock price forecasting, although specific implementation details like algorithmic flowcharts are not provided.
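The lagged (trajectory) matrix at the start of SSA, and the diagonal averaging that turns a matrix back into a series at the end, can be sketched as follows. In full SSA a singular value decomposition sits between these two steps and only selected components are kept; that step is omitted here, so the round trip simply recovers the input. The window length and toy series are assumptions.

```python
def trajectory_matrix(series, window):
    """Embed a series into a lagged matrix with `window` rows."""
    k = len(series) - window + 1
    return [[series[i + j] for j in range(k)] for i in range(window)]


def diagonal_average(matrix):
    """Average each anti-diagonal to map a matrix back to a series."""
    rows, cols = len(matrix), len(matrix[0])
    sums = [0.0] * (rows + cols - 1)
    counts = [0] * (rows + cols - 1)
    for i in range(rows):
        for j in range(cols):
            sums[i + j] += matrix[i][j]
            counts[i + j] += 1
    return [s / c for s, c in zip(sums, counts)]


# Toy series (assumed data).
series = [1, 2, 3, 4, 5]
matrix = trajectory_matrix(series, window=3)
recovered = diagonal_average(matrix)
print(matrix)
print(recovered)
```

Denoising comes from reconstructing the matrix from only the leading singular components before the averaging step.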

4. CONCLUSIONS

Drawing from the research findings and the theoretical framework of data analysis and machine learning data processing discussed in the literature review, the following conclusions can be made:

Data Collection:

Data for stock price prediction can be obtained from various sources, as observed in the reviewed research papers:

• Yahoo Finance API: This source is frequently used for collecting historical stock price data, encompassing parameters such as opening and closing prices, high and low values, and trading volumes in several research papers.

• Online Text Data: In certain studies, researchers have gathered textual data from diverse online sources. This textual data is valuable for sentiment analysis and acquiring supplementary information related to the stocks under consideration.

• Twitter Comments: In one specific research paper, Twitter comments were collected and harnessed for tasks like identifying fake information and conducting sentiment analysis in the context of stock price prediction.

• Tushare Data API: Another research paper adopted the Tushare Data API to acquire stock price data.

• Web Scraping: In the context of recursive stock price prediction, web scraping was employed as a method for data collection and preprocessing. This approach facilitated the acquisition of up-to-date market data from online sources.

Data Preprocessing:

After the data is downloaded from the data source and prior to modeling, data preprocessing is imperative and includes the following steps, which were identified in the reviewed research papers:

• Data Scaling: Some papers emphasize data scaling as an initial preprocessing step. This entails normalizing or standardizing the data to ensure that all features are on the same scale, which can enhance the performance of machine learning models.

• Data Cleaning: While not explicitly detailed in some papers, data cleaning is foundational. It encompasses tasks such as handling missing values, eliminating duplicates, and addressing outliers to ensure the quality of the dataset.

• Time Series Decomposition: One research paper highlighted time series decomposition as part of data preprocessing. This step helps in understanding the underlying trend, seasonality, and noise components within time series data, which is beneficial for accurate forecasting models.

• Tokenization: In the context of sentiment analysis, tokenization was mentioned as part of text data preprocessing. Tokenization involves segmenting text into individual words or tokens, making it amenable to further analysis.

• Stemming: Stemming, as mentioned in one paper, is a text preprocessing technique that reduces words to their root or base form. This standardizes text data and reduces dimensionality.

• Feature Selection: In a research paper focused on fake news identification, feature selection methods such as bag-of-words, POS tagging, and word2vec were applied. These methods are employed to prepare textual data for classification tasks.

• Data Transformation: In the case of Singular Spectrum Analysis (SSA), a research paper employed data transformation techniques to enhance data quality. This involved detrending the data and generating lagged matrices.

• Feature engineering is a critical step in the data preprocessing phase. It involves selecting and engineering relevant features that can contribute to improved prediction accuracy.

• Techniques such as sentiment analysis, technical indicators, and news sentiment can be explored to gain additional insights.
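Since several of the reviewed papers leave data cleaning unspecified, the following minimal sketch shows one plausible version of it: sorting rows by date, dropping rows with missing values, and removing duplicate dates. The row layout, the field names, and the policy of keeping the first valid row per date are assumptions, and outlier handling is left out.

```python
def clean_rows(rows):
    """Sort by date, drop rows with missing values, keep one row per date."""
    seen_dates = set()
    cleaned = []
    for row in sorted(rows, key=lambda r: r["date"]):
        if any(value is None for value in row.values()):
            continue  # missing value: skip the row
        if row["date"] in seen_dates:
            continue  # duplicate date: keep the first valid row only
        seen_dates.add(row["date"])
        cleaned.append(row)
    return cleaned


# Toy rows with ISO-format dates, which sort chronologically as strings.
raw = [
    {"date": "2023-01-03", "close": 101.0},
    {"date": "2023-01-02", "close": 100.0},
    {"date": "2023-01-03", "close": 101.0},  # duplicate
    {"date": "2023-01-04", "close": None},   # missing close
]
cleaned = clean_rows(raw)
print(cleaned)
```

Dropping rows is the simplest policy; for time series, filling a gap from the previous trading day is a common alternative when continuity matters.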

Model Selection:

Once data preprocessing is done, selecting an appropriate machine learning or deep learning model is pivotal for accurate stock price prediction. The choice of model depends on the purpose, as mentioned in table 1.

Model Training and Evaluation:

Following model selection, the following metrics are critical to evaluate the performance:

• Mean Absolute Error (MAE): Used to evaluate the accuracy of stock price prediction models.

• Mean Absolute Percentage Error (MAPE): Employed to assess prediction accuracy as a percentage.

• Root Mean Squared Error (RMSE): Used to measure the accuracy of prediction models, especially in regression tasks.

• Accuracy: Evaluated in some research papers, especially when the problem involves classification of stock trends (e.g., up, down, or neutral).

• R-squared (R2): Used to measure the goodness of fit in regression models, indicating how well the model fits the data.

• Directional Symmetry (DS): Mentioned in one research paper as an evaluation metric for stock price prediction.

• Incorporating recursive techniques in the analysis can further enhance the predictive capabilities of the model.

While the research papers reviewed here provide valuable insights into the application of machine learning and data analysis techniques for stock price prediction, there are several limitations that should be acknowledged. Firstly, the performance of the proposed models may be highly dependent on the quality and quantity of data used for training and testing. Variability in data sources and data preprocessing methods could impact the generalizability of the models. Additionally, the research papers often do not explicitly address issues related to data stationarity and non-stationarity, which are crucial considerations in time series forecasting tasks. Furthermore, the prediction horizons and investment strategies targeted by the models are not consistently defined, making it challenging to assess their suitability for short-term or long-term investment decisions. Finally, users need further study to gain a deep understanding of the algorithms before applying them.

Future research in the field of stock price prediction using machine learning can address some of these limitations. Research can be done by using standardized datasets and evaluation metrics to enable more meaningful comparisons between different models and enhance the reliability of their findings.

REFERENCES

[1] An ensemble learning model integrating short-term trend and long-term trend used in stock price forecasting. (2020, December 1). IEEE Conference Publication | IEEE Xplore. Retrieved 8 September 2023, from https://ieeexplore.ieee.org/document/9532165

[2] Chen, J. (2023b). What is the stock market, what does it do, and how does it work? Investopedia. Retrieved 8 September 2023, from https://www.investopedia.com/terms/s/stockmarket.asp

[3] Chen, J. (2023a). Stock analysis: Different methods for evaluating stocks. Investopedia. Retrieved 8 September 2023, from https://www.investopedia.com/terms/s/stockanalysis.asp#:~:text=Stock%20analysis%20is%20a%20method,markets%20by%20making%20informed%20decisions.

[4] Segal, T. (2023). Fundamental analysis: principles, types, and how to use it. Investopedia. Retrieved 8 September 2023, from https://www.investopedia.com/terms/f/fundamentalanalysis.asp#:~:text=Fundamental%20analysis%20is%20a%20valuation,in%20and%20its%20financial%20performance.

[5] Hayes, A. (2022). Technical analysis: What it is and how to use it in investing. Investopedia. Retrieved 8 September 2023, from https://www.investopedia.com/terms/t/technicalanalysis.asp

[6] Staff, D. (2023, June 6). 5 Best Short Term Trading Indicators for Technical Analysis - DTTW™. Day Trade The World™. Retrieved 8 September 2023, from https://www.daytradetheworld.com/tradingblog/short-term-trading-indicators/

[7] Onigbanjo, T. (2021, December 31). 3 Simple technical indicators for long-term investing. Medium. Retrieved 8 September 2023, from https://medium.datadriveninvestor.com/3-simpletechnical-indicators-for-long-term-investinga100f02b9bed

[8] What is machine learning? IBM. (2022). Retrieved 8 September 2023, from https://www.ibm.com/cloud/learn/machine-learning

[9] Castillo, D. (2023). Machine learning regression explained. Seldon. Retrieved 8 September 2023, from https://www.seldon.io/machine-learning-regressionexplained#:~:text=Machine%20Learning%20Regression%20is%20a,used%20to%20predict%20continuous%20outcomes.

[10] Linear Regression in Machine learning - Javatpoint. (2023). www.javatpoint.com. Retrieved 8 September 2023, from https://www.javatpoint.com/linear-regression-inmachine-learning

[11] Satyavishnumolakala. (2021, December 14). Linear Regression - Pros & Cons - Satyavishnumolakala - Medium. Medium. Retrieved 8 September 2023, from https://medium.com/@satyavishnumolakala/linearregression-pros-cons-62085314aef0

[12] GeeksforGeeks. (2023). Python implementation of polynomial regression. GeeksforGeeks. Retrieved 8 September 2023, from https://geeksforgeeks.org/python-implementation-ofpolynomial-regression/

[13] Hayes, A. (2023). Multiple Linear Regression (MLR) definition, formula, and example. Investopedia. Retrieved 8 September 2023, from https://www.investopedia.com/terms/m/mlr.asp

[14] Statistics Solutions. (2021, August 11). Assumptions of multiple linear regression - Statistics solutions. Retrieved 8 September 2023, from https://www.statisticssolutions.com/freeresources/directory-of-statisticalanalyses/assumptions-of-multiple-linear-regression/

[15] GeeksforGeeks. (2023a). Lasso vs Ridge vs Elastic Net ML. GeeksforGeeks. Retrieved 8 September 2023, from https://www.geeksforgeeks.org/lasso-vs-ridge-vselastic-net-ml/

[16] Science, N. B. P. a. D. (2021). Understanding ARIMA models for Machine learning. Capital One. Retrieved 9 September 2023, from https://www.capitalone.com/tech/machinelearning/understanding-arima-models/

[17] K-Nearest Neighbor(KNN) algorithm for machine learning - JavatPoint. (2023). www.javatpoint.com. Retrieved 9 September 2023, from https://www.javatpoint.com/k-nearest-neighboralgorithm-for-machinelearning#:~:text=K%2DNN%20algorithm%20assumes%20the,point%20based%20on%20the%20similarity.

[18] A practical introduction to moving average time series model. (2023b, July 15). ProjectPro. Retrieved 9 September 2023, from https://www.projectpro.io/article/moving-averagetime-series-model/716

[19] GeeksforGeeks. (2023c). Introduction to recurrent neural network. GeeksforGeeks. Retrieved 9 September 2023, from https://www.geeksforgeeks.org/introduction-torecurrent-neural-network/

[20] What are LSTM Networks - Javatpoint. (2023). www.javatpoint.com. Retrieved 9 September 2023, from https://www.javatpoint.com/what-are-lstm-networks

[21] Bressler, N. (2023, March 23). How to check the accuracy of your machine learning model. Deepchecks. Retrieved 9 September 2023, from https://deepchecks.com/how-to-check-the-accuracy-ofyour-machine-learningmodel/#:~:text=Accuracy%20score%20in%20machine%20learning%20is%20an%20evaluation%20metric%20that,the%20total%20number%20of%20predictions.

[22]Zhang,W.,Chen,Z.,Miao,J.,&Liu,X.(2022).Research on Graph Neural Network in stock market. Procedia ComputerScience,214,786–792.Retrieve9September 2023, from https://doi.org/10.1016/j.procs.2022.11.242

[23]Parashar,N.(2023,January11).WhatisanAccuracy ScoreandHowtoCheckit?-NileshParashar-Medium. Medium. Retrieve 9 September 2023, from https://medium.com/@niitwork0921/what-is-anaccuracy-score-and-how-to-check-it-13b23eeed6a3 [24]Odmark,J.(2022).Whatisloglossinmachinelearning? Pandio. Retrieve 9 September 2023, from https://pandio.com/what-is-log-loss-in-machinelearning/

[25]Narkhede,S.(2021,June15).UnderstandingConfusion Matrix - towards Data science. Medium. Retrieve 9 September 2023, from https://towardsdatascience.com/understandingconfusion-matrix-a9ad42dcfd62

[26] AUC-ROC curve in machine learning - Javatpoint. (2023). www.javatpoint.com. Retrieve 9 September 2023, from https://www.javatpoint.com/auc-roccurve-in-machine-learning

[27] Great Learning Team. (2022, November 18). Mean squared Error: Definition, applications and examples. Great Learning Blog: Free Resources What Matters to Shape Your Career! Retrieve 9 September 2023, from https://www.mygreatlearning.com/blog/mean-squareerror-explained/

[28] Kundu, R. (2023, April 20). F1 Score in Machine Learning:Intro&Calculation.V7.Retrieve10September 2023, from https://www.v7labs.com/blog/f1-scoreguide#:~:text=The%20F1%20score%20can%20be,%2F macro%2Fweighted%2Fnone.

[29] GeeksforGeeks. (2023b). ML Understanding Data Processing. GeeksforGeeks. Retrieve 10 September 2023, from https://www.geeksforgeeks.org/mlunderstanding-data-processing/

[30]GeeksforGeeks.(2023f).MLOverviewofdatacleaning. GeeksforGeeks. Retrieve 10 September 2023, from https://www.geeksforgeeks.org/data-cleansingintroduction/

[31] GeeksforGeeks. (2021). ML Feature Scaling Part 1. GeeksforGeeks. Retrieve 10 September 2023, from https://www.geeksforgeeks.org/ml-feature-scalingpart-1/

[32]StockPricesPredictionUsingMachineLearning.(2021, September 23). IEEE Conference Publication | IEEE Xplore. Retrieve 10 September 2023, from https://ieeexplore.ieee.org/document/9617222

[33]StockPredictionandanalysisUsingSupervisedMachine Learning Algorithms. (2021, November 26). IEEE Conference Publication | IEEE Xplore. Retrieve 10 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber =9697162

[34]AnalysingtheTrendofStockMarketandEvaluatethe performance of Market Prediction using Machine LearningApproach.(2022,January28).IEEEConference Publication|IEEEXplore.Retrieve10September2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber =9752616

[35] Analysis of Stock Price Prediction using Machine Learning Algorithms. (2022, January 21). IEEE Conference Publication | IEEE Xplore. Retrieve 10 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber =9725888

[36] Prediction of Stock Prices using Machine Learning (Regression,Classification)Algorithms.(2020,June1). IEEEConferencePublication|IEEEXplore.Retrieve10 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber =9154061

[37] Stock price prediction based on multifactorial linear models and machine learning approaches. (2022, December 11). IEEE Conference Publication | IEEE Xplore. Retrieved 10 September 2023, from https://ieeexplore.ieee.org/document/10016086

[38] Short term stock price prediction using deep learning. (2017, May 1). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8256643

[39] Prediction of the Stock Adjusted Closing Price Based On Improved PSO-LSTM Neural Network. (2022, September 9). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9941330

[40] Stock price prediction and recommendation approach based on machine learning. (2022, October 28). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10042922

[41] Stock price Forecasting on telecommunication sector companies in Indonesia Stock Exchange using machine learning algorithms. (2020, October 27). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9320758

[42] Prediction of Trends in Stock Market using Moving Averages and Machine Learning. (2021b, April 2). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9418097

[43] A Novel Approach to Improve Accuracy in Stock Price Prediction using Gradient Boosting Machines Algorithm compared with Naive Bayes Algorithm. (2022, December 16). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10074387

[44] Enhanced Extreme Learning Machine Algorithm with Deterministic Weight Modification for Investment Decision on Indian Stocks. (2022, October 20). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9951899

[45] Prediction of Stock Price Direction with Trading Indicators using Machine Learning Techniques. (2022, December 30). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10150983

[46] Stock Price Prediction Based On LSTM And BERT. (2022, September 9). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9941293

[47] A Hybrid Model for Stock Price Prediction using Machine Learning Techniques with CNN. (2021, October 22). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9702382

[48] Preliminary Investigation in the use of Sentiment Analysis in Prediction of Stock Forecasting using Machine Learning. (2020, March 28). IEEE Conference Publication | IEEE Xplore. Retrieved 11 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9368258

[49] Analysis and prediction of stock price using hybridization of SARIMA and XGBOOST. (2022, March 10). IEEE Conference Publication | IEEE Xplore. Retrieved 12 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9767868

[50] Stock market prediction using Hidden Markov Model. (2014, December 1). IEEE Conference Publication | IEEE Xplore. Retrieved 12 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7065011

[51] A stock prediction method based on fake information identification and machine learning. (2022, October 1). IEEE Conference Publication | IEEE Xplore. Retrieved 12 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10129429

[52] Recursive Stock Price Prediction With Machine Learning And Web Scrapping For Specified Time Period. (2019, December 1). IEEE Conference Publication | IEEE Xplore. Retrieved 12 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8995080

[53] Application of singular spectrum analysis and kernel-based extreme learning machine for stock price prediction. (2016, July 1). IEEE Conference Publication | IEEE Xplore. Retrieved 12 September 2023, from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7748873

[54] Stock Price Prediction using Machine Learning. (2022, March 16). IEEE Conference Publication | IEEE Xplore. https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9752248