NON-STATIONARY BANDIT CHANGE DETECTION-BASED THOMPSON SAMPLING ALGORITHM: A REVIEW


1M.Tech, Electronic and Communication Engineering, GITM, Lucknow, India

2Assistant Professor, Electronic and Communication Engineering, GITM, Lucknow, India

Abstract - Thompson Sampling (TS) is a popular algorithm used in multi-armed bandit problems. In the standard setting, it assumes that the rewards associated with each arm are stationary, which means that the reward distribution remains fixed throughout the experiment. However, in many real-world scenarios, the reward distribution can change over time, and this is known as a non-stationary bandit problem. In this case, the traditional TS algorithm may not perform well. To address this issue, several extensions of the standard TS algorithm have been proposed for non-stationary bandits. One such extension is the Bayesian Online Changepoint Detection (BOCD) algorithm. BOCD uses a Bayesian framework to model the changes in reward distribution and adjust the exploration and exploitation trade-off accordingly. The BOCD algorithm maintains a posterior distribution over the possible locations of the change points in the reward distribution. At each time step, it uses this posterior to compute the probability that a change point has occurred. If the probability of a change point is high, the algorithm explores more to adapt to the new reward distribution. Otherwise, it exploits more to maximize its expected reward.

Another extension of the standard TS algorithm for non-stationary bandits is the Dynamic Thompson Sampling (DTS) algorithm. DTS uses a sliding window approach to detect changes in the reward distribution. The algorithm maintains a separate posterior distribution for each window and selects the arm with the highest expected reward based on the posterior distribution of the current window. In summary, Thompson Sampling is a powerful algorithm for the multi-armed bandit problem, and several extensions can be used to handle non-stationary bandits. These extensions allow the algorithm to adapt to changes in the reward distribution over time and continue to make optimal decisions.

Key Words: Thompson Sampling, Non-Stationary Bandits, Stationary Bandits, Change Detection.

1. INTRODUCTION

The multi-armed bandit (MAB) framework is a classic problem in decision theory and reinforcement learning. It involves an agent (or decision-maker) who must choose between a set of actions (often called "arms"), each of which has an unknown reward distribution. The goal of the agent is to maximize its cumulative reward over time while balancing the exploration of new actions with the exploitation of actions that have already been found to be rewarding.

One common example of the MAB problem is the slot machine or "one-armed bandit" problem. In this scenario, the agent must choose between pulling the levers of several slot machines, each of which has a different payout distribution. The agent must decide how many times to pull each lever to maximize its total payout.

There are many variations of the MAB problem, each with different assumptions and objectives. One common approach is the epsilon-greedy algorithm, which chooses the action with the highest estimated reward with probability 1 − epsilon, and chooses a random action with probability epsilon. This balances the exploitation of known rewarding actions with the exploration of new actions.
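
As a concrete illustration of this rule, the minimal sketch below performs one epsilon-greedy step for arms with numeric rewards; the helper names and the default epsilon = 0.1 are our own illustrative choices, not taken from a specific reference.

```python
import random

def epsilon_greedy_select(counts, sums, epsilon=0.1):
    """Explore with probability epsilon, otherwise play the arm with the best mean."""
    if random.random() < epsilon:
        return random.randrange(len(counts))               # explore: uniform random arm
    means = [s / c if c > 0 else 0.0 for s, c in zip(sums, counts)]
    return max(range(len(means)), key=lambda a: means[a])  # exploit: greedy arm

def update(counts, sums, arm, reward):
    """Record the observed reward for the pulled arm."""
    counts[arm] += 1
    sums[arm] += reward
```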

Other popular algorithms for the MAB problem include UCB1, Thompson Sampling, and EXP3. These algorithms differ in their assumptions about the distribution of rewards and in their strategies for balancing exploration and exploitation.

The MAB problem has many applications in fields such as advertising, finance, and healthcare. For example, in online advertising, advertisers must decide which ads to display to maximize click-through rates while balancing the need to explore new ads with the need to exploit effective ads.

2. UPPER CONFIDENCE BOUND (UCB) ALGORITHM

The Upper Confidence Bound (UCB) algorithm is a popular algorithm for multi-armed bandit problems, which are a class of sequential decision-making problems. In the multi-armed bandit problem, a decision maker must repeatedly choose between a set of actions (or "arms") with uncertain rewards, to maximize their total reward over time.

The UCB algorithm works by balancing the exploration of different actions with their potential rewards. At each time step, the algorithm chooses the action with the highest upper confidence bound, a quantity that combines the estimated reward for that action with the uncertainty in that estimate.

The upper confidence bound is calculated as the sum of two terms: the estimated reward for the action and a confidence interval term that takes into account the uncertainty in the estimate. The confidence interval term typically grows over time to ensure that the algorithm continues to explore new actions, even as the estimates of the rewards for existing actions become more certain.

By balancing exploration and exploitation in this way, the UCB algorithm can achieve good performance in a wide range of multi-armed bandit problems.
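
To make the two terms concrete, the sketch below uses the classical UCB1 index (empirical mean plus sqrt(2 ln t / n_i)); this particular bonus term is the textbook choice rather than something specified above, and the variable names are illustrative.

```python
import math

def ucb1_select(counts, sums, t):
    """Pick the arm with the largest UCB1 index at time step t (t >= 1).

    Index = empirical mean + sqrt(2 * ln(t) / n_i): the confidence term
    stays large for rarely pulled arms, so they keep being explored.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                                  # pull every arm once first
    indices = [
        s / n + math.sqrt(2.0 * math.log(t) / n)
        for s, n in zip(sums, counts)
    ]
    return max(range(len(indices)), key=lambda a: indices[a])
```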

3. DISCOUNTED THOMPSON SAMPLING (DTS) ALGORITHM

Discounted Thompson Sampling (DTS) is a decision-making algorithm that is often used in the context of online advertising, recommendation systems, and other settings where there is uncertainty about the effectiveness of different actions or choices. DTS is a modification of the well-known Thompson Sampling algorithm, which is a Bayesian approach to decision-making that has been widely used in machine learning and artificial intelligence.

The basic idea behind DTS is to take into account the fact that the relevance of past observations may decrease over time, due to drift or decay in the environment. This means that recent observations should count for more than older ones when deciding which action to take. To incorporate this idea into the Thompson Sampling algorithm, DTS uses a discount factor that reduces the weight given to older observations.

In DTS, each action or choice is associated with a probability distribution over the possible rewards. At each step of the algorithm, a random sample is drawn from each of these distributions, and the action with the highest sample is chosen. The distribution for each action is updated based on the observed reward, using Bayesian inference. The discount factor is applied to the observed reward before it is used to update the distribution.
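
One minimal way to realize this for 0/1 (Bernoulli) rewards with Beta posteriors is sketched below. The exact place where the discount enters varies between papers, so treat this parameterization (shrinking the posterior counts toward a Beta(1, 1) prior by a factor gamma before each update) as an illustrative assumption rather than the canonical DTS update.

```python
import random

def dts_select(alphas, betas):
    """Sample a plausible mean from each arm's Beta posterior and play the best draw."""
    draws = [random.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(draws)), key=lambda i: draws[i])

def dts_update(alphas, betas, arm, reward, gamma=0.95):
    """Discount every arm's posterior toward Beta(1, 1), then add the new 0/1 reward.

    Shrinking old counts by gamma makes recent observations dominate the posterior,
    which is what lets the sampler track a drifting reward distribution.
    """
    for i in range(len(alphas)):
        alphas[i] = gamma * alphas[i] + (1.0 - gamma) * 1.0
        betas[i] = gamma * betas[i] + (1.0 - gamma) * 1.0
    alphas[arm] += reward
    betas[arm] += 1.0 - reward
```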

One of the advantages of DTS is that it can handle non-stationary environments, where the underlying probabilities or reward distributions may change over time. The discount factor allows the algorithm to adjust to these changes by placing more weight on recent observations.

DTS is effective in a variety of applications, including online advertising and recommendation systems. However, like all decision-making algorithms, its effectiveness depends on the specific context and the quality of the input data.

3.1. Stochastic Bandits

Stochastic bandits are a class of decision-making problems in which an agent must select actions from a set of options or arms, with uncertain rewards associated with each arm. In other words, the agent doesn't know the true reward distribution of each arm, but it can explore by taking different actions and observing the resulting rewards.

The goal of the agent is to maximize the total reward it receives over a certain period, or to minimize the regret, which is the difference between the expected reward if the agent had chosen the optimal arm from the beginning and the actual reward obtained.
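
Written out (in our own notation, following the usual definition), the expected regret after T rounds is

```latex
R(T) = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
```

where \mu^{*} is the mean reward of the best arm and \mu_{a_t} is the mean reward of the arm a_t played at round t.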

There are several algorithms to solve stochastic bandit problems, including epsilon-greedy, UCB (Upper Confidence Bound), and Thompson sampling. Epsilon-greedy is a simple algorithm that selects the best arm with probability 1 − epsilon and a random arm with probability epsilon. UCB is a more sophisticated algorithm that balances exploration and exploitation by selecting arms with the highest upper confidence bounds. Thompson sampling is a Bayesian algorithm that randomly samples reward distributions for each arm and selects the arm with the highest expected reward according to the samples.
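
For completeness, the standard (undiscounted) Beta-Bernoulli form of Thompson sampling described here differs from the DTS update sketched above only in that past observations are never discounted; a minimal sketch with illustrative names follows.

```python
import random

def ts_select(alphas, betas):
    """Thompson sampling step: draw from each Beta posterior, play the best draw."""
    draws = [random.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(draws)), key=lambda i: draws[i])

def ts_update(alphas, betas, arm, reward):
    """Conjugate Beta-Bernoulli update for a 0/1 reward."""
    alphas[arm] += reward
    betas[arm] += 1 - reward

# Usage: start all K arms from the uniform Beta(1, 1) prior.
K = 3
alphas, betas = [1.0] * K, [1.0] * K
arm = ts_select(alphas, betas)
ts_update(alphas, betas, arm, reward=1)   # the reward would come from the environment
```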

Stochastic bandits have numerous applications in various fields, including recommender systems, online advertising, and clinical trials.

4. LITERATURE REVIEW

In this literature review section, we study previous research papers related to Thompson sampling for different decision-making problems; a summary of each paper is given below:

Joseph: Every genuinely autonomous system must have the capacity to make choices. One approach is to learn decision-making strategies from experience, since this kind of learning may help uncover strategies that are robust against the influence of noise and flexible enough to be used in a variety of settings. The multi-armed bandit problem and the best-arm identification problem (and related variations) have been used as starting points for our exploration of this area. After much trial and error, we've found that Thompson Sampling is an effective method for dealing with the multi-armed bandit problem. This solution is optimal for the Bernoulli bandit problem because it matches the lower bound on expected cumulative regret. In many circumstances, the cost of computing is minimal. By using conjugate priors, MCMC, or variational methods, Thompson Sampling may be used to model a broad range of decision-making settings, including contextual bandit problems. It is adaptable enough to include dependencies across arms and is robust to delayed rewards. To extend the proof of Agrawal and Goyal to the optimistic variant of the approach, we make certain adjustments to the optimism term so that it also satisfies the lower bound on cumulative regret. As an additional finding, we find that the Normal Thompson Sampling approach (one that employs the Normal distribution as the conjugate prior) outperforms the more standard Thompson Sampling strategy (which employs a Beta prior) in terms of empirical performance.

Mellor, Jonathan: Many methods that combine Thompson Sampling with change-point detection have been explored in this work, and their utility has been shown in their designated roles. Bandit scenarios using real-world datasets, such as the Yahoo! dataset and foreign-exchange data, further highlight their usefulness. They are shown to fall short on the PASCAL challenge when pitted against comparably complex but carefully calibrated competing algorithms. Our research, however, lends credence to a pair of methods that are analogous to Adapt-EvE but only keep tabs on changes to the "best" arm: Global-CTS2 and PA-CTS2. Due to its adaptability, the model holds out possibilities for further improvement via the introduction of additional assumptions. It is also worth noting that reward distributions other than the Bernoulli can be accommodated. False alarms are a common issue with change-point detection methods, but a Bayesian approach might help avoid them. Given that they are derived from fundamental models, the algorithms have theoretical support; nonetheless, further research is needed before a complete theoretical account of their performance can be provided.

Alami: We propose Global-STS-CF, an extension of Thompson Sampling to the Switching Corrupted Bandit Problem that combines three distinct Thompson samplers. The experimental results demonstrate the superior performance of the suggested method. Notably, Global-STS-CF competes with an oracle, Corrupted Feedback, that is aware of the change points in advance, posing a challenge to Thompson sampling. These outcomes are a direct consequence of Global-STS-CF's foundation in the Bayesian idea of following the most knowledgeable experts, which enables us to detect and react to shifts in the landscape with remarkable efficiency. By keeping an expert distribution on a per-arm basis, the suggested approach may be easily adapted to the Per-Arm Switching Corrupted Multi-Armed Bandit. Next, we will examine Global-STS-CF through the lens of pseudo-cumulative regret.

Gourab: To keep tabs on the dynamic two-armed bandit environment, we investigate a change-detection-based Thompson Sampling approach (TS-CD). We have established the minimum stationary-regime time window for TS-CD, making it possible to detect the transitions as soon as they occur. Our results show that the proposed TS-CD algorithm converges to asymptotic optimality over a wide range of update rates. To test the efficacy of the strategy, we apply it to the RAT selection problem at the wireless network's edge. We have shown that TS-CD is superior to the standard max-power band selection method and to other bandit algorithms designed for unpredictable environments.

Maarten: We have investigated the online learning to rank (OLTR) problem in a dynamic setting where user tastes change quickly. Here, we provide cascading non-stationary bandits, an online-learning variation of the widely used cascade model (CM) for predicting users' click behavior. The algorithms CascadeDUCB and CascadeSWUCB have been suggested to solve it, and our theoretical analysis proves that they incur sub-linear regret. Our experimental results on the Yandex click dataset corroborate these theoretical predictions. Many new paths for the development of OLTR are opened up. To begin with, we have only considered the CM configuration; future research should take into account other click models, such as DBN [Chapelle and Zhang, 2009], that can process multiple clicks. A second area of interest is UCB-based policies; one alternative is to apply a policy from the softmax family [Besbes et al., 2014]. In this direction, it is possible to obtain upper bounds that are insensitive to the number of transitions.

Giuseppe: We put forward a novel bandit method to solve the issue of learning in dynamic settings. The algorithm demonstrated superior performance over state-of-the-art solutions and the ability to adapt to various patterns of non-stationarity. All-Season is also far easier to use and more sturdy than its competitors, making it well-suited for use in industrial settings and requiring less upkeep. We speculate that there is more work to be done in addressing the model misspecification issue and in choosing which models should be eliminated. In particular, we think that constructing the posterior prediction weights using the General Bayes technique (Knoblauch et al., 2019) might be a more robust option.

Gourab et al.: To account for the unpredictability of real-world environments, we provide a KS-test-based change detection technique in the context of the multi-armed bandit (MAB) framework. We use our KS-test-inspired, actively adaptive TS algorithm, TS-KS, for the MAB problem. TS-KS has sub-linear regret in the two-armed bandit problem. It is important to note that the proposed approach may detect a change even when more sophisticated methods based on mean estimates fail. As seen in two instances, we demonstrate that the TS-KS algorithm outperforms both the actively adaptive TS-CD technique and the passively adaptive D-TS strategy. In addition, the results of the portfolio optimization case study demonstrate that TS-KS is competitive with other leading forecasting algorithms, like Facebook PROPHET and ARIMA.

Zhang: Due to the inherent uncertainty of real-world settings, we provide a KS-test-based change detection approach within the multi-armed bandit (MAB) framework. As an application of our KS-test-inspired, actively adaptive TS algorithm, TS-KS, we tackle the MAB problem. In the two-armed bandit setting, TS-KS attains sub-linear regret. Note that the suggested technique may succeed in spotting a shift when more advanced approaches based on mean estimation fail. Using two examples, we show that the TS-KS algorithm outperforms the actively adaptive TS-CD method and the passively adaptive D-TS approach. Furthermore, the portfolio optimization case study shows that TS-KS can hold its own against other top forecasting algorithms like Facebook PROPHET and ARIMA.

5. CONCLUSION

From the literature reviewed above, we draw the following conclusions. Thompson sampling, also known as the Bayesian bandit algorithm, is a popular algorithm for decision-making problems in various fields such as marketing, medicine, and engineering. It is a probabilistic algorithm that balances exploration and exploitation to find the optimal solution for a given problem. Thompson sampling can be used to optimize decision-making in robotics applications such as autonomous vehicles and drones. It can be used to optimize the recommendation process in recommender systems by choosing the most relevant items to recommend to users. It can also be used to optimize the design of clinical trials by efficiently allocating patients to different treatment groups.

REFERENCES

[1] H. Mohanty, A. U. Rahman, and G. Ghatak. (2020) Thompson Sampling GoF. [Online]. Available: https://github.com/hardhik-99/ThompsomSamplingGoF/

[2] D. Lee, N. He, P. Kamalaruban, and V. Cevher, "Optimization for reinforcement learning: From a single agent to cooperative agents," IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 123–135, 2020.

[3] D. Bouneffouf and I. Rish, "A survey on practical applications of multi-armed and contextual bandits," arXiv preprint arXiv:1904.10040, 2019.

[4] P. Englert and M. Toussaint, "Combined optimization and reinforcement learning for manipulation skills," in Robotics: Science and Systems, vol. 2016, 2016.

[5] G. Burtini, J. Loeppky, and R. Lawrence, "A survey of online experiment design with the stochastic multi-armed bandit," arXiv preprint arXiv:1510.00757, 2015.

[6] A. Lesage-Landry and J. A. Taylor, "The multi-armed bandit with stochastic plays," IEEE Transactions on Automatic Control, vol. 63, no. 7, pp. 2280–2286, 2017.

[7] S. Li et al., "Collaborative filtering bandits," in Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 539–548.

[8] S. Buccapatnam, F. Liu, A. Eryilmaz, and N. B. Shroff, "Reward maximization under uncertainty: Leveraging side-observations on networks," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 7947–7980, 2017.

[9] A. U. Rahman and G. Ghatak, "A beam-switching scheme for resilient mm-wave communications with dynamic link blockages," in IEEE WiOpt, 2019.

[10] Q. Zhu and V. Y. Tan, "Thompson sampling algorithms for mean-variance bandits," arXiv preprint arXiv:2002.00232, 2020.

[11] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis, "Parallel Gaussian process optimization with upper confidence bound and pure exploration," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 225–240.

[12] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3/4, pp. 285–294, 1933.

[13] O.-C. Granmo, "Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton," International Journal of Intelligent Computing and Cybernetics, 2010.

[14] S. Agrawal and N. Goyal, "Analysis of Thompson sampling for the multi-armed bandit problem," in Conference on Learning Theory, 2012, pp. 39.1–39.26.

[15] A. Garivier and E. Moulines, "On upper-confidence bound policies for non-stationary bandit problems," arXiv preprint arXiv:0805.3415, 2008.

[16] O. Besbes et al., "Stochastic multi-armed-bandit problem with non-stationary rewards," in Advances in Neural Information Processing Systems, 2014, pp. 199–207.

[17] C. Hartland, S. Gelly, N. Baskiotis, O. Teytaud, and M. Sebag, "Multi-armed bandit, dynamic environments and meta-bandits," 2006.

[18] J. Y. Yu and S. Mannor, "Piecewise-stationary bandit problems with side observations," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1177–1184.

[19] Y. Cao, Z. Wen, B. Kveton, and Y. Xie, "Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit," in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 418–427.

[20] O. Besbes et al., "Optimal exploration-exploitation in a multi-armed bandit problem with non-stationary rewards," Stochastic Systems, vol. 9, no. 4, pp. 319–337, 2019.

[21] J. Mellor and J. Shapiro, "Thompson sampling in switching environments with Bayesian online change detection," in Artificial Intelligence and Statistics, 2013, pp. 442–450.

[22] V. Srivastava, P. Reverdy, and N. E. Leonard, "Surveillance in an abruptly changing world via multiarmed bandits," in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 692–697.

[23] G. Ghatak, "A change-detection based Thompson sampling framework for non-stationary bandits," IEEE Transactions on Computers, pp. 1–1, 2020.

[24] P. Auer, P. Gajane, and R. Ortner, "Adaptively tracking the best bandit arm with an unknown number of distribution changes," in Conference on Learning Theory, 2019, pp. 138–158.

[25] S. S. Villar et al., "Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges," Statistical Science: A Review Journal of the Institute of Mathematical Statistics, vol. 30, no. 2, p. 199, 2015.

[26] K. Jahanbakhsh, "Applying multi-armed bandit algorithms to computational advertising," arXiv preprint arXiv:2011.10919, 2020.

[27] V. W. Berger and Y. Zhou, "Kolmogorov–Smirnov test: Overview," Wiley StatsRef: Statistics Reference Online, 2014.

[28] P. Massart, "The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality," The Annals of Probability, pp. 1269–1283, 1990.

[29] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[30] N.-N. Dao, T.-T. Nguyen, M.-Q. Luong, T. Nguyen-Thanh, W. Na, and S. Cho, "Self-calibrated edge computation for unmodeled time-sensitive IoT offloading traffic," IEEE Access, vol. 8, pp. 110316–110323, 2020.

[31] A. U. Rahman, G. Ghatak, and A. De Domenico, "An online algorithm for computation offloading in non-stationary environments," IEEE Communications Letters, vol. 24, no. 10, pp. 2167–2171, 2020.

[32] F. Black and R. Litterman, "Global portfolio optimization," Financial Analysts Journal, vol. 48, no. 5, pp. 28–43, 1992.

[33] D. Gawlik. (2017) New York Stock Exchange: S&P 500 companies historical prices with fundamental data. [Online]. Available: https://kaggle.com/dgawlik/nyse

[34] H. A. Al-Zeaud, "Modelling and forecasting volatility using ARIMA model," European Journal of Economics, Finance & Administrative Science, vol. 35, pp. 109–125, 2011.

[35] S. J. Taylor and B. Letham, "Forecasting at scale," The American Statistician, vol. 72, no. 1, pp. 37–45, 2018.
