Skip to main content

NON-STATIONARY BANDIT CHANGE DETECTION-BASED THOMPSON SAMPLING ALGORITHM: A REVIEW

Page 1

International Research Journal of Engineering and Technology (IRJET)

e-ISSN: 2395-0056

Volume: 10 Issue: 04 | Apr 2023

p-ISSN: 2395-0072

www.irjet.net

NON-STATIONARY BANDIT CHANGE DETECTION-BASED THOMPSON SAMPLING ALGORITHM: A REVIEW Md Arif1, Mr. Nadeem Ahmad2 1M.Tech, Electronic and Communication Engineering, GITM, Lucknow, India

2Assistant Professor Electronic and Communication Engineering, GITM, Lucknow, India

---------------------------------------------------------------------***---------------------------------------------------------------------

Abstract - Thompson Sampling (TS) is a popular algorithm

the exploration of new actions with the exploitation of actions that have already been found to be rewarding.

used in multi-armed bandit problems. In the standard setting, it assumes that the rewards associated with each arm are stationary, which means that the reward distribution remains fixed throughout the experiment. However, in many real-world scenarios, the reward distribution can change over time, and this is known as a non-stationary bandit problem. In this case, the traditional TS algorithm may not perform well. To address this issue, several extensions of the standard TS algorithm have been proposed for non-stationary bandits. One such extension is the Bayesian Online Changepoint Detection (BOCD) algorithm. BOCD uses a Bayesian framework to model the changes in reward distribution and adjust the exploration and exploitation trade-off accordingly. The BOCD algorithm maintains a posterior distribution over the possible locations of the change points in the reward distribution. At each time step, it uses this posterior to compute the probability that a change point has occurred. If the probability of a change point is high, the algorithm explores more to adapt to the new reward distribution. Otherwise, it exploits more to maximize its expected reward.

One common example of the MAB problem is the slot machine or "one-armed bandit" problem. In this scenario, the agent must choose between pulling the levers of several slot machines, each of which has a different payout distribution. The agent must decide how many times to pull each lever to maximize its total payout. There are many variations of the MAB problem, each with different assumptions and objectives. One common approach is the epsilon-greedy algorithm, which chooses the action with the highest estimated reward with probability 1epsilon, and chooses a random action with probability epsilon. This balances the exploitation of known rewarding actions with the exploration of new actions. Other popular algorithms for the MAB problem include UCB1, Thompson Sampling, and EXP3. These algorithms differ in their assumptions about the distribution of rewards, and their strategies for balancing exploration and exploitation.

Another extension of the standard TS algorithm for nonstationary bandits is the Dynamic Thompson Sampling (DTS) algorithm. DTS uses a sliding window approach to detect changes in the reward distribution. The algorithm maintains a separate posterior distribution for each window and selects the arm with the highest expected reward based on the posterior distribution of the current window. In summary, Thompson Sampling is a powerful algorithm for the multiarmed bandit problem, and several extensions can be used to handle non-stationary bandits. These extensions allow the algorithm to adapt to changes in the reward distribution over time and continue to make optimal decisions.

The MAB problem has many applications in fields such as advertising, finance, and healthcare. For example, in online advertising, advertisers must decide which ads to display to maximize click-through rates while balancing the need to explore new ads with the need to exploit effective ads.

2. UPPER CONFIDENCE BOUND (UCB) ALGORITHM The Upper Confidence Bound (UCB) algorithm is a popular algorithm for multi-armed bandit problems, which are a class of sequential decision-making problems. In the multiarmed bandit problem, a decision maker must repeatedly choose between a set of actions (or "arms") with uncertain rewards, to maximize their total reward over time.

Key Words: Sampling, Algorithm for Non-Stationary Bandits, Stationary Bandits, detection based.

1. INTRODUCTION

The UCB algorithm works by balancing the exploration of different actions with their potential rewards. At each time step, the algorithm chooses the action with the highest upper confidence bound, which is a measure of the uncertainty of the estimated reward for that action.

The multi-armed bandit (MAB) framework is a classic problem in decision theory and reinforcement learning. It involves an agent (or decision-maker) who must choose between a set of actions (often called "arms"), each of which has an unknown reward distribution. The goal of the agent is to maximize its cumulative reward over time while balancing

© 2023, IRJET

|

Impact Factor value: 8.226

The upper confidence bound is calculated as the sum of two terms: the estimated reward for the action and a confidence interval term that takes into account the uncertainty in the

|

ISO 9001:2008 Certified Journal

|

Page 365


Turn static files into dynamic content formats.

Create a flipbook