International Research Journal of Engineering and Technology (IRJET)
e-ISSN: 2395-0056
Volume: 10 Issue: 06 | Jun 2023
p-ISSN: 2395-0072
www.irjet.net
NON-STATIONARY BANDIT CHANGE DETECTION-BASED THOMPSON SAMPLING ALGORITHM Md Arif1, Mr. Nadeem Ahmad2 1M.Tech, Electronic and Communication Engineering, GITM, Lucknow, India
2Assistant Professor Electronic and Communication Engineering, GITM, Lucknow, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - Many rigorous mathematical approaches and
of a gambler standing in front of a row of slot machines, each with a different payout rate. The gambler must choose which machines to play and in what order while trying to maximize their overall payout.
optimum solutions to the stationary multi-armed bandit (MAB) paradigm may be found in the literature. However, the MAB issue is famously difficult to analyse for a non-stationary environment, i.e. when the reward distribution fluctuates over time. There are two main methods that have been suggested to combat non-stationary bandit problems: i) passively adaptable methods, which can be easily analysed, and ii) actively adaptive methods, which monitor their surroundings and make adjustments in response to any changes they discover. Researchers have responded by developing new bandit algorithms that build on previously established methods, such as the sliding-window upper-confidence bound (SW-UCB), the dynamic upper-confidence bound (d-UCB), the discounted upper-confidence bound (D-UCB), the discounted Thompson sampling (DTS), etc. For this reason, we focus on the piecewise stationary setting, in which the reward distribution is held constant for some period of time before changing at some arbitrary moment. For this context, we offer the TS-CD family of change-detection-based, actively-adaptive TS algorithms. In specifically, a Poisson arrival process is used to mimic the non-stationary environment, which adjusts the reward distribution with each new arrival. We use the Kolmogorov-Smirnov test (KS-test) and the Anderson-Darling test (AD-test) as Goodness-of-fit tests to identify the shift. When TS detects a shift, it either updates the algorithm's parameters or penalises previous successes. We have conducted experiments on edge-control of i) multiconnectivity1 and ii) RAT selection in a wireless network to evaluate the efficacy of the proposed method. We have compared the TS-CD algorithms to other bandit algorithms including D-UCB, discounted Thompson sampling (DTS), and change detection-based UCB (CD-UCB), all of which are optimised for dynamic situations. We demonstrate the higher performance of the proposed TS-CD in the studied applications via comprehensive simulations.
In the MAB framework, each arm corresponds to a choice or action that can be taken, and the rewards associated with each arm are random variables with unknown distributions. The goal is to learn which arm(s) have the highest expected reward, based on a limited number of trials or samples. The challenge is to balance exploration (trying out different arms to learn their reward distributions) with exploitation (selecting the arm with the highest expected reward based on the current knowledge). Several algorithms have been developed for solving the MAB problem, including the epsilon-greedy algorithm, the upper confidence bound (UCB) algorithm, and Thompson sampling. These algorithms differ in their strategies for balancing exploration and exploitation. The MAB framework has applications in many fields, including online advertising, clinical trials, and recommendation systems. In each case, the problem involves making decisions with uncertain outcomes and limited resources. By using MAB algorithms to optimize decisionmaking, it is possible to improve performance and increase efficiency in a wide range of settings.
1.1. THOMPSON SAMPLING ALGORITHM Thompson Sampling is a popular algorithm for solving the multi-armed bandit problem. It is a Bayesian approach that balances exploration and exploitation by selecting actions according to their probabilities of being optimal, based on the observed data. The basic idea of Thompson Sampling is to model the reward distribution for each arm as a probability distribution, such as a beta distribution for binary rewards or a Gaussian distribution for continuous rewards. The algorithm then maintains a posterior distribution over the parameters of each distribution, which is updated as new data is observed.
Key Words: Decision-making, Contextual bandits, Probability distribution, Empirical performance, Convergence
1. INTRODUCTION
At each iteration, Thompson Sampling samples a set of parameters from the posterior distribution for each arm and selects the arm with the highest expected reward based on the sampled parameters. This balancing of exploration and exploitation is achieved by the probabilistic nature of the
The multi-armed bandit (MAB) framework is a classical problem in decision-making that involves making a series of choices among several alternatives (or "arms") with uncertain rewards. The problem gets its name from the idea
© 2023, IRJET
|
Impact Factor value: 8.226
|
ISO 9001:2008 Certified Journal
|
Page 619