# What is Thompson sampling?

by Stephen M. Walker II, Co-Founder / CEO

## What is Thompson sampling?

Thompson sampling is like a smart betting strategy for a row of slot machines: as you play, the strategy learns which machines pay out more often. It picks the next machine to pull based on the wins and losses from previous rounds.

Thompson sampling is a heuristic algorithm for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. At each step, it draws a belief about the environment at random from a probability distribution and selects the action that maximizes the expected reward under that belief. The algorithm maintains this distribution over the unknown reward parameters of each action and updates it based on the rewards obtained.

The method is named after William R. Thompson, who first described it in 1933. It has since been rediscovered independently numerous times in the context of multi-armed bandits.

Key elements of Thompson sampling include:

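• A likelihood function, which gives the probability of a reward given an action and the parameters
• A set of parameters of the reward distribution
• A prior distribution over those parameters
• Past observations of actions and the rewards they produced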
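
One way to see how these elements fit together is to write them down symbolically. The notation below is an assumption introduced for illustration (the symbols $\theta$ for the parameters, $a$ for an action, $r$ for a reward, and $\mathcal{D}$ for the past observations do not come from the original text):

$$
\begin{aligned}
\text{likelihood:}\quad & p(r \mid a, \theta) \\
\text{prior:}\quad & p(\theta) \\
\text{past observations:}\quad & \mathcal{D} = \{(a_1, r_1), \dots, (a_t, r_t)\} \\
\text{posterior:}\quad & p(\theta \mid \mathcal{D}) \propto p(\theta) \prod_{i=1}^{t} p(r_i \mid a_i, \theta) \\
\text{action choice:}\quad & \tilde{\theta} \sim p(\theta \mid \mathcal{D}), \qquad a_{t+1} = \arg\max_{a} \, \mathbb{E}[\, r \mid a, \tilde{\theta} \,]
\end{aligned}
$$

The last line is the "randomly drawn belief": a single sample $\tilde{\theta}$ from the posterior is treated as if it were the truth, and the action that looks best under it is played.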

Thompson sampling has been applied in various domains, such as online advertising, clinical trials, and recommendation systems. The algorithm is particularly well-suited for problems where the space of possible actions is large or unknown, and exploration is costly.

## How does Thompson sampling work?

Thompson Sampling is a reinforcement learning algorithm used to address the exploration-exploitation dilemma in sequential decision-making problems, such as the multi-armed bandit problem. The algorithm balances between exploiting what is known to maximize immediate performance and investing to accumulate new information that may improve future performance.

The basic idea of Thompson Sampling is to choose each action with a probability equal to the probability that it is the best action, given what has been observed so far. To do this, it maintains a posterior distribution over the unknown reward parameters. At each timestep, the algorithm draws one sample from this distribution, takes the action that would be best if the sample were true, and then updates the distribution based on the reward it receives.
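
The "probability of being the best action" can be estimated by simulation. The following is a minimal sketch, assuming win/lose rewards with a Beta belief per arm; the function name `prob_best` and the example `(alpha, beta)` pairs are hypothetical and chosen only for illustration. Each draw mimics one Thompson Sampling decision, so the fraction of draws an arm wins approximates how often the algorithm would pick it.

```python
import random

def prob_best(posteriors, n_draws=20000, seed=0):
    """Estimate, by Monte Carlo, the probability that each arm is the best.

    posteriors: list of (alpha, beta) pairs, one Beta belief per arm.
    """
    rng = random.Random(seed)
    wins = [0] * len(posteriors)
    for _ in range(n_draws):
        # One draw per arm from its Beta belief, exactly as a single
        # Thompson Sampling decision would do.
        samples = [rng.betavariate(a, b) for a, b in posteriors]
        wins[samples.index(max(samples))] += 1
    return [w / n_draws for w in wins]

# Hypothetical beliefs: arm 0 looks weak, arms 1 and 2 look stronger.
posteriors = [(3, 7), (6, 4), (20, 10)]
probs = prob_best(posteriors)
print(probs)
```

An arm that is probably worse still gets a nonzero share of the draws, which is where the exploration comes from.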

Here's a simplified step-by-step process of how Thompson Sampling works:

1. Start with a prior distribution over the parameters. This represents the initial beliefs about the environment.
2. For each round, sample a set of parameters from the current distribution (the prior in the first round, the posterior afterwards).
3. Take the action that is best under these sampled parameters and observe the reward.
4. Update the distribution with the observed reward using Bayes' rule. The result is the posterior distribution, which will be used in the next round.
5. Repeat the process.
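
The steps above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: it assumes rewards are 0 or 1 and uses a uniform Beta(1, 1) prior per arm, and the function name `thompson_sampling` and the simulated payout probabilities are hypothetical.

```python
import random

def thompson_sampling(true_probs, n_rounds, seed=0):
    """Beta-Bernoulli Thompson sampling on a simulated bandit."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    # Step 1: uniform Beta(1, 1) prior on each arm's success probability.
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Step 2: sample one plausible success rate per arm from the
        # current belief (prior at first, posterior afterwards).
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        # Step 3: play the arm that is best under the sample; observe 0 or 1.
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Step 4: Bayesian update; for a Beta belief this is just counting.
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    # Step 5 is the loop itself.
    return pulls, alpha, beta

# Simulated environment (assumed values): arm 2 pays out most often.
pulls, alpha, beta = thompson_sampling([0.2, 0.5, 0.8], n_rounds=2000)
print(pulls)
```

In a run like this, most pulls should end up on the arm with the highest payout probability, while the weaker arms are tried only often enough to rule them out.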

In the context of the multi-armed bandit problem, each arm or slot machine has a reward distribution. The algorithm starts with a prior belief about these distributions. When an arm is pulled, the reward is observed and the belief about that arm's reward distribution is updated. The arm to pull in the next round is chosen by sampling from these updated distributions, selecting the arm with the highest sampled value.
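
For the common special case of win/lose rewards (an assumption here, since the text does not fix a reward model), the belief about arm $k$ can be kept as a Beta distribution over its payout probability $\theta_k$, and the update reduces to counting wins and losses:

$$
\theta_k \sim \mathrm{Beta}(\alpha_k, \beta_k), \qquad (\alpha_k, \beta_k) \leftarrow (\alpha_k + r,\ \beta_k + 1 - r) \quad \text{after observing } r \in \{0, 1\} \text{ on arm } k .
$$

For example, starting from the uniform prior $\mathrm{Beta}(1, 1)$, one win followed by two losses gives $\mathrm{Beta}(2, 3)$, whose mean is $2/5$.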

Thompson Sampling has been used in various practical applications such as product recommendation, active learning with neural networks, and reinforcement learning in Markov decision processes. It's also used in industry for tasks like optimizing website layouts, improving recommendation systems, and enhancing the quality of video uploads.

One of the key advantages of Thompson Sampling is that it explores less as more information is gathered: as the posterior narrows, the samples concentrate around the estimated values and the algorithm increasingly exploits the best arm. It is also computationally efficient when the posterior can be updated in closed form, and it can handle complex information structures.

## What are the benefits of using Thompson sampling?

Thompson Sampling is a popular method used in reinforcement learning and multi-armed bandit problems. It offers several benefits:

• Minimizing Cumulative Regret — Thompson Sampling is widely used in industry as it is one of the best methods to allocate arms in terms of minimizing cumulative regret. Cumulative regret is the total, summed over all rounds, of the difference between the mean reward of the optimal arm and the mean reward of the arm actually played.

• Reduced Exploration — Thompson Sampling can help reduce the amount of exploration needed to find the optimal policy. Because it samples from the posterior, arms that are unlikely to be optimal are tried less and less often as evidence accumulates.

• Avoidance of Local Optima — The algorithm constantly re-evaluates the space of possible reward functions, which can help avoid local optima.

• Simplicity and Extensibility — Thompson Sampling is relatively simple to implement and can be easily extended to work with more complex environments. For example, it can be extended to work with non-stationary environments using a dynamic programming approach.

• Handling Delayed Feedback — Unlike deterministic algorithms such as Upper Confidence Bound (UCB), Thompson Sampling is randomized, so it keeps choosing a variety of arms even when no new feedback has arrived. This means you can update the posterior in batches rather than after every single observation, which saves the computing cost of updating each time.

• Strong Empirical Performance — Thompson Sampling has seen a surge of interest among industry practitioners and academics due to its strong empirical performance. It has been successfully applied in a wide variety of domains, including revenue management, marketing, and website optimization.
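
Two of the points above, cumulative regret and batched updates under delayed feedback, can be illustrated together. The sketch below assumes 0/1 rewards with Beta beliefs and a simulated environment whose true means are known, which is what makes regret computable; the function name `batched_thompson` and the parameter `batch_size` are hypothetical.

```python
import random

def batched_thompson(true_means, n_rounds, batch_size, seed=0):
    """Thompson sampling whose beliefs are updated only once per batch.

    Returns the cumulative (expected) regret after each round.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1.0] * n_arms
    beta = [1.0] * n_arms
    best_mean = max(true_means)
    pending = []       # feedback observed but not yet applied to the beliefs
    regret = 0.0
    regret_curve = []
    for t in range(1, n_rounds + 1):
        # Sampling is random, so choices still vary between batch updates.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_means[arm] else 0
        pending.append((arm, reward))
        # Regret of this round: gap between the best arm's mean and the
        # mean of the arm actually played, accumulated over rounds.
        regret += best_mean - true_means[arm]
        regret_curve.append(regret)
        if t % batch_size == 0:
            # Apply the whole batch of delayed feedback at once.
            for a, r in pending:
                alpha[a] += r
                beta[a] += 1 - r
            pending.clear()
    return regret_curve

curve = batched_thompson([0.2, 0.5, 0.8], n_rounds=2000, batch_size=50)
print(round(curve[-1], 1))
```

For comparison, picking arms uniformly at random in this simulated setup would accumulate an expected regret of 0.3 per round, or about 600 over 2000 rounds.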

However, like all algorithms, Thompson Sampling has its own set of potential drawbacks. It's important to understand these limitations and consider them when deciding whether to use Thompson Sampling in a specific context.

## What are some potential drawbacks of using Thompson sampling?

Thompson sampling, while a powerful tool for addressing the exploration-exploitation dilemma in reinforcement learning, does have some potential drawbacks:

1. Computational Intensity — Thompson sampling can be computationally intensive, especially in large or complex environments where the posterior cannot be updated in closed form and must be approximated. This can make it less suitable for applications where computational resources are limited or where decisions need to be made quickly.

2. Bias Towards Exploration — Thompson sampling can be biased towards exploration, which may slow down the convergence to the optimal solution compared to other algorithms.

3. Dependence on Prior Distributions — The performance of Thompson sampling can be significantly influenced by the choice of prior distributions. If the priors are not well-chosen, the algorithm may perform poorly. This is particularly relevant when dealing with non-binary or non-standard reward distributions.

4. Approximate Sampling — It is not well understood how approximate sampling affects the regret guarantees of Thompson sampling. Traditional treatments of the algorithm often assume that the prior and the reward likelihood form a conjugate pair, so that the posterior can be sampled exactly, which may not always be the case.

5. Suboptimal for Best Arm Identification — If the aim is to identify the best arm (i.e., the action with the highest expected reward), Thompson sampling may not be the most efficient method. Other methods, such as Upper Confidence Bound (UCB) algorithms, may perform better in this regard.
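
To make the dependence on priors (item 3) concrete, the sketch below compares the belief about a single arm after the same small amount of data under three different priors. It assumes 0/1 rewards with a Beta prior; the helper `posterior_mean` and the specific prior values are hypothetical choices for illustration.

```python
def posterior_mean(prior_alpha, prior_beta, successes, failures):
    """Mean of the Beta posterior after observing the given 0/1 outcomes."""
    alpha = prior_alpha + successes
    beta = prior_beta + failures
    return alpha / (alpha + beta)

# Same evidence in every case: 3 wins and 1 loss.
successes, failures = 3, 1
priors = {
    "uniform Beta(1, 1)": (1, 1),
    "pessimistic Beta(1, 20)": (1, 20),
    "optimistic Beta(20, 1)": (20, 1),
}
for name, (a, b) in priors.items():
    mean = posterior_mean(a, b, successes, failures)
    print(f"{name}: posterior mean = {mean:.2f}")
# uniform Beta(1, 1): posterior mean = 0.67
# pessimistic Beta(1, 20): posterior mean = 0.16
# optimistic Beta(20, 1): posterior mean = 0.92
```

With little data, a strong prior dominates the estimate, so an arm given a badly chosen pessimistic prior may be sampled rarely for a long time before the evidence corrects the belief.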

It's important to note that despite these potential drawbacks, Thompson sampling is still widely used due to its simplicity, flexibility, and effectiveness in many scenarios. The choice of whether to use Thompson sampling should be informed by the specific requirements and constraints of the problem at hand.

## How can Thompson sampling be used in AI applications?

Thompson Sampling, a Bayesian method in reinforcement learning, chooses actions by sampling from a distribution over possible reward functions, reflecting current beliefs and enabling informed exploration. This approach is efficient in both bandit problems and broader reinforcement learning contexts, though it may lead to suboptimal exploration in the latter.

Its practical effectiveness is evidenced by its use in industry: DoorDash optimizes Dasher messaging by learning responsiveness patterns; Amazon increased website conversion by 21% in a week by selecting optimal layouts; Facebook's Constrained Thompson Sampling enhances video upload quality; and Netflix incorporates it into its recommendation systems.

As a tool for AI applications, Thompson Sampling excels in reinforcement learning and decision-making by effectively balancing exploration with exploitation, avoiding local optima, and handling complex environments.
