# Multi-armed bandit problem: the explore vs. exploit dilemma

A method called Thompson sampling for the multi-armed bandit problem is illustrated here:

§ 3.8 Example: Bayesian Bandits

In the description of this simulated experiment, I cannot understand which step of a hypothetical real experiment this instruction corresponds to:

```julia
_, mxidx = findmax([sample_bandit(bandit) for bandit in bandits])
```

More precisely, I am missing the real-world counterpart of step 2 below (the `sample_bandit()` function in the code).

1. First, assign some probability distribution to your knowledge of the success probability of each slot machine.
2. Sample randomly from each of these distributions and check which is the maximum sampled probability.
3. Pull the arm of the machine corresponding to that maximum value.
4. Update the probability with the result of the experiment.
5. Repeat from step 2.

Wow, I don’t like the author’s code style; it’s very unidiomatic. It doesn’t help that `beta_bandit` (which should be in CamelCase, btw) encapsulates data pertaining to the bandit (`p`) as well as data pertaining to observations of (and thus beliefs about) the bandit (`a`, `b`, `N`). These are not intrinsic to the bandit, but to our experience with it. The bandit is a conceptually distinct object from our beliefs about it; to avoid confusion, they should not be the same structure!
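One way to draw that separation, sketched in Python (the thread's code is Julia) with hypothetical type names of my own: the machine owns only `p`, while the observer owns the counts.

```python
from dataclasses import dataclass
import random

@dataclass
class Bandit:
    # Intrinsic to the machine: its true (hidden) win probability.
    p: float

    def pull(self, rng):
        """One real play: win with probability p."""
        return rng.random() < self.p

@dataclass
class Belief:
    # Intrinsic to *us*: what we have observed about one bandit.
    a: int = 1  # wins observed (+1 from the uniform Beta(1, 1) prior)
    b: int = 1  # losses observed (+1 from the uniform prior)

    def update(self, win):
        if win:
            self.a += 1
        else:
            self.b += 1

    def sample(self, rng):
        """Draw one plausible win probability given our experience so far."""
        return rng.betavariate(self.a, self.b)
```

With this split, an experiment holds one `Belief` per `Bandit`, and the simulation part of the algorithm only ever touches the beliefs.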

When you trial the bandit, you build experience by keeping a running score of `a` (# of wins) and `b` (# of losses), and the purpose of this is to build a belief of what odds you should *expect* to have when playing this bandit, based upon your (randomized) observations. Hopefully you will correctly infer which is the winning bandit and play that one the most, without wasting too much time on losing bandits.

So what are `a`, `b`, and the beta distribution? These are not properties of the bandit; they are properties of your observations. They reflect the observed trial results, and thus the odds of winning or losing you can infer (i.e. believe) yourself to have, as well as the confidence you should assign to this belief based on your accumulated experience. (With no experience the distribution is broad, but with experience it gets tighter and more confident, as you iteratively form new posterior distributions by combining each new observation with the previous prior; the beta distribution is the conjugate prior for these win/lose trials, which is why the update is just a count increment.)
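That tightening can be checked from the closed-form mean and variance of a Beta(a, b) distribution; the counts below are made up purely for illustration (Python sketch).

```python
# Closed-form moments of a Beta(a, b) belief: with few observations the
# distribution is broad; as a + b grows it tightens around the observed rate.
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Same observed 50% win rate, but different amounts of experience:
early = (2, 2)      # 1 win, 1 loss on top of a Beta(1, 1) prior
late = (101, 101)   # 100 wins, 100 losses on top of the same prior

# beta_mean is 0.5 in both cases, but beta_var(*late) is roughly 40x smaller:
# same belief about the odds, far more confidence in it.
```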

To answer your question: What does it mean to sample this distribution? In essence, it means to simulate what the bandit payout could be, in a randomized fashion incorporating your previous experience, in a way that isn’t just a binary “win” or “lose.” Then you pick which bandit you simulated to be best, try it, and update your belief distribution with the new observation accordingly. Initially this algorithm will explore all the bandits evenly, but over time it will build firmer beliefs about which bandits are better and try to exploit those more.

Thank you very much for the explanation and the time dedicated to going deeper.
Let’s see if I understand something.
I try to imagine the situation as someone who knows nothing about distribution functions and the like might (this could be me with probability 1).
To choose which bandit to try, at each experiment I roll three dice, each associated with a specific bandit.
Then I pull the arm of the bandit whose die “wins” (rolls highest).
Initially the dice all have the same “geometry”. After each experiment I update the “geometry” of that die (trick the die) so as to increase the result of its next roll if the experiment was successful (or decrease it if it was unsuccessful).
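That analogy is essentially right, and it maps directly onto the code: each “die” is a Beta(a, b) belief and a “roll” is one sample from it. The one refinement is that tricking the die doesn’t simply push the next roll up or down; it re-weights the die so its rolls cluster ever more tightly around the win rate observed so far. A Python sketch (the names `roll` and `trick`, and the three dice, are mine):

```python
import random

# Each "die" is a pair of Beta parameters [a, b]; identical geometry at start.
dice = {"left": [1, 1], "middle": [1, 1], "right": [1, 1]}

def roll(die, rng):
    # "Rolling" the die = drawing one sample from its Beta(a, b) belief.
    a, b = die
    return rng.betavariate(a, b)

def trick(die, won):
    # Re-weight the die toward the observed outcome: a win raises a, a loss raises b.
    die[0 if won else 1] += 1

rng = random.Random(42)
# One round: roll all three dice, then play the arm whose die rolled highest.
rolls = {name: roll(die, rng) for name, die in dice.items()}
best = max(rolls, key=rolls.get)
```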


If the question is about how Thompson sampling works, specifically step (2) in the OP, then I can give it a shot:
The goal is to pull the best arm, but it isn’t known which arm is best. So the algorithm simulates the missing knowledge about each arm, given its model of the arm’s success probability and knowledge from past pulls. The result of the simulation (sampling) is a success parameter for each arm. It then “really” pulls only the arm with the highest success parameter.
This algorithm is beautiful in its simplicity and in the fact that it converges to the best arm near-optimally.
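Concretely, the line the OP asked about has a direct analogue. In this Python sketch (the original is Julia), `bandits` holds made-up belief counts and `sample_bandit` plays the role of the simulation step:

```python
import random

rng = random.Random(7)

# Hypothetical stand-ins for the thread's Julia objects: each "bandit" here
# is just its (a, b) belief counts, and sample_bandit draws one plausible
# success probability from that belief.
bandits = [(8, 3), (2, 9), (5, 5)]  # made-up belief states

def sample_bandit(belief):
    a, b = belief
    return rng.betavariate(a, b)

# The Python analogue of
#   _, mxidx = findmax([sample_bandit(bandit) for bandit in bandits])
samples = [sample_bandit(bandit) for bandit in bandits]
mxidx = samples.index(max(samples))  # the one arm we "really" pull this round
```

Nothing in that line touches a real arm: it says “imagine one plausible world consistent with my experience, and act as if it were true.”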


Much clearer in these terms. As observed by @uniment, the script does not clearly distinguish which part of the code simulates what one would “really” do and which part of the code is itself part of the experiment.