The hardened Machine Learning professional knows that there are three key branches of ML: supervised learning, unsupervised learning and reinforcement learning. In the latter, agents learn to translate state (and possibly other internal knowledge) into decisions, which in turn impact the state and hence influence the agent's next decision.
In a way, this is how humans operate.
The class of Multi-Armed Bandits is a simple way of looking at Reinforcement Learning. In this article, we're going to take a look at a simple form of these bandits - the A/B/n testing scenario. This is a generalization of A/B testing to multiple choices. It's simple because it does not use state - it only learns from rewards for a particular action in the past. In that way, it's also single-step; it does not look beyond the decision that is currently to be made. However, while simplicity gets you up to speed quickly, there are drawbacks too. We're going to cover everything - including a step-by-step Python example for implementing your own A/B/n test.
Are you ready? Let's take a look!
Before we write any code, let's take a look at what a Multi-Armed Bandit (MAB) problem actually is:
[A multi-armed bandit problem] is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices in a way that maximizes their expected gain, when each choice's properties are only partially known at the time of allocation, and may become better understood as time passes or by allocating resources to the choice.
Wikipedia (2005)
Quite a complex definition! Broken apart, it means that:

- We have a fixed, limited set of resources that must be allocated.
- These resources must be divided between competing (alternative) choices.
- We want to allocate them in a way that maximizes the expected gain.
- Each choice's properties are only partially known at the time of allocation, and become better understood as time passes or as we allocate resources to that choice.
We've now arrived at the exploration/exploitation dilemma that is continuously present within Reinforcement Learning. Recall from the definition that we always want to make the best possible choice. However, at the beginning, we don't even know what the best possible choice is. We must learn that! In other words, we must first explore all possible choices, until we can exploit our knowledge with a reasonable amount of certainty.
When to break off exploration in favor of exploitation - or whether to do so at all (it's also possible to mix both together in multiple ways, as we shall see in further articles about ε-greedy MABs and Thompson Sampling) - must be chosen carefully. In essence, this is a MAB problem in itself :)
In the remainder of this article, we're going to work on creating our own A/B/n test. We do so with an advertising setting, which is actually a very common setting for A/B/n tests. Suppose that we have a website (like MachineCurve) where ads must be shown. We don't know anything about the user and we don't collect any data in order to understand the user better. Still, we'd love to maximize our revenue, which can be done by maximizing the number of clicks per number of views - or the click-through rate (CTR).
This is a classic MAB problem. At the start, we simply have three ads, and we don't know anything about their performance. We'll have to explore their performance first, before we can show the best ad to our visitors, and exploit our knowledge to maximize our revenue.
Suppose that we have three such ads. To allow comparison afterwards, we model them with Binomial distributions with success probabilities (or CTRs) of 1%, 2.4% and 3%, respectively. A Binomial distribution with a single trial (in our case: showing the ad once) returns 1 (success: a click) with some probability p, and 0 (failure: no click) with probability 1 - p. In other words, the probability that someone clicks on each of our three ads is 1%, 2.4% and 3%, respectively - but we pretend not to know this up front.
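As a quick standalone illustration (not part of the scenario code that follows), you can sample such single-trial Binomial outcomes with NumPy and watch the empirical click-through rate approach the configured probability - the 0.03 below is simply the CTR of our best ad:

import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# 100,000 simulated impressions of an ad with a 3% click probability
clicks = rng.binomial(n=1, p=0.03, size=100_000)

print(clicks[:20])    # mostly zeros, with the occasional 1 (a click)
print(clicks.mean())  # empirical CTR, close to 0.03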
Let's now take a look at modeling this with Python.
Building a MAB for A/B/n testing with Python involves the following steps:

- Creating a SimpleAdvertisement class which can be shown.
- Generating three SimpleAdvertisements.
- Setting the global variables for the test.
- Running the exploration phase (the A/B/n test itself) and the exploitation phase (the best ad in production).
- Plotting the results and merging everything together.

The first step would be to import our dependencies. For today's code, you'll only rely on numpy for numbers processing and matplotlib for visualizing the outcomes. Both can be installed with pip, through pip install numpy, for example.
import numpy as np
import matplotlib.pyplot as plt
Creating the SimpleAdvertisement class

The next step is to represent the blueprint of an advertisement. If you look at what such an ad should do - it should display. Displaying should return a reward, which is either a click or no click.
Recall that we generate clicks following a Binomial distribution which returns 1 (success) with probability p. For that reason, we allow a configurable p to be passed in the constructor, which we then use to generate a reward when the advertisement is shown in show.
class SimpleAdvertisement():
    """ Representation of a simple advertisement. """

    def __init__(self, p):
        """
        Constructor. Set p value for the binomial distribution
        that models user click behavior for this advertisement.
        The p value represents the probability of a click, 0 <= p <= 1.
        """
        self.p = p

    def show(self):
        """
        Fictitiously show an advertisement. Return a reward:
        either 0 (no click) or 1 (click). Draw just once (n=1)
        and draw successfully (click) with probability p.
        """
        return np.random.binomial(n=1, p=self.p)
Let's now generate three ads and put them in a list. Note that each advertisement has a different parameter: 0.01 for ad 1, 0.024 for ad 2 and 0.03 for ad 3. Indeed, these are the CTRs that we saw above - or, in other words, the p values which represent the probability that our binomial sample returns a 1 (click).
# Generate the advertisements
advertisement_one = SimpleAdvertisement(0.01)
advertisement_two = SimpleAdvertisement(0.024)
advertisement_three = SimpleAdvertisement(0.03)
advertisements = [advertisement_one, advertisement_two, advertisement_three]
Now that you have created the advertisements, you can set the global variables for our A/B/n test. The number of tests represents the number of exploration iterations in which the best ad is chosen. The number of production runs represents the number of subsequent exploitation iterations in which we continue with the chosen ad, to find a final score. Of course, we hope that this score approximates the CTR of 0.03, which is that of our best-performing ad.

Number of ads is simply the number of advertisements created - three. The average rewards over time list is used for storing the average reward after every exploration/exploitation step, so that we can generate a plot later. The reward sum is the sum of all rewards received so far, and N_impres stores the number of impressions per advertisement.
What's left is Q_values. The Q-value, also called action value, is an important term in Reinforcement Learning problems and hence also in MAB problems. Here, a Q-value is the running average of the rewards observed for a particular action (advertisement), and is a measure of how well that choice performs. You'll see that our task involves picking the ad with the highest Q-value!
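To make this concrete before we use it, here's a minimal sketch with a made-up reward sequence, showing that the incremental update we'll apply below - Q = Q + (1/N) * (reward - Q) - is just the running average of the rewards observed for that ad:

import numpy as np

# Hypothetical reward sequence for a single ad: 1 = click, 0 = no click
rewards = [0, 0, 1, 0, 1]

q, n = 0.0, 0
for reward in rewards:
    n += 1
    q += (1 / n) * (reward - q)  # same update form as in the test below

print(q)                 # 0.4
print(np.mean(rewards))  # 0.4 - identical to the plain average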
# Set the scenario's variables
num_tests = 12500
num_prod = 50000
num_ads = len(advertisements)
average_rewards_over_time = []
N_impres = np.zeros(num_ads, dtype=int)
Q_values = np.zeros(num_ads)
reward_sum = 0
We've now arrived at the exploration phase. In this phase, we run our tests and continuously update the Q values given the reward we received after picking an advertisement, so that we can pick the best-performing advertisement later.
In the code below, this happens for every test iteration: we pick an advertisement at random, observe its reward (click or no click), update the impression counter and Q-value for that ad, and record the average reward obtained so far.
def a_b_n_test(num_test, ads):
    """ Run A/B/n testing phase. """
    global reward_sum
    # Iterate over the test range.
    for test in range(num_test):
        # Pick an advertisement at random.
        chosen_ad_idx = np.random.randint(len(ads))
        chosen_ad = ads[chosen_ad_idx]
        # Observe reward for advertisement [click = 1, no click = 0]
        reward = chosen_ad.show()
        # Increase counter for ad and Q/action value
        N_impres[chosen_ad_idx] = N_impres[chosen_ad_idx] + 1
        Q_values[chosen_ad_idx] += (1 / N_impres[chosen_ad_idx]) * (reward - Q_values[chosen_ad_idx])
        # Increase total reward
        reward_sum += reward
        average_reward_so_far = reward_sum / (test + 1)
        average_rewards_over_time.append(average_reward_so_far)
In the exploitation phase, we have selected an ad which we can now run in production. We'll actually pick the best ad from the exploration phase when merging everything together a bit more below, but let's take a look at the code for running the advertisement in production first.
def a_b_n_prod(num_prod, best_ad):
    """ Run the best ad in production. """
    global reward_sum
    # Iterate over the production range.
    for prod in range(num_prod):
        # Observe reward for advertisement [click = 1, no click = 0]
        reward = best_ad.show()
        # Increase total reward
        reward_sum += reward
        average_reward_so_far = reward_sum / (prod + num_tests + 1)
        average_rewards_over_time.append(average_reward_so_far)
As you can see, we run a number of iterations - num_prod. Using the best advertisement, we observe another reward in each iteration, with which we increase the reward sum and append the average reward so far to the list of average rewards over time.
Before we merge everything together, there's only one thing left - plotting the results.
As you can see below, using matplotlib, we plot the average rewards over time. Precisely at the cutoff point between the exploration and exploitation phases (which in our case is after num_tests = 12500 iterations), we draw a vertical line and write some text to indicate the change in behavior. Finally, we set a title.
def plot_phases(selected_ad, selected_ad_score):
    """
    Plot the reward for the exploration and exploitation phases.
    """
    plt.plot(average_rewards_over_time)
    plt.axvline(x=num_tests, linestyle='--', color='gray')  # Plot vertical line at cutoff for exploration
    plt.text(num_tests - 6000, 0, 'Exploration', color='gray')
    plt.text(num_tests + 2000, 0, 'Exploitation', color='gray')
    plt.title(f"Average reward over time - Selected ad {selected_ad + 1} (score: {selected_ad_score}) - Best ad: 3 (score: 0.03)")
    plt.show()
Next up is merging everything together. In the ad_scenario() def, you'll actually create the scenario, and run the exploration and exploitation phases. As you can see:

- You first run the a_b_n_test or exploration scenario on all advertisements.
- You then pick the best ad, the best_ad_index, and run it in production for num_prod iterations.

def ad_scenario():
"""
Run an advertisement based A/B/n Multi-Armed Bandit scenario.
Select the best out of three ads, specified above, then run
in production. The proper ad was chosen if the average reward
over time approximates the highest p value (0.03) chosen
with the advertisements at the top of this code.
"""
# 1. Run A/B/n test (exploration) on advertisements
a_b_n_test(num_tests, advertisements)
# 2. Pick best ad after testing
best_ad_index = np.argmax(Q_values)
print("="*50)
print(f"Best-performing advertisement after exploration is Ad {best_ad_index+1}")
print("="*50)
print(f"Score board:")
for i in range(len(advertisements)):
print(f"> Ad {i+1} - {Q_values[i]}")
print("="*50)
# 3. Run the ad in production
a_b_n_prod(num_prod, advertisements[best_ad_index])
# 4. Print overall ad performance
print(f"Global average reward over time: {average_rewards_over_time[-1]}")
print("="*50)
# 5. Plot the performance
plot_phases(best_ad_index, Q_values[best_ad_index])
if __name__ == '__main__':
ad_scenario()
Now, it's time to run the code. Quite quickly, you should observe the following:
After exploration, advertisement 2 was the best-performing advertisement (with a CTR of 0.0288, or 2.88%). Recall that the best ad is actually advertisement 3, with a CTR of 3%. In this case, while the CTR is close, you could have done better!
This is one of the drawbacks of a simple A/B/n test: you have to configure the length of the exploration phase yourself. It can be too long, in which case you waste precious time on exploring candidates while the best one could already have been running in production. It can also be too short - which is the case above - and then you lose money, because you settled on a suboptimal ad too quickly.
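If you want to see this trade-off in numbers, here is a rough simulation sketch. It assumes the same three CTRs as above; the wrong_pick_rate helper is introduced here purely for illustration and is not part of the scenario code:

import numpy as np

def wrong_pick_rate(num_tests, ctrs=(0.01, 0.024, 0.03), repeats=1000, seed=42):
    """Illustrative helper: estimate how often an A/B/n test of the given
    length ends up selecting an ad other than the truly best one."""
    rng = np.random.default_rng(seed)
    best = int(np.argmax(ctrs))
    wrong = 0
    for _ in range(repeats):
        # Random ad selection spreads impressions roughly evenly over the ads
        impressions = rng.multinomial(num_tests, [1 / len(ctrs)] * len(ctrs))
        clicks = rng.binomial(impressions, ctrs)
        # clicks / impressions equals the Q-value (average reward) per ad
        if int(np.argmax(clicks / impressions)) != best:
            wrong += 1
    return wrong / repeats

# Shorter exploration phases select a suboptimal ad more often
print(wrong_pick_rate(2500), wrong_pick_rate(12500), wrong_pick_rate(50000))

The exact numbers will vary per run, but the trend is what matters: longer exploration reduces the chance of picking the wrong ad, at the cost of more impressions spent exploring.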
While quick, it's also dirty, and another drawback of this method is that selected choices (in this case, our selected advertisement) can no longer be changed in the exploitation phase. Indeed, once chosen, you're stuck with a potentially underperforming choice. Finally, during exploration, you'll quickly note that some ads perform quite poorly. As you select advertisements at random during that phase, you'll keep seeing these ads come by - even though you know that they perform worse than your top candidates. With a regular A/B/n test, you cannot drop such ads, unnecessarily lengthening your exploration phase.
Methods like ε-greedy MABs and Thompson Sampling help you make better choices regarding the exploration/exploitation trade-off. We'll cover them in future articles.
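As a small preview of what those future articles will cover, the core idea behind ε-greedy action selection can be sketched in a few lines. The epsilon value and the helper function below are illustrative assumptions, not code from this article's scenario:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_pick(q_values, epsilon=0.1):
    """Illustrative sketch: with probability epsilon, explore a random ad;
    otherwise, exploit the ad with the highest Q-value so far."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore
    return int(np.argmax(q_values))              # exploit

# Example: mostly picks index 2 (the highest Q-value), but occasionally explores
print([epsilon_greedy_pick(np.array([0.010, 0.024, 0.030])) for _ in range(20)])

Unlike the hard exploration/exploitation cutoff above, such a strategy keeps exploring a little throughout, so an initially unlucky ad can still recover.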
Wikipedia. (2005, October 7). Multi-armed bandit. Wikipedia, the free encyclopedia. Retrieved October 4, 2021, from https://en.wikipedia.org/wiki/Multi-armed_bandit