Multi-Armed Bandit (k-Armed Bandit)#

[1]:
from datetime import timedelta, datetime
import time

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from IPython.display import display, clear_output

Introduction#

  • Iteratively selects one of multiple fixed choices; these choices are called arms, actions, or levers

  • The properties of each choice are only partially known at the time of allocation, but become better understood as time passes

    • Either we know them (partially) up front

    • or we learn them over time

  • These arms are independent

  • The setting can also be framed as best arm identification

  • The target is to minimize regret

  • Exploration \(\leftrightarrow\) Exploitation trade-off

Objective: Maximize the sum of rewards earned through a sequence of arm pulls. Each machine provides a random reward from a probability distribution specific to that machine (not known a priori).

Reward Models#

  • IID (Independent & Identically Distributed) Rewards : the reward from each arm is drawn independently from a fixed distribution that depends on the arm, not on the round t

  • Adversarial Rewards : rewards can be arbitrary, as they are chosen by an adversary trying to fool the algorithm; the adversary may be oblivious or adaptive

  • Constrained Adversary : rewards are chosen by an adversary that is subject to some constraint

  • Random Process Rewards : an arm's state, which determines its reward, evolves over time as a random process (state transitions)
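As a rough illustration (a sketch only, with hypothetical iid_reward / adversarial_reward helpers that are not part of this notebook's later code), IID rewards ignore the round and the history, while an adaptive adversary may react to both:

import numpy as np

def iid_reward(arm_mean, arm_sigma=1.0):
    # IID: the distribution is fixed per arm and independent of the round t
    return np.random.normal(arm_mean, arm_sigma)

def adversarial_reward(arm, history):
    # Adaptive adversary: give zero reward to whichever arm was played most so far
    most_played = max(set(history), key=history.count) if history else None
    return 0.0 if arm == most_played else 1.0

history = []
for t in range(5):
    arm = int(np.random.choice(3))
    print(t, arm, iid_reward([10, 8, 5][arm]), adversarial_reward(arm, history))
    history.append(arm)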

Feedbacks#

  • Bandit Feedback : the algorithm observes the reward only for the chosen arm

  • Full Feedback : the algorithm observes the rewards for all the arms that could have been chosen

  • Partial Feedback : some additional information is revealed along with the reward of the chosen arm (does not always amount to full feedback)
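To make the distinction concrete, here is a minimal sketch (illustration only, not the notebook's implementation; the means 10, 8, 5 mirror the Arms class defined later): the environment draws a reward for every arm each round, but what the algorithm observes depends on the feedback model.

import numpy as np

arm_means = {"arm_1": 10, "arm_2": 8, "arm_3": 5}

def play_round(chosen_arm, feedback="bandit"):
    all_rewards = {a: np.random.normal(mu, 1.0) for a, mu in arm_means.items()}
    if feedback == "bandit":
        # bandit feedback: only the chosen arm's reward is revealed
        return {chosen_arm: all_rewards[chosen_arm]}
    # full feedback: the rewards of all arms are revealed
    return all_rewards

print(play_round("arm_2", feedback="bandit"))
print(play_round("arm_2", feedback="full"))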

Contexts#

  • The algorithm may observe a context before choosing an arm

  • This enables personalized actions, for example:

    • News website : user location & demographics

    • Dynamic pricing : customer's device, location, demographics

Bayesian Priors#

The problem instance comes from a known distribution, called the Bayesian prior.
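As a hedged sketch of what that means in code (a Beta prior over Bernoulli arm means is assumed here purely for illustration; it is not used elsewhere in this notebook):

import numpy as np

K = 3
# Draw a problem instance from the prior: each arm's mean reward ~ Beta(2, 2)
arm_means = np.random.beta(2, 2, size=K)
# One round of Bernoulli rewards generated from the sampled instance
rewards_round_1 = np.random.binomial(1, arm_means)

print("sampled means  :", arm_means)
print("round-1 rewards:", rewards_round_1)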

Stochastic Bandits#

  • IID Rewards

  • \(K\) Possible Arms/ Actions

  • \(T\) Rounds

  • Maximize total Reward over T rounds

  • The algorithm observes the reward only for the selected arm (bandit feedback)

Arms : \(K\)#

  • Possible actions/arms

  • Rewards are IID

  • Chosen over \(T\) rounds

[2]:
# Each arm draws its reward from its own normal distribution, unknown to the algorithm
class Arms:
    @staticmethod
    def arm_1(mu=10, sigma=5):
        return np.random.normal(mu, sigma)

    @staticmethod
    def arm_2(mu=8, sigma=3):
        return np.random.normal(mu, sigma)

    @staticmethod
    def arm_3(mu=5, sigma=25):
        return np.random.normal(mu, sigma)


arms = ["arm_1", "arm_2", "arm_3"]
[3]:
def arms_cyclic_iter(arms):
    # Yield arms in a fixed round-robin order, forever
    i = 0
    while True:
        if i >= len(arms):
            i = 0
        yield arms[i]
        i += 1

def print_df(df, sleep_time):
    # Re-render the dataframe in place (used for the live-updating tables below)
    clear_output(wait=True)
    display(df)
    time.sleep(sleep_time)
[4]:
def random_choose_arm(arms, size=None, replace=True, p=None):
    return np.random.choice(arms, size=size, replace=replace, p=p)

Reward Distribution : \(D_a\)#

  • The reward for each action is IID: for each action \(a\) there is a distribution \(D_a\), called the reward distribution

  • Each time an action is chosen, its reward is sampled from this distribution, which is initially unknown

  • \(K\) arms, \(T\) rounds (known), \(D_a\) for arm \(a\) (unknown)

  • at \(t \in [T]\)

    • pick arm \(a_t\)

    • reward \(r_t \in [0, 1]\) is sampled from \(D_a\), \(a = a_t\)

    • collect reward \(r_t\) & observe nothing else

  • Mean reward value \(\mu \in [0, 1]\)

  • \(\mu(a) = \mathbb{E}[D_a]\) (mean reward of action \(a\) = expected value of its reward distribution)

  • Example: Bernoulli distribution (a bounded sketch follows the plot below)

  • The example reward distributions plotted below are not bounded in [0, 1], but they show different arms with different reward distributions

[5]:
fig, ax = plt.subplots(1, 1, figsize=(8, 3))

params = {
    "kde": True,
    "element": "poly",
    "ax": ax,
}
sns.histplot([Arms.arm_1() for _ in range(10000)], label="arm 1", **params)
sns.histplot([Arms.arm_2() for _ in range(10000)], label="arm 2", **params)
sns.histplot([Arms.arm_3() for _ in range(10000)], label="arm 3", **params)

fig.suptitle("Reward Distribution")
fig.legend()
fig.show()
../_images/RL_mab_11_0.png
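A reward distribution that is actually bounded in \([0, 1]\), as assumed in the setup above, is the Bernoulli distribution mentioned in the list; here is a minimal sketch (the bernoulli_arm helper is hypothetical and separate from the Arms class):

import numpy as np

def bernoulli_arm(mu):
    # reward is 1 with probability mu and 0 otherwise, so r_t is always in [0, 1]
    return float(np.random.random() < mu)

samples = [bernoulli_arm(0.7) for _ in range(10_000)]
print("empirical mean:", np.mean(samples))  # close to mu(a) = 0.7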

Best Mean Reward/ Optimum Reward : \(\mu^*\)#

\begin{align} \mu(a) &= \mathbb{E}[D_a] \\ \text{Optimum Reward: } \mu^* &= \max_{a \in \mathbb{A}} \mu(a) \end{align}

Gap in Arm : \(\Delta(a)\)#

The gap of arm \(a\) is defined as the difference between the best mean reward and the mean reward of arm \(a\):

\begin{align} \Delta(a) &:= \mu^* - \mu(a) \end{align}

Best Arm/ Optimal Arm : \(a^*\)#

The optimal arm is the arm with the best mean reward:

\begin{align} a^* &:= \arg\max_{a \in \mathbb{A}} \mu(a), \quad \mu(a^*) = \mu^* \Rightarrow \Delta(a^*) = 0 \end{align}
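Plugging in the true means hard-coded in the Arms class above (10, 8 and 5; a real bandit algorithm would not know these), a quick check of \(\mu^*\), \(\Delta(a)\) and \(a^*\) looks like this:

true_means = {"arm_1": 10, "arm_2": 8, "arm_3": 5}  # generating means from the Arms class

best_mean_reward = max(true_means.values())                        # mu* = 10
gaps = {a: best_mean_reward - mu for a, mu in true_means.items()}  # Delta(a): 0, 2, 5
best_arm = max(true_means, key=true_means.get)                     # a* = "arm_1"

print(best_mean_reward, gaps, best_arm)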

[6]:
def optimal_arm(df, best_mean_reward):
    return df[df["reward_mean"] == best_mean_reward]["arm"].iloc[0]

Regret : \(R(T)\)#

  • Compare the cumulative reward (the sum of the mean rewards of the arms played up to round \(T\)) with the best-arm benchmark \(\mu^* \cdot T\), i.e. what we would have collected by always playing the optimal arm:

\begin{align} \text{Regret at round } T: \quad R(T) &= \mu^* \cdot T - \sum_{t=1}^{T} \mu(a_t) \end{align}

[7]:
def calculate_regret(cumulative_reward, best_mean_reward, T):
    best_arm_benchmark = best_mean_reward * T
    regret = best_arm_benchmark - cumulative_reward
    return regret
[8]:
def get_metrics(rewardsdf: pd.DataFrame, verbose=True):
    aggdf = (
        rewardsdf.groupby(by=["arm", "decision"], as_index=False)
        .agg(
            ts_count=("ts", "count"),
            ts_min=("ts", "min"),
            reward_min=("reward", "min"),
            reward_mean=("reward", "mean"),
            reward_max=("reward", "max"),
            reward_sum=("reward", "sum"),
            reward_std=("reward", "std"),
        )
        .sort_values(by=["ts_min"])
    )

    best_mean_reward = aggdf["reward_mean"].max()
    cumulative_reward = aggdf["reward_sum"].sum()
    T = aggdf["ts_count"].sum()

    optimum_reward = best_mean_reward * T

    best_arm = optimal_arm(aggdf, best_mean_reward)

    regret = calculate_regret(
        cumulative_reward=cumulative_reward, best_mean_reward=best_mean_reward, T=T
    )
    if verbose:
        print_df(aggdf, 0.2)
        print(
            f"""
        Rounds              : {T}
        Optimum Reward      : {optimum_reward}
        Cumulative Reward   : {cumulative_reward}
        Regret              : {regret}
        """
        )
    return (best_arm, aggdf, optimum_reward, cumulative_reward, regret)
[9]:
def plot_reward(rewarddf, x="ts"):
    rewarddf["decision-arm"] = rewarddf["decision"] + "-" + rewarddf["arm"]
    fig, ax = plt.subplots(2, 1, figsize=(15, 8))

    line_plot_params = {
        "alpha": 0.6,
        "color": "grey",
        "linewidth": 0.5,
    }

    sns.lineplot(data=rewarddf, x=x, y="reward", ax=ax[0], **line_plot_params)
    sns.scatterplot(data=rewarddf, x=x, y="reward", hue="decision-arm", ax=ax[0])

    sns.lineplot(data=rewarddf, x=x, y="reward_cumsum", ax=ax[1], **line_plot_params)
    sns.scatterplot(
        data=rewarddf, x=x, y="reward_cumsum", hue="decision-arm", ax=ax[1]
    )

    fig.autofmt_xdate(rotation=45)
    fig.tight_layout()
    return fig

Model#

The code below is a rudimentary implementation of how a MAB model collects reward data and tracks per-arm statistics.

[10]:
class Model:
    def __init__(self, arms) -> None:
        self.ts_idx = datetime(2024, 1, 1, 0, 0, 0)
        self.timedelta = timedelta(minutes=30)
        self.seq_idx = 0
        self.decision_idx = 0
        self.rewards = pd.DataFrame(
            columns=["decision_idx", "decision", "seq", "ts", "arm", "reward"]
        )
        self.arms = arms
        self.chosen_arm = None

    def get_decision_idx(self, decision):
        # Start a new decision index whenever the decision type (explore/exploit) changes
        if (
            len(self.rewards) == 0
            or self.rewards["decision"].iloc[len(self.rewards) - 1] != decision
        ):
            self.decision_idx += 1
        return self.decision_idx

    def get_arm(self):
        # Lazily pick the arm with the best observed mean reward so far
        if self.chosen_arm is None:
            if len(self.rewards) == 0:
                raise ValueError("First run explore to get some data")
            self.chosen_arm = get_metrics(self.rewards, verbose=False)[0]
        return self.chosen_arm

    def update_rewards(self, reward_payload):
        self.rewards.loc[len(self.rewards)] = reward_payload
        print_df(self.rewardsdf, 0.2)

    def explore(self, enforce_arm=None):
        # Pull a random arm (or a forced one) and reset the currently chosen arm
        self.chosen_arm = None
        explore_arm = enforce_arm or random_choose_arm(self.arms)
        reward = getattr(Arms, explore_arm)()

        self.ts_idx += self.timedelta
        self.seq_idx += 1

        decision = "explore"
        decision_idx = self.get_decision_idx(decision)
        self.update_rewards(
            [decision_idx, decision, self.seq_idx, self.ts_idx, explore_arm, reward]
        )

    def exploit(self):
        # Pull the best arm observed so far
        chosen_arm = self.get_arm()
        reward = getattr(Arms, chosen_arm)()

        self.ts_idx += self.timedelta
        self.seq_idx += 1

        decision = "exploit"
        decision_idx = self.get_decision_idx(decision)
        self.update_rewards(
            [decision_idx, decision, self.seq_idx, self.ts_idx, chosen_arm, reward]
        )

    @property
    def rewardsdf(self):
        df = self.rewards.copy()
        df.insert(len(df.columns), column="reward_cumsum", value=df["reward"].cumsum())
        return df

Only Exploration#

Algorithm#

  1. \(K\) arms, \(T\) rounds

  2. In each round \(t \in [T]\)

    1. Algorithm picks an arm \(a_t\) (stochastic or sequential selection)

    2. Reward \(r_t\) is sampled/observed from distribution \(D_a\) where \(a = a_t\)

    3. Algorithm collects reward \(r_t\) and observes nothing else (bandit feedback)

Stochastic#

[11]:
model = Model(arms)

for _ in range(100):
    model.explore()
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_3 -11.788321 -11.788321
1 1 explore 2 2024-01-01 01:00:00 arm_2 12.088685 0.300363
2 1 explore 3 2024-01-01 01:30:00 arm_1 9.827448 10.127812
3 1 explore 4 2024-01-01 02:00:00 arm_2 10.922434 21.050246
4 1 explore 5 2024-01-01 02:30:00 arm_1 8.885288 29.935534
... ... ... ... ... ... ... ...
95 1 explore 96 2024-01-03 00:00:00 arm_3 -1.721018 854.314202
96 1 explore 97 2024-01-03 00:30:00 arm_3 6.783670 861.097873
97 1 explore 98 2024-01-03 01:00:00 arm_3 -39.009226 822.088647
98 1 explore 99 2024-01-03 01:30:00 arm_2 10.385623 832.474270
99 1 explore 100 2024-01-03 02:00:00 arm_1 14.332681 846.806951

100 rows × 7 columns

[12]:
explore_rewarddf = model.rewardsdf
get_metrics(explore_rewarddf)
plot_reward(explore_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
2 arm_3 explore 40 2024-01-01 00:30:00 -40.480190 8.119206 46.224011 324.768249 23.625556
1 arm_2 explore 37 2024-01-01 01:00:00 1.412995 8.095527 13.555356 299.534482 3.103868
0 arm_1 explore 23 2024-01-01 01:30:00 3.240814 9.674097 18.647836 222.504220 4.188599

        Rounds              : 100
        Optimum Reward      : 967.4096520112096
        Cumulative Reward   : 846.806950843292
        Regret              : 120.60270116791753

../_images/RL_mab_25_2.png

Sequential#

[13]:
arms_iterator = arms_cyclic_iter(arms)
model = Model(arms)

for _ in range(100):
    model.explore(next(arms_iterator))
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_1 14.717441 14.717441
1 1 explore 2 2024-01-01 01:00:00 arm_2 10.151536 24.868977
2 1 explore 3 2024-01-01 01:30:00 arm_3 34.513253 59.382230
3 1 explore 4 2024-01-01 02:00:00 arm_1 7.857376 67.239606
4 1 explore 5 2024-01-01 02:30:00 arm_2 9.939486 77.179092
... ... ... ... ... ... ... ...
95 1 explore 96 2024-01-03 00:00:00 arm_3 -0.086446 699.868702
96 1 explore 97 2024-01-03 00:30:00 arm_1 11.695269 711.563971
97 1 explore 98 2024-01-03 01:00:00 arm_2 3.749240 715.313211
98 1 explore 99 2024-01-03 01:30:00 arm_3 -5.496441 709.816770
99 1 explore 100 2024-01-03 02:00:00 arm_1 16.264725 726.081496

100 rows × 7 columns

[14]:
explore_rewarddf = model.rewardsdf
get_metrics(explore_rewarddf)
plot_reward(explore_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
0 arm_1 explore 34 2024-01-01 00:30:00 -0.531654 10.507842 19.903238 357.266638 4.991634
1 arm_2 explore 33 2024-01-01 01:00:00 0.727823 8.089693 14.239782 266.959881 3.288865
2 arm_3 explore 33 2024-01-01 01:30:00 -47.006049 3.086514 54.848921 101.854976 25.057924

        Rounds              : 100
        Optimum Reward      : 1050.784229925647
        Cumulative Reward   : 726.0814956762381
        Regret              : 324.7027342494089

../_images/RL_mab_28_2.png

Uniform Exploration#

Algorithm#

  1. For \(K\) arms, \(T\) rounds

  2. Exploration phase : try each arm \(N\) times

  3. Select the arm \(\hat{a}\) with the highest observed mean reward

  4. Exploitation phase : use \(\hat{a}\) for the rest of the rounds
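The cell below uses \(N = 1\) (each arm is tried once before exploiting). A slightly more general sketch of the same idea, reusing the notebook's Model and arms_cyclic_iter and adding a hypothetical n_explore_per_arm parameter, could look like this:

def uniform_exploration(arms, T=100, n_explore_per_arm=1):
    # Explore-first: pull every arm n_explore_per_arm times,
    # then Model.exploit() sticks with the best observed mean reward.
    model = Model(arms)
    arm_iter = arms_cyclic_iter(arms)
    n_explore = n_explore_per_arm * len(arms)
    for _ in range(n_explore):
        model.explore(next(arm_iter))
    for _ in range(T - n_explore):
        model.exploit()
    return model

A larger n_explore_per_arm gives better per-arm estimates at the cost of spending more of the \(T\) rounds on exploration.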

[15]:
arms_iterator = arms_cyclic_iter(arms)

model = Model(arms)

for _ in range(3):
    model.explore(next(arms_iterator))

for i in range(97):
    model.exploit()
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_1 5.879398 5.879398
1 1 explore 2 2024-01-01 01:00:00 arm_2 8.022709 13.902108
2 1 explore 3 2024-01-01 01:30:00 arm_3 4.994972 18.897080
3 2 exploit 4 2024-01-01 02:00:00 arm_2 6.281499 25.178578
4 2 exploit 5 2024-01-01 02:30:00 arm_2 8.768576 33.947154
... ... ... ... ... ... ... ...
95 2 exploit 96 2024-01-03 00:00:00 arm_2 5.323829 806.060974
96 2 exploit 97 2024-01-03 00:30:00 arm_2 10.659535 816.720509
97 2 exploit 98 2024-01-03 01:00:00 arm_2 5.718620 822.439129
98 2 exploit 99 2024-01-03 01:30:00 arm_2 6.567910 829.007039
99 2 exploit 100 2024-01-03 02:00:00 arm_2 9.989318 838.996357

100 rows × 7 columns

[16]:
exploit_rewarddf = model.rewardsdf
get_metrics(exploit_rewarddf)
plot_reward(exploit_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
0 arm_1 explore 1 2024-01-01 00:30:00 5.879398 5.879398 5.879398 5.879398 NaN
2 arm_2 explore 1 2024-01-01 01:00:00 8.022709 8.022709 8.022709 8.022709 NaN
3 arm_3 explore 1 2024-01-01 01:30:00 4.994972 4.994972 4.994972 4.994972 NaN
1 arm_2 exploit 97 2024-01-01 02:00:00 -1.311294 8.454632 16.735225 820.099277 3.371759

        Rounds              : 100
        Optimum Reward      : 845.4631722542013
        Cumulative Reward   : 838.9963569129461
        Regret              : 6.466815341255142

../_images/RL_mab_31_2.png

Selective Exploitation#

To see how the choice of exploited arm affects regret, the runs below force the model to exploit one specific arm after the initial three exploration rounds.

Arm 1#

[17]:
model = Model(arms)

for _ in range(3):
    model.explore(next(arms_iterator))

model.chosen_arm = "arm_1"

for i in range(97):
    model.exploit()
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_1 1.584226 1.584226
1 1 explore 2 2024-01-01 01:00:00 arm_2 4.575686 6.159912
2 1 explore 3 2024-01-01 01:30:00 arm_3 -46.110331 -39.950419
3 2 exploit 4 2024-01-01 02:00:00 arm_1 8.441025 -31.509394
4 2 exploit 5 2024-01-01 02:30:00 arm_1 16.706485 -14.802909
... ... ... ... ... ... ... ...
95 2 exploit 96 2024-01-03 00:00:00 arm_1 -0.027129 887.467501
96 2 exploit 97 2024-01-03 00:30:00 arm_1 8.294347 895.761848
97 2 exploit 98 2024-01-03 01:00:00 arm_1 13.948062 909.709910
98 2 exploit 99 2024-01-03 01:30:00 arm_1 8.986954 918.696864
99 2 exploit 100 2024-01-03 02:00:00 arm_1 12.077487 930.774351

100 rows × 7 columns

[18]:
exploit_rewarddf = model.rewardsdf
get_metrics(exploit_rewarddf)
plot_reward(exploit_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
1 arm_1 explore 1 2024-01-01 00:30:00 1.584226 1.584226 1.584226 1.584226 NaN
2 arm_2 explore 1 2024-01-01 01:00:00 4.575686 4.575686 4.575686 4.575686 NaN
3 arm_3 explore 1 2024-01-01 01:30:00 -46.110331 -46.110331 -46.110331 -46.110331 NaN
0 arm_1 exploit 97 2024-01-01 02:00:00 -0.276465 10.007472 23.195923 970.724770 5.213823

        Rounds              : 100
        Optimum Reward      : 1000.7471851949623
        Cumulative Reward   : 930.7743505778576
        Regret              : 69.9728346171047

../_images/RL_mab_35_2.png

Arm 2#

[19]:
model = Model(arms)

for _ in range(3):
    model.explore(next(arms_iterator))

model.chosen_arm = "arm_2"

for i in range(97):
    model.exploit()
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_1 14.293605 14.293605
1 1 explore 2 2024-01-01 01:00:00 arm_2 5.017779 19.311384
2 1 explore 3 2024-01-01 01:30:00 arm_3 -16.783069 2.528316
3 2 exploit 4 2024-01-01 02:00:00 arm_2 4.583187 7.111502
4 2 exploit 5 2024-01-01 02:30:00 arm_2 5.678105 12.789607
... ... ... ... ... ... ... ...
95 2 exploit 96 2024-01-03 00:00:00 arm_2 8.846101 788.602028
96 2 exploit 97 2024-01-03 00:30:00 arm_2 3.746286 792.348314
97 2 exploit 98 2024-01-03 01:00:00 arm_2 2.557700 794.906013
98 2 exploit 99 2024-01-03 01:30:00 arm_2 5.727292 800.633305
99 2 exploit 100 2024-01-03 02:00:00 arm_2 9.785695 810.419000

100 rows × 7 columns

[20]:
exploit_rewarddf = model.rewardsdf
get_metrics(exploit_rewarddf)
plot_reward(exploit_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
0 arm_1 explore 1 2024-01-01 00:30:00 14.293605 14.293605 14.293605 14.293605 NaN
2 arm_2 explore 1 2024-01-01 01:00:00 5.017779 5.017779 5.017779 5.017779 NaN
3 arm_3 explore 1 2024-01-01 01:30:00 -16.783069 -16.783069 -16.783069 -16.783069 NaN
1 arm_2 exploit 97 2024-01-01 02:00:00 0.325208 8.328770 16.413429 807.890684 3.349533

        Rounds              : 100
        Optimum Reward      : 1429.3605254607367
        Cumulative Reward   : 810.4189997093333
        Regret              : 618.9415257514033

../_images/RL_mab_38_2.png

Arm 3#

[21]:
model = Model(arms)

for _ in range(3):
    model.explore(next(arms_iterator))

model.chosen_arm = "arm_3"

for i in range(97):
    model.exploit()
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_1 9.701172 9.701172
1 1 explore 2 2024-01-01 01:00:00 arm_2 8.269244 17.970416
2 1 explore 3 2024-01-01 01:30:00 arm_3 31.149943 49.120360
3 2 exploit 4 2024-01-01 02:00:00 arm_3 -54.441744 -5.321384
4 2 exploit 5 2024-01-01 02:30:00 arm_3 -24.978085 -30.299469
... ... ... ... ... ... ... ...
95 2 exploit 96 2024-01-03 00:00:00 arm_3 11.065099 835.878021
96 2 exploit 97 2024-01-03 00:30:00 arm_3 -15.780444 820.097577
97 2 exploit 98 2024-01-03 01:00:00 arm_3 5.832560 825.930137
98 2 exploit 99 2024-01-03 01:30:00 arm_3 -19.050207 806.879931
99 2 exploit 100 2024-01-03 02:00:00 arm_3 16.246434 823.126365

100 rows × 7 columns

[22]:
exploit_rewarddf = model.rewardsdf
get_metrics(exploit_rewarddf)
plot_reward(exploit_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
0 arm_1 explore 1 2024-01-01 00:30:00 9.701172 9.701172 9.701172 9.701172 NaN
1 arm_2 explore 1 2024-01-01 01:00:00 8.269244 8.269244 8.269244 8.269244 NaN
3 arm_3 explore 1 2024-01-01 01:30:00 31.149943 31.149943 31.149943 31.149943 NaN
2 arm_3 exploit 97 2024-01-01 02:00:00 -54.441744 7.979443 57.080189 774.006005 23.401965

        Rounds              : 100
        Optimum Reward      : 3114.9943150453537
        Cumulative Reward   : 823.1263646215909
        Regret              : 2291.8679504237625

../_images/RL_mab_41_2.png

Alternating Exploration & Exploitation#

Alternate between fixed-length exploration and exploitation phases, controlled by explore_iters and exploit_iters.

[23]:
def explore_exploit(
    n_iterations=100,
    explore_iters=10,
    exploit_iters=10,
    arms=arms,
):

    model = Model(arms)
    run_explore = True

    explore_counter, exploit_counter = 0, 0

    for i in range(n_iterations):
        if run_explore:
            model.explore()
            explore_counter += 1
        else:
            model.exploit()
            exploit_counter += 1

        if explore_counter == explore_iters:
            run_explore = False
            explore_counter = 0

        if exploit_counter == exploit_iters:
            run_explore = True
            exploit_counter = 0

    return model

Equally Explore & Exploit#

[24]:
model = explore_exploit(
    n_iterations=100,
    explore_iters=10,
    exploit_iters=10,
    arms=["arm_1", "arm_2", "arm_3"],
)
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_3 -39.184432 -39.184432
1 1 explore 2 2024-01-01 01:00:00 arm_3 -2.250016 -41.434448
2 1 explore 3 2024-01-01 01:30:00 arm_3 8.326022 -33.108426
3 1 explore 4 2024-01-01 02:00:00 arm_3 15.703394 -17.405032
4 1 explore 5 2024-01-01 02:30:00 arm_1 10.646341 -6.758691
... ... ... ... ... ... ... ...
95 10 exploit 96 2024-01-03 00:00:00 arm_1 9.047375 786.614477
96 10 exploit 97 2024-01-03 00:30:00 arm_1 3.454034 790.068511
97 10 exploit 98 2024-01-03 01:00:00 arm_1 7.935937 798.004448
98 10 exploit 99 2024-01-03 01:30:00 arm_1 5.877367 803.881815
99 10 exploit 100 2024-01-03 02:00:00 arm_1 18.039669 821.921484

100 rows × 7 columns

[25]:
experiment_rewarddf = model.rewardsdf
get_metrics(experiment_rewarddf)
plot_reward(experiment_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
3 arm_3 explore 20 2024-01-01 00:30:00 -62.780166 2.378776 47.756459 47.575522 24.297217
1 arm_1 explore 19 2024-01-01 02:30:00 1.944325 10.068205 18.025927 191.295897 4.819601
2 arm_2 explore 11 2024-01-01 03:00:00 2.590968 7.912387 14.106835 87.036258 4.005542
0 arm_1 exploit 50 2024-01-01 05:30:00 3.263505 9.920276 21.659107 496.013807 4.229813

        Rounds              : 100
        Optimum Reward      : 1006.8205081327129
        Cumulative Reward   : 821.9214841863763
        Regret              : 184.89902394633657

../_images/RL_mab_46_2.png

Less Explore & More Exploit#

[26]:
model = explore_exploit(
    n_iterations=100,
    explore_iters=5,
    exploit_iters=10,
    arms=arms,
)
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_2 9.242226 9.242226
1 1 explore 2 2024-01-01 01:00:00 arm_3 6.834996 16.077222
2 1 explore 3 2024-01-01 01:30:00 arm_2 3.984192 20.061414
3 1 explore 4 2024-01-01 02:00:00 arm_1 7.456886 27.518300
4 1 explore 5 2024-01-01 02:30:00 arm_1 3.323603 30.841903
... ... ... ... ... ... ... ...
95 14 exploit 96 2024-01-03 00:00:00 arm_2 2.434778 681.921263
96 14 exploit 97 2024-01-03 00:30:00 arm_2 6.428046 688.349309
97 14 exploit 98 2024-01-03 01:00:00 arm_2 8.146294 696.495603
98 14 exploit 99 2024-01-03 01:30:00 arm_2 6.032236 702.527839
99 14 exploit 100 2024-01-03 02:00:00 arm_2 10.112191 712.640030

100 rows × 7 columns

[27]:
experiment_rewarddf = model.rewardsdf
get_metrics(experiment_rewarddf)
plot_reward(experiment_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
2 arm_2 explore 16 2024-01-01 00:30:00 3.321153 9.472455 12.793415 151.559277 2.765254
4 arm_3 explore 11 2024-01-01 01:00:00 -11.956004 5.966866 38.700383 65.635525 15.065037
0 arm_1 explore 8 2024-01-01 02:00:00 2.946490 8.701732 19.989582 69.613853 5.662205
3 arm_3 exploit 10 2024-01-01 03:00:00 -34.573848 -1.453645 37.609636 -14.536448 24.937441
1 arm_2 exploit 55 2024-01-01 10:30:00 1.818452 8.006688 14.404394 440.367822 3.353643

        Rounds              : 100
        Optimum Reward      : 947.2454819980799
        Cumulative Reward   : 712.6400299643165
        Regret              : 234.6054520337634

../_images/RL_mab_49_2.png

Epsilon (\(\epsilon\)) Greedy#

Algorithm#

  1. For \(K\) arms, \(T\) rounds

  2. For each round \(t \in [T]\)

    1. Toss a coin with success probability \(\epsilon_t\)

      1. If success, explore : choose an arm uniformly at random

      2. If failure, exploit : choose the arm with the highest average reward so far

[28]:
def epsilon_greedy(n_iterations=100, epsilon=0.2, arms=arms):

    model = Model(arms)

    model.explore()

    for i in range(n_iterations):

        p = np.random.random()

        if p < epsilon:
            model.explore()
        else:
            model.exploit()

    return model
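Note that the algorithm above allows a round-dependent \(\epsilon_t\), while epsilon_greedy uses a fixed \(\epsilon\). A minimal sketch of a decaying schedule (the \(t^{-1/3}\) form is an assumption for illustration, not something used elsewhere in this notebook):

def epsilon_greedy_decaying(n_iterations=100, arms=arms):
    # Same structure as epsilon_greedy above, but epsilon_t shrinks with t,
    # so the algorithm explores less as its reward estimates improve.
    model = Model(arms)
    model.explore()  # seed with at least one observation

    for t in range(1, n_iterations + 1):
        epsilon_t = min(1.0, t ** (-1 / 3))
        if np.random.random() < epsilon_t:
            model.explore()
        else:
            model.exploit()

    return model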

\(\epsilon\) = 0.1#

[29]:
model = epsilon_greedy(n_iterations=100, epsilon=0.1, arms=arms)
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_1 13.164255 13.164255
1 1 explore 2 2024-01-01 01:00:00 arm_1 7.513586 20.677841
2 2 exploit 3 2024-01-01 01:30:00 arm_1 14.617776 35.295617
3 2 exploit 4 2024-01-01 02:00:00 arm_1 5.952256 41.247873
4 2 exploit 5 2024-01-01 02:30:00 arm_1 3.362488 44.610361
... ... ... ... ... ... ... ...
96 18 exploit 97 2024-01-03 00:30:00 arm_1 13.497176 923.444317
97 18 exploit 98 2024-01-03 01:00:00 arm_1 9.582637 933.026954
98 18 exploit 99 2024-01-03 01:30:00 arm_1 1.226871 934.253825
99 18 exploit 100 2024-01-03 02:00:00 arm_1 25.016123 959.269948
100 18 exploit 101 2024-01-03 02:30:00 arm_1 9.758156 969.028104

101 rows × 7 columns

[30]:
experiment_rewarddf = model.rewardsdf
get_metrics(experiment_rewarddf)
plot_reward(experiment_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
1 arm_1 explore 5 2024-01-01 00:30:00 7.513586 11.000555 13.164255 55.002775 2.568003
0 arm_1 exploit 33 2024-01-01 01:30:00 1.226871 11.041362 25.016123 364.364938 4.833072
4 arm_3 explore 2 2024-01-01 07:30:00 -3.164411 10.910349 24.985109 21.820698 19.904717
3 arm_3 exploit 58 2024-01-01 08:00:00 -38.117554 8.676521 56.969118 503.238221 24.102827
2 arm_2 explore 3 2024-01-01 10:30:00 8.029772 8.200490 8.305829 24.601471 0.149190

        Rounds              : 101
        Optimum Reward      : 1115.1775375160755
        Cumulative Reward   : 969.0281035361436
        Regret              : 146.14943397993193

../_images/RL_mab_54_2.png

\(\epsilon\) = 0.2#

[31]:
model = epsilon_greedy(n_iterations=100, epsilon=0.2, arms=arms)
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_2 9.872633 9.872633
1 2 exploit 2 2024-01-01 01:00:00 arm_2 5.260020 15.132653
2 2 exploit 3 2024-01-01 01:30:00 arm_2 10.135395 25.268048
3 2 exploit 4 2024-01-01 02:00:00 arm_2 8.162236 33.430284
4 2 exploit 5 2024-01-01 02:30:00 arm_2 8.573806 42.004090
... ... ... ... ... ... ... ...
96 29 explore 97 2024-01-03 00:30:00 arm_1 9.712263 655.099951
97 30 exploit 98 2024-01-03 01:00:00 arm_3 -1.338992 653.760959
98 30 exploit 99 2024-01-03 01:30:00 arm_3 -4.951924 648.809035
99 30 exploit 100 2024-01-03 02:00:00 arm_3 40.488164 689.297199
100 30 exploit 101 2024-01-03 02:30:00 arm_3 33.910435 723.207635

101 rows × 7 columns

[32]:
experiment_rewarddf = model.rewardsdf
get_metrics(experiment_rewarddf)
plot_reward(experiment_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
2 arm_2 explore 10 2024-01-01 00:30:00 3.743866 8.093269 13.298554 80.932688 2.887230
1 arm_2 exploit 8 2024-01-01 01:00:00 4.105006 7.720173 12.241799 61.761386 2.673941
4 arm_3 explore 3 2024-01-01 05:00:00 12.307630 21.670430 31.019323 65.011289 9.355854
3 arm_3 exploit 77 2024-01-01 05:30:00 -44.760957 6.359283 54.824160 489.664770 26.274692
0 arm_1 explore 3 2024-01-01 06:30:00 7.971751 8.612500 9.712263 25.837501 0.956747

        Rounds              : 101
        Optimum Reward      : 2188.713387428091
        Cumulative Reward   : 723.2076348563972
        Regret              : 1465.5057525716938

../_images/RL_mab_57_2.png

\(\epsilon\) = 0.4#

[33]:
model = epsilon_greedy(n_iterations=100, epsilon=0.4, arms=arms)
decision_idx decision seq ts arm reward reward_cumsum
0 1 explore 1 2024-01-01 00:30:00 arm_3 24.506117 24.506117
1 2 exploit 2 2024-01-01 01:00:00 arm_3 11.573501 36.079618
2 3 explore 3 2024-01-01 01:30:00 arm_2 7.527144 43.606761
3 4 exploit 4 2024-01-01 02:00:00 arm_3 47.611542 91.218303
4 5 explore 5 2024-01-01 02:30:00 arm_1 4.784452 96.002755
... ... ... ... ... ... ... ...
96 52 exploit 97 2024-01-03 00:30:00 arm_3 4.865530 563.582487
97 52 exploit 98 2024-01-03 01:00:00 arm_3 -20.385035 543.197452
98 53 explore 99 2024-01-03 01:30:00 arm_2 11.271790 554.469242
99 54 exploit 100 2024-01-03 02:00:00 arm_3 36.958096 591.427338
100 54 exploit 101 2024-01-03 02:30:00 arm_3 -8.002698 583.424640

101 rows × 7 columns

[34]:
experiment_rewarddf = model.rewardsdf
get_metrics(experiment_rewarddf)
plot_reward(experiment_rewarddf);
arm decision ts_count ts_min reward_min reward_mean reward_max reward_sum reward_std
3 arm_3 explore 12 2024-01-01 00:30:00 -54.355821 18.335123 65.663022 220.021473 32.191539
2 arm_3 exploit 67 2024-01-01 01:00:00 -44.863765 2.845487 74.965554 190.647654 25.756657
1 arm_2 explore 12 2024-01-01 01:30:00 2.282932 7.773443 11.271790 93.281310 2.549957
0 arm_1 explore 10 2024-01-01 02:30:00 3.860798 7.947420 14.250761 79.474202 3.907964

        Rounds              : 101
        Optimum Reward      : 1851.8474016216946
        Cumulative Reward   : 583.4246400457904
        Regret              : 1268.4227615759041

../_images/RL_mab_60_2.png