Bandit Environments
Bandit

class numpy_ml.bandits.bandits.Bandit(rewards, reward_probs, context=None)

oracle_payoff(context=None)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – The current context matrix for each of the bandit arms, if applicable. Default is None.
Returns: optimal_rwd (float) – The expected reward under an optimal policy.

MultinomialBandit

class numpy_ml.bandits.MultinomialBandit(payoffs, payoff_probs)
Bases: numpy_ml.bandits.bandits.Bandit
A multi-armed bandit where each arm is associated with a different multinomial payoff distribution.
Parameters:
 payoffs (ragged list of length K) – The payoff values for each of the K arms. payoffs[k][i] holds the i-th payoff value for arm k.
 payoff_probs (ragged list of length K) – The probabilities associated with each of the payoff values in payoffs. payoff_probs[k][i] holds the probability of payoff index i for arm k.

oracle_payoff(context=None)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – Unused. Default is None.
Returns:
 optimal_rwd (float) – The expected reward under an optimal policy.
 optimal_arm (float) – The arm ID with the largest expected reward.
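
The oracle computation described above can be sketched in plain Python. This is an illustration written independently of the library, not its implementation: the function name `multinomial_oracle` and the sample payoffs are assumptions. It takes each arm's expected value as the probability-weighted sum of its payoffs and picks the largest.

```python
def multinomial_oracle(payoffs, payoff_probs):
    """Return (optimal_rwd, optimal_arm) for ragged per-arm payoff lists.

    Hypothetical helper: not part of numpy_ml.
    """
    # Expected value of arm k: sum_i payoffs[k][i] * payoff_probs[k][i]
    arm_evs = [
        sum(value * prob for value, prob in zip(values, probs))
        for values, probs in zip(payoffs, payoff_probs)
    ]
    optimal_arm = max(range(len(arm_evs)), key=arm_evs.__getitem__)
    return arm_evs[optimal_arm], optimal_arm


# Two arms with different numbers of possible payoffs (hence "ragged").
payoffs = [[0.0, 10.0], [1.0, 2.0, 3.0]]
payoff_probs = [[0.9, 0.1], [0.2, 0.3, 0.5]]

optimal_rwd, optimal_arm = multinomial_oracle(payoffs, payoff_probs)
print(optimal_rwd, optimal_arm)
```

Arm 0 has the largest single payoff, but arm 1 has the higher expected value, so the oracle selects arm 1.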
BernoulliBandit

class numpy_ml.bandits.BernoulliBandit(payoff_probs)
Bases: numpy_ml.bandits.bandits.Bandit
A multi-armed bandit where each arm is associated with an independent Bernoulli payoff distribution.
Parameters: payoff_probs (list of length K) – The payoff probability for each arm. payoff_probs[k] holds the probability of payoff for arm k.

oracle_payoff(context=None)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – Unused. Default is None.
Returns:
 optimal_rwd (float) – The expected reward under an optimal policy.
 optimal_arm (float) – The arm ID with the largest expected reward.
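
With a fixed payout of 1, an arm's expected reward equals its payoff probability, so the oracle just picks the most probable arm. The following standalone sketch shows this and uses the oracle value as a baseline for measuring the regret of a uniformly random policy. The helpers `bernoulli_oracle` and `pull` are hypothetical names used for illustration, not library functions.

```python
import random


def bernoulli_oracle(payoff_probs):
    """Return (optimal_rwd, optimal_arm); the expected reward of arm k is payoff_probs[k]."""
    optimal_arm = max(range(len(payoff_probs)), key=payoff_probs.__getitem__)
    return payoff_probs[optimal_arm], optimal_arm


def pull(payoff_probs, arm, rng):
    """Simulate one pull: pay 1 with probability payoff_probs[arm], else 0."""
    return 1 if rng.random() < payoff_probs[arm] else 0


rng = random.Random(0)  # seeded for reproducibility
payoff_probs = [0.2, 0.5, 0.8]
optimal_rwd, optimal_arm = bernoulli_oracle(payoff_probs)

# Average per-step regret of a policy that picks arms uniformly at random.
rewards = [pull(payoff_probs, rng.randrange(len(payoff_probs)), rng) for _ in range(1000)]
avg_regret = optimal_rwd - sum(rewards) / len(rewards)
print(optimal_arm, optimal_rwd, avg_regret)
```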

GaussianBandit

class numpy_ml.bandits.GaussianBandit(payoff_dists, payoff_probs)
Bases: numpy_ml.bandits.bandits.Bandit
A multi-armed bandit that is similar to BernoulliBandit, but instead of each arm having a fixed payout of 1, the payoff values are sampled from independent Gaussian random variables.
Parameters:
 payoff_dists (list of 2-tuples of length K) – The parameters of the distributions over payoff values for each of the K arms. Specifically, payoff_dists[k] is a tuple of (mean, variance) for the Gaussian distribution over payoffs associated with arm k.
 payoff_probs (list of length K) – The probability of payoff for each arm. payoff_probs[k] holds the probability of payoff for arm k.

oracle_payoff(context=None)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – Unused. Default is None.
Returns:
 optimal_rwd (float) – The expected reward under an optimal policy.
 optimal_arm (float) – The arm ID with the largest expected reward.
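
One way to read the description above: arm k pays out with probability payoff_probs[k], and when it does, the amount is drawn from its Gaussian. Under that reading, and assuming an arm pays nothing otherwise (an assumption not stated in this section), the expected reward of arm k is payoff_probs[k] times its mean. The variance does not affect the expectation. The function name `gaussian_oracle` is hypothetical.

```python
def gaussian_oracle(payoff_dists, payoff_probs):
    """Return (optimal_rwd, optimal_arm) under the pay-or-nothing assumption.

    Hypothetical helper: assumes E[reward_k] = payoff_probs[k] * mean_k.
    """
    arm_evs = [prob * mean for (mean, _variance), prob in zip(payoff_dists, payoff_probs)]
    optimal_arm = max(range(len(arm_evs)), key=arm_evs.__getitem__)
    return arm_evs[optimal_arm], optimal_arm


# (mean, variance) per arm, plus the probability that each arm pays at all.
payoff_dists = [(1.0, 0.5), (4.0, 2.0), (2.0, 0.1)]
payoff_probs = [0.9, 0.3, 0.5]

optimal_rwd, optimal_arm = gaussian_oracle(payoff_dists, payoff_probs)
print(optimal_rwd, optimal_arm)
```

Here the arm that pays least often still wins, because its mean payoff is large enough to offset the low payoff probability.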
ShortestPathBandit

class numpy_ml.bandits.ShortestPathBandit(G, start_vertex, end_vertex)
Bases: numpy_ml.bandits.bandits.Bandit
A weighted graph shortest path problem formulated as a multi-armed bandit.
Notes
Each arm corresponds to a valid path through the graph from start to end vertex. The agent’s goal is to find the path that minimizes the expected sum of the weights on the edges it traverses.
Parameters: 
oracle_payoff(context=None)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – Unused. Default is None.
Returns:
 optimal_rwd (float) – The expected reward under an optimal policy.
 optimal_arm (float) – The arm ID with the largest expected reward.
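
To make the path-as-arm formulation concrete, here is a minimal standalone sketch. It makes several assumptions that this section does not confirm: the graph is a plain dict of edge weights rather than whatever type G takes, the weights are deterministic, and an arm's reward is the negative of its path weight, so the largest reward corresponds to the shortest path. The names `all_paths` and `path_weight` are hypothetical.

```python
def all_paths(graph, start, end, path=None):
    """Enumerate every simple path from start to end by depth-first search."""
    path = (path or []) + [start]
    if start == end:
        return [path]
    paths = []
    for neighbor in graph.get(start, {}):
        if neighbor not in path:  # skip vertices already visited to avoid cycles
            paths.extend(all_paths(graph, neighbor, end, path))
    return paths


def path_weight(graph, path):
    """Sum the edge weights along a path."""
    return sum(graph[u][v] for u, v in zip(path, path[1:]))


# Directed, weighted toy graph: graph[u][v] is the weight of edge u -> v.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 1.5, "d": 5.0},
    "c": {"d": 1.0},
    "d": {},
}

arms = all_paths(graph, "a", "d")  # each valid path is one arm
arm_rewards = [-path_weight(graph, p) for p in arms]  # assumed reward: negative weight
optimal_arm = max(range(len(arms)), key=arm_rewards.__getitem__)
optimal_rwd = arm_rewards[optimal_arm]
print(arms[optimal_arm], optimal_rwd)
```

Enumerating all paths is only practical for small graphs; the number of arms can grow exponentially with graph size.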

ContextualBernoulliBandit

class numpy_ml.bandits.ContextualBernoulliBandit(context_probs)
Bases: numpy_ml.bandits.bandits.Bandit
A contextual version of BernoulliBandit where each binary context feature is associated with an independent Bernoulli payoff distribution.
Parameters: context_probs (ndarray of shape (D, K)) – A matrix of the payoff probabilities associated with each of the D context features, for each of the K arms. Index (i, j) contains the probability of payoff for arm j under context i.

get_context()
Sample a random one-hot context vector. This vector will be the same for all arms.
Returns: context (ndarray of shape (D, K)) – A random D-dimensional one-hot context vector repeated for each of the K bandit arms.

oracle_payoff(context)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – The current context matrix for each of the bandit arms.
Returns:
 optimal_rwd (float) – The expected reward under an optimal policy.
 optimal_arm (float) – The arm ID with the largest expected reward.
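
The interaction between the one-hot context and context_probs can be sketched as follows. Nested lists stand in for the (D, K) arrays, and the helper names `sample_one_hot_context` and `contextual_bernoulli_oracle` are hypothetical. Because the context is one-hot and shared by all arms, it selects a single row of context_probs, and the oracle picks the arm with the highest probability in that row.

```python
import random


def sample_one_hot_context(D, K, rng):
    """Return a D x K matrix whose columns are all the same random one-hot vector."""
    active = rng.randrange(D)
    return [[1.0 if d == active else 0.0 for _ in range(K)] for d in range(D)]


def contextual_bernoulli_oracle(context, context_probs):
    """Return (optimal_rwd, optimal_arm) for the given context matrix."""
    D, K = len(context_probs), len(context_probs[0])
    # Expected payoff of arm k: sum_d context[d][k] * context_probs[d][k]
    arm_evs = [
        sum(context[d][k] * context_probs[d][k] for d in range(D)) for k in range(K)
    ]
    optimal_arm = max(range(K), key=arm_evs.__getitem__)
    return arm_evs[optimal_arm], optimal_arm


# D = 3 context features, K = 2 arms.
context_probs = [[0.1, 0.9], [0.7, 0.2], [0.5, 0.4]]
rng = random.Random(0)
context = sample_one_hot_context(3, 2, rng)
optimal_rwd, optimal_arm = contextual_bernoulli_oracle(context, context_probs)
print(optimal_rwd, optimal_arm)
```

The optimal arm depends on which context feature is active, so unlike the non-contextual bandits, the oracle must be re-evaluated for every new context.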

ContextualLinearBandit

class numpy_ml.bandits.ContextualLinearBandit(K, D, payoff_variance=1)
Bases: numpy_ml.bandits.bandits.Bandit
A contextual linear multi-armed bandit.
Notes
In a contextual linear bandit the expected payoff of an arm \(a \in \mathcal{A}\) at time t is a linear combination of its context vector \(\mathbf{x}_{t,a}\) with a coefficient vector \(\theta_a\):
\[\mathbb{E}[r_{t, a} \mid \mathbf{x}_{t, a}] = \mathbf{x}_{t,a}^\top \theta_a\]
In this implementation, the arm coefficient vectors \(\theta\) are initialized independently from a uniform distribution on the interval [-1, 1], and the specific reward at timestep t is normally distributed:
\[r_{t, a} \mid \mathbf{x}_{t, a} \sim \mathcal{N}(\mathbf{x}_{t,a}^\top \theta_a, \sigma_a^2)\]
Parameters:
get_context()
Sample the context vectors for each arm from a multivariate standard normal distribution.
Returns: context (ndarray of shape (D, K)) – A D-dimensional context vector sampled from a standard normal distribution for each of the K bandit arms.

oracle_payoff(context)
Return the expected reward for an optimal agent.
Parameters: context (ndarray of shape (D, K) or None) – The current context matrix for each of the bandit arms, if applicable. Default is None.
Returns:
 optimal_rwd (float) – The expected reward under an optimal policy.
 optimal_arm (float) – The arm ID with the largest expected reward.
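
The linear model in the notes can be illustrated with a short standalone sketch using only the standard library. It follows the description above (coefficients uniform on [-1, 1], standard-normal contexts, Gaussian rewards), but the variable layout and the helper name `linear_oracle` are assumptions made for illustration, not the library's internals.

```python
import random


def linear_oracle(context, thetas):
    """Return (optimal_rwd, optimal_arm), where E[r_a | x_a] = x_a^T theta_a.

    Both arguments are D x K nested lists; column a holds the vector for arm a.
    """
    D, K = len(thetas), len(thetas[0])
    arm_evs = [sum(context[d][a] * thetas[d][a] for d in range(D)) for a in range(K)]
    optimal_arm = max(range(K), key=arm_evs.__getitem__)
    return arm_evs[optimal_arm], optimal_arm


rng = random.Random(0)  # seeded for reproducibility
K, D, payoff_variance = 3, 4, 1.0

# One coefficient vector per arm, each entry uniform on [-1, 1].
thetas = [[rng.uniform(-1, 1) for _ in range(K)] for _ in range(D)]
# One standard-normal context vector per arm.
context = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(D)]

optimal_rwd, optimal_arm = linear_oracle(context, thetas)
# A realized reward for the optimal arm: Gaussian noise around the expected payoff.
reward = rng.gauss(optimal_rwd, payoff_variance ** 0.5)
print(optimal_arm, optimal_rwd, reward)
```

Note that `random.gauss` takes a standard deviation, so the variance is converted with a square root before sampling.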
