In order to use `cram_bandit()`, users must supply a matrix of action selection probabilities $\pi_t(X_j, A_j)$ for each combination of policy update $t$ and context $j$ in the historical dataset. While some environments log these probabilities directly, many contextual bandit libraries (such as `contextual`) only store policy parameters (e.g., regression coefficients) without explicit probability tracking. This article explains how the Cram Bandit Helpers reconstruct $\pi_t(X_j, A_j)$ from these parameters for common policies:
| Policy Type | Class Name |
|---|---|
| Epsilon-Greedy | `BatchContextualEpsilonGreedyPolicy` |
| LinUCB Disjoint with ε-greedy exploration | `BatchLinUCBDisjointPolicyEpsilon` |
| Thompson Sampling | `BatchContextualLinTSPolicy` |
Both theoretical formulas and practical code snippets are provided.
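To make the required input concrete, the toy sketch below builds such a probability matrix by hand. The orientation (rows indexing contexts $j$, columns indexing policy updates $t$) and the uniform values are illustrative assumptions only; check the `cram_bandit()` documentation for the exact format it expects.

```r
# Toy probability matrix: entry [j, t] plays the role of pi_t(X_j, A_j),
# the probability that the policy after update t assigns to the arm A_j
# actually chosen in context X_j. Orientation is assumed for illustration.
n_contexts <- 5    # historical contexts j
n_updates  <- 3    # policy updates t
K          <- 4    # number of arms

# A purely uniform policy would give every entry the value 1 / K.
pi <- matrix(1 / K, nrow = n_contexts, ncol = n_updates)
pi
```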
When using linear bandit algorithms like Epsilon-Greedy, LinUCB, or Thompson Sampling, each arm $k$ maintains summary statistics (parameters) to estimate the expected reward:

$A_k$ is the Gram matrix:

$$A_k = X_k^T X_k$$

where $X_k$ is the matrix of feature vectors (contexts) for all rounds where arm $k$ was selected.

➔ Interpretation: $A_k$ captures the amount of information (and correlation structure) about the features for arm $k$. It plays the role of a “feature covariance matrix.”

$b_k$ is the response vector:

$$b_k = X_k^T y_k$$

where $y_k$ are the observed rewards for arm $k$.

➔ Interpretation: $b_k$ captures the relationship between the observed rewards and the contexts for arm $k$.
These sufficient statistics allow the policy to compute the Least Squares estimate for the reward model:

$$\theta_k = A_k^{-1} b_k$$

The policy selects an action based on the $\theta_k$ of each arm $k$ and then observes the reward associated with this choice, which is used to update the parameters $A_k$ and $b_k$ of the policy.
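As a concrete illustration of these quantities, the base-R sketch below accumulates $A_k$ and $b_k$ from a simulated history for a single arm and recovers $\theta_k$. The data and variable names are ours, for illustration only; they are not taken from the `contextual` package.

```r
# Simulated history for one arm k: contexts X_k and observed rewards y_k.
set.seed(42)
d   <- 3                                  # feature dimension
X_k <- matrix(rnorm(10 * d), nrow = 10)   # 10 rounds in which arm k was selected
y_k <- X_k %*% c(0.5, -1, 2) + rnorm(10, sd = 0.1)

# Sufficient statistics of the linear reward model.
A_k <- t(X_k) %*% X_k      # Gram matrix A_k = X_k' X_k
b_k <- t(X_k) %*% y_k      # response vector b_k = X_k' y_k

# Least squares estimate theta_k = A_k^{-1} b_k.
theta_k <- solve(A_k, b_k)
theta_k
```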
In Epsilon-Greedy, with exploration rate $\varepsilon$, the probability of selecting one of the best arms is:

$$P(A_t \mid X_t) = (1 - \varepsilon) \times \frac{1}{\#\text{best arms}} + \varepsilon \times \frac{1}{K}$$

while the probability of selecting an arm that is not among the best arms is:

$$P(A_t \mid X_t) = \varepsilon \times \frac{1}{K}$$
where $K$ is the total number of arms.

We define the least squares estimate as:

$$\theta_k = A_k^{-1} b_k \quad \text{(Least Squares estimate)}$$

Best arms are identified via the estimated expected reward:

$$\text{Expected reward} = X_t^T \theta_k$$
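A minimal sketch of how these two formulas can be evaluated in R, written by us for illustration (it is not the helper shipped with the package): given the per-arm estimates $\theta_k$, a context $X_t$, the chosen arm, and $\varepsilon$, it returns the selection probability.

```r
# Probability that an epsilon-greedy policy selects `chosen_arm` in context x_t.
# theta is a d x K matrix whose k-th column holds theta_k (layout assumed here).
eps_greedy_prob <- function(x_t, theta, chosen_arm, epsilon) {
  K      <- ncol(theta)
  scores <- as.vector(t(theta) %*% x_t)      # estimated expected reward X_t' theta_k
  best   <- which(scores == max(scores))     # set of best arms (ties kept)
  if (chosen_arm %in% best) {
    (1 - epsilon) / length(best) + epsilon / K
  } else {
    epsilon / K
  }
}

# Toy usage with 3 arms in 2 dimensions.
theta <- cbind(c(1, 0), c(0, 1), c(0.5, 0.5))
eps_greedy_prob(x_t = c(1, 0.2), theta = theta, chosen_arm = 1, epsilon = 0.1)
```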
LinUCB selects arms based on Upper Confidence Bounds (UCBs):

$$\mathrm{UCB}_k(X_t) = \mu_k(X_t) + \alpha \, \sigma_k(X_t)$$

where $\mu_k(X_t) = X_t^T \theta_k$ is the estimated expected reward of arm $k$, $\sigma_k(X_t) = \sqrt{X_t^T A_k^{-1} X_t}$ is its estimated uncertainty, and $\alpha$ controls the width of the confidence bound.
The action probabilities follow the same structure as Epsilon-Greedy, but with UCB scores in place of plain expected rewards, i.e. the probability of selecting one of the best arms is:

$$P(A_t \mid X_t) = (1 - \varepsilon) \times \frac{1}{\#\text{best arms}} + \varepsilon \times \frac{1}{K}$$

while the probability of selecting an arm that is not among the best arms is:

$$P(A_t \mid X_t) = \varepsilon \times \frac{1}{K}$$

where “best arms” are those with the highest UCB scores.
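The following sketch adapts the same reconstruction to UCB scores, using the standard disjoint-LinUCB definitions of $\mu_k$ and $\sigma_k$ given above; it is an illustration under those assumptions, not the package's internal implementation.

```r
# Probability that an epsilon-greedy LinUCB policy selects `chosen_arm` in x_t.
# A_list and b_list hold the per-arm sufficient statistics A_k and b_k.
linucb_eps_prob <- function(x_t, A_list, b_list, chosen_arm, epsilon, alpha) {
  K   <- length(A_list)
  ucb <- vapply(seq_len(K), function(k) {
    A_inv <- solve(A_list[[k]])
    mu    <- as.numeric(t(x_t) %*% A_inv %*% b_list[[k]])  # X_t' theta_k
    sigma <- sqrt(as.numeric(t(x_t) %*% A_inv %*% x_t))    # uncertainty width
    mu + alpha * sigma
  }, numeric(1))
  best <- which(ucb == max(ucb))                           # arms with highest UCB
  if (chosen_arm %in% best) {
    (1 - epsilon) / length(best) + epsilon / K
  } else {
    epsilon / K
  }
}

# Toy usage with 2 arms in 2 dimensions.
A_list <- list(diag(2), diag(2))
b_list <- list(c(1, 0), c(0, 1))
linucb_eps_prob(c(1, 0.2), A_list, b_list, chosen_arm = 1, epsilon = 0.1, alpha = 1.0)
```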
In Thompson Sampling, actions are sampled according to posterior draws and the action associated with the maximum value is chosen. The probability that arm $A_t$ is optimal is:

$$P(A_t \mid X_t) = P\left(\theta_{A_t}^T X_t > \theta_k^T X_t \;\; \forall\, k \neq A_t\right)$$

where $\theta_k \sim \mathcal{N}\left(A_k^{-1} b_k, \; \sigma^2 A_k^{-1}\right)$.

This requires computing a multivariate probability, which we approximate via adaptive numerical integration.
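One way to carry out this computation, sketched by us for illustration: under the disjoint model the projections $\theta_k^T X_t$ are independent Gaussians across arms, so the multivariate probability collapses to a one-dimensional integral that R's adaptive `integrate()` can evaluate. This is not necessarily the exact routine used internally by the helpers.

```r
# P(theta_a' x_t exceeds theta_k' x_t for all k != a) for linear Thompson Sampling.
# A_list, b_list: per-arm sufficient statistics; sigma2: posterior variance scale.
lin_ts_prob <- function(x_t, A_list, b_list, chosen_arm, sigma2 = 1) {
  K <- length(A_list)
  m <- numeric(K)   # posterior mean of theta_k' x_t
  s <- numeric(K)   # posterior standard deviation of theta_k' x_t
  for (k in seq_len(K)) {
    A_inv <- solve(A_list[[k]])
    m[k]  <- as.numeric(t(x_t) %*% A_inv %*% b_list[[k]])
    s[k]  <- sqrt(sigma2 * as.numeric(t(x_t) %*% A_inv %*% x_t))
  }
  a <- chosen_arm
  integrand <- function(v) {
    dens <- dnorm(v, mean = m[a], sd = s[a])               # chosen arm's draw equals v
    for (k in setdiff(seq_len(K), a)) {
      dens <- dens * pnorm(v, mean = m[k], sd = s[k])      # every other draw falls below v
    }
    dens
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value    # adaptive 1D integration
}

# Toy usage: two arms with identical statistics give probability 0.5.
A_list <- list(diag(2), diag(2))
b_list <- list(c(1, 1), c(1, 1))
lin_ts_prob(c(1, 0.5), A_list, b_list, chosen_arm = 1)
```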
When using your bandit policy in practice, pass `pi`, `arm`, and `reward` into `cram_bandit()` for evaluation of your policy.
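A hedged usage sketch (ours): assemble the three logged quantities and hand them to `cram_bandit()`. The argument names follow the text above, but the expected shapes (and whether `pi` must carry additional dimensions) should be verified against the `cram_bandit()` documentation, so the final call is left commented out.

```r
# Toy logged data from a bandit run.
set.seed(1)
T_steps <- 6
K       <- 3
arm     <- sample(seq_len(K), T_steps, replace = TRUE)    # arms actually chosen
reward  <- rbinom(T_steps, 1, prob = 0.5)                 # observed rewards
pi      <- matrix(1 / K, nrow = T_steps, ncol = T_steps)  # pi_t(X_j, A_j), uniform here

# Hand the logged quantities to cram_bandit() for evaluation
# (verify argument names and shapes against the package documentation):
# cram_bandit(pi = pi, arm = arm, reward = reward)
```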
## cram_bandit_sim()

The following only concerns the simulation framework we implemented for benchmarking purposes.
Once the policies are reconstructed, we compute their true expected value — referred to as the estimand — by applying the learned policy to independent contexts and evaluating it against the known reward function used in the simulation.
Accurately computing the estimand is critical for properly assessing the bias and confidence interval coverage of the Cram estimate in our simulations.
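For illustration, here is a Monte Carlo sketch of such an estimand computation under an assumed linear reward model and a greedy learned policy; the actual simulation engine may differ, and all names here are ours.

```r
# Monte Carlo estimate of the true expected value (estimand) of a learned policy.
set.seed(7)
d <- 2; K <- 3
theta_hat  <- matrix(rnorm(d * K), nrow = d)   # learned per-arm coefficients (toy)
true_theta <- matrix(rnorm(d * K), nrow = d)   # known reward coefficients of the simulation (toy)

n_mc     <- 10000
contexts <- matrix(rnorm(n_mc * d), ncol = d)  # independent evaluation contexts
chosen   <- max.col(contexts %*% theta_hat)    # greedy arm under the learned policy

# Known expected reward of the chosen arm in each context, averaged over contexts.
true_val <- rowSums(contexts * t(true_theta)[chosen, ])
estimand <- mean(true_val)
estimand
```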
- `contextual` package: original framework
- `cram_bandit()`: Cram evaluation for contextual bandits
- `cram_bandit_sim()`: full simulation engine with automatic pi estimation

These helper functions were designed to faithfully reconstruct action probabilities for the policies implemented in `contextual`, while enabling reproducible Cram-based evaluation.