# Workshop Guide: Reinforcement Learning

## Audience

ML engineers, data scientists, product teams, and operators designing adaptive policies or logged-decision systems.

## 60-Minute Agenda

1. 0-10 min: Pick a repeated decision system.
2. 10-20 min: Define state, action set, and reward.
3. 20-35 min: Identify logs and support requirements.
4. 35-50 min: Design guardrails and fallback behavior.
5. 50-60 min: Share rollout plan and biggest safety risk.

## 90-Minute Agenda

1. 0-10 min: Review the worked offline policy plan.
2. 10-25 min: Teams frame decision, state, actions, and reward.
3. 25-40 min: Teams audit logged data and behavior-policy probabilities.
4. 40-55 min: Choose an off-policy evaluation approach and support checks.
5. 55-75 min: Design guardrails, fallback policy, and staged rollout.
6. 75-90 min: Group critique: where could the optimizer exploit artifacts?

## Team Exercise

Each team produces an offline policy improvement plan with:

- decision system
- state representation
- action set
- reward
- guardrails
- logged data requirements
- off-policy evaluation (OPE) method
- support checks
- deployment plan
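
To keep plans comparable across teams, the checklist above could be captured as a small template. This is only a sketch: the `PolicyPlan` class, its field names, and the `missing_sections` helper are hypothetical names invented for this guide, and the example entries are made up for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class PolicyPlan:
    """One field per item in the team exercise checklist (hypothetical template)."""
    decision_system: str = ""
    state_representation: str = ""
    action_set: str = ""
    reward: str = ""
    guardrails: str = ""
    logged_data_requirements: str = ""
    ope_method: str = ""
    support_checks: str = ""
    deployment_plan: str = ""


def missing_sections(plan: PolicyPlan) -> list[str]:
    """Return the names of checklist items the team has left blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Illustrative, made-up draft: only three sections are filled in so far.
draft = PolicyPlan(
    decision_system="Which reminder channel to use for overdue invoices",
    action_set="email, sms, no reminder",
    reward="invoice paid within 14 days",
)
print(missing_sections(draft))
```

A facilitator could run a check like this at the 75-minute mark to see which sections each team still owes.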

## Discussion Prompts

- Which actions are unsupported in the logs?
- What would reward hacking look like?
- Which state variables leak future information?
- Which decisions require human review?
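
For the first prompt, a quick count over the logs is often enough to start the conversation. The sketch below assumes the logs can be reduced to a flat list of action names; the function name `unsupported_actions`, the `min_count` threshold, and the sample data are illustrative assumptions. It checks marginal support only, so an action that appears overall may still be unsupported in particular states.

```python
from collections import Counter


def unsupported_actions(logged_actions, action_set, min_count=1):
    """Return actions that appear fewer than min_count times in the logs."""
    counts = Counter(logged_actions)
    # Counter returns 0 for actions that never appear in the logs.
    return sorted(a for a in action_set if counts[a] < min_count)


# Made-up log of past decisions and the full action set under consideration.
logs = ["email", "email", "sms", "email", "none", "sms"]
actions = {"email", "sms", "push", "none"}

print(unsupported_actions(logs, actions))               # ['push']
print(unsupported_actions(logs, actions, min_count=2))  # ['none', 'push']
```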

## Facilitator Notes

Keep teams from jumping straight to algorithms. The core skill is policy safety: logging, support, evaluation, guardrails, and staged deployment.
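
If teams struggle to picture what a guardrail with fallback looks like, a sketch like the following may help. It assumes the learned policy, the fallback policy, and the guardrail are plain functions; `guarded_action`, the discount scenario, and every field name are hypothetical and chosen only for illustration.

```python
def guarded_action(state, learned_policy, fallback_policy, is_allowed):
    """Use the learned policy's action unless the guardrail rejects it."""
    proposed = learned_policy(state)
    if is_allowed(state, proposed):
        return proposed, "learned"
    # Guardrail tripped: defer to the known-safe fallback policy.
    return fallback_policy(state), "fallback"


# Hypothetical policies and guardrail for a discount-offer decision.
def learned(state):
    return "large_discount" if state["churn_risk"] > 0.5 else "no_offer"


def fallback(state):
    return "no_offer"


def within_budget(state, action):
    return not (action == "large_discount" and state["budget_left"] <= 0)


print(guarded_action({"churn_risk": 0.9, "budget_left": 0}, learned, fallback, within_budget))
print(guarded_action({"churn_risk": 0.9, "budget_left": 10}, learned, fallback, within_budget))
```

Returning the source label alongside the action makes it easy to log how often the fallback fires, which is itself a useful rollout metric.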

Common failure modes:

- optimizing one reward without guardrails
- ignoring behavior-policy probabilities
- trusting value estimates for actions or states the logs do not cover
- deploying from offline estimates alone
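
To show why behavior-policy probabilities matter, the sketch below implements a basic inverse propensity scoring (IPS) estimate, one common OPE method, together with an effective sample size as a rough support signal. It assumes each log record is a `(state, action, reward, behavior_prob)` tuple and that the target policy exposes action probabilities; the function names and the toy data are assumptions for illustration, not a prescribed method.

```python
def ips_estimate(records, target_prob):
    """Estimate the target policy's value from logged data via IPS.

    records: iterable of (state, action, reward, behavior_prob) tuples.
    target_prob: function (state, action) -> probability under the target policy.
    Returns (estimated value, effective sample size).
    """
    weights = []
    weighted_rewards = []
    for state, action, reward, behavior_prob in records:
        if behavior_prob <= 0:
            raise ValueError("logged action must have positive behavior probability")
        w = target_prob(state, action) / behavior_prob  # importance weight
        weights.append(w)
        weighted_rewards.append(w * reward)

    value = sum(weighted_rewards) / len(weights)
    sum_sq = sum(w * w for w in weights)
    # A small effective sample size means few logged records carry the estimate.
    ess = (sum(weights) ** 2) / sum_sq if sum_sq > 0 else 0.0
    return value, ess


# Toy logs: the behavior policy chose "a" or "b" with probability 0.5 each.
logs = [
    ("s", "a", 1.0, 0.5),
    ("s", "b", 0.0, 0.5),
    ("s", "a", 1.0, 0.5),
    ("s", "b", 0.0, 0.5),
]


def always_a(state, action):
    """Target policy that always picks action "a"."""
    return 1.0 if action == "a" else 0.0


value, ess = ips_estimate(logs, always_a)
print(value, ess)  # 1.0 2.0
```

Here only two of the four logged records inform the estimate, which the effective sample size of 2.0 makes visible. Even a well-supported estimate like this is an input to a staged rollout, not a substitute for one.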

## Review Standard

Use `final-assessment.md` as the rubric. A strong plan is complete enough for ML, product, and operations to review before a learned policy is deployed.