# Final Assessment: Reinforcement Learning

## Assignment

Design an offline policy improvement plan for a logged decision system.

Your plan must define the state, action set, reward, guardrails, logged data requirements, off-policy evaluation approach, support checks, and deployment plan.

## Required Artifact

Submit an offline policy improvement plan.

Minimum sections:

- decision system
- state representation
- action set
- reward
- guardrails
- logged data requirements
- behavior policy and support checks
- off-policy evaluation method
- conservative policy constraints
- staged rollout plan
- strongest critique
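
As one illustration of the "logged data requirements" section, the sketch below shows a possible record for a single logged decision. The class name `LoggedDecision` and all field names are hypothetical; a real plan would adapt them to its own decision system.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class LoggedDecision:
    """One logged decision. All names here are illustrative assumptions."""

    state: Dict[str, Any]          # features known at decision time only
    available_actions: List[str]   # full action set offered for this decision
    chosen_action: str             # action the behavior policy actually took
    propensity: float              # behavior policy probability of chosen_action
    outcome: float                 # observed reward signal
    active_constraints: List[str] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the record is usable."""
        problems = []
        if self.chosen_action not in self.available_actions:
            problems.append("chosen_action not in available_actions")
        if not (0.0 < self.propensity <= 1.0):
            problems.append("propensity must be in (0, 1]")
        return problems
```

Running `validate()` over historical logs before any evaluation is one way to test the assumption that the logs are sufficient, rather than taking it for granted.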

## Rubric

| Criterion | Strong | Needs revision |
|---|---|---|
| Decision framing | Defines the operational decision and cadence | Describes an algorithm without a decision |
| State/action design | Uses only information available at decision time | Leaks future information into state |
| Reward | Connects reward to a real outcome and names delayed effects | Optimizes a proxy without caveats |
| Guardrails | Protects against user, business, system, and fairness risks | Optimizes one metric only |
| Logs | Requires action set, chosen action, propensities, outcomes, and constraints | Assumes logs are sufficient by default |
| OPE | Matches method to available propensities and support | Trusts simulated reward alone |
| Deployment | Uses shadow mode, human review, canary, randomized test, and ramp criteria | Jumps from offline estimate to full launch |
| Critique | Identifies reward hacking, support gaps, or historical bias | Treats policy learning as purely technical |
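
For the OPE criterion, the following is a minimal sketch of one common choice when behavior propensities are logged: inverse propensity scoring (IPS) and its self-normalized variant (SNIPS), reported with an effective sample size as a rough support signal. The function name, argument names, and the optional `clip` parameter are assumptions for illustration. This is not the only acceptable method, and a plan without logged propensities would need a different approach.

```python
from typing import Optional, Sequence, Tuple


def ips_and_snips(
    rewards: Sequence[float],
    behavior_probs: Sequence[float],
    target_probs: Sequence[float],
    clip: Optional[float] = None,
) -> Tuple[float, float, float]:
    """Return (IPS estimate, SNIPS estimate, effective sample size).

    behavior_probs[i]: logged probability of the action taken in record i.
    target_probs[i]: probability the candidate policy gives that same action.
    clip: optional cap on importance weights (lowers variance, adds bias).
    """
    if not (len(rewards) == len(behavior_probs) == len(target_probs)):
        raise ValueError("inputs must have equal length")
    if len(rewards) == 0:
        raise ValueError("no logged records")

    weights = []
    for p_b, p_t in zip(behavior_probs, target_probs):
        if p_b <= 0.0:
            raise ValueError("behavior propensity must be positive")
        w = p_t / p_b
        if clip is not None:
            w = min(w, clip)
        weights.append(w)

    weighted_sum = sum(w * r for w, r in zip(weights, rewards))
    total_w = sum(weights)
    sum_sq = sum(w * w for w in weights)

    ips = weighted_sum / len(weights)
    snips = weighted_sum / total_w if total_w > 0 else float("nan")
    ess = (total_w ** 2) / sum_sq if sum_sq > 0 else 0.0
    return ips, snips, ess
```

An effective sample size far below the number of logged records suggests that the estimate rests on a handful of heavily weighted decisions, which is the kind of support gap the rubric asks the plan to surface.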

## Pass Criteria

Pass if the plan is complete and specific enough for ML, product, and operations reviewers to assess before a learned policy is deployed.

## Submission Checklist

- [ ] State uses decision-time variables only.
- [ ] Action set includes fallback and human-review cases.
- [ ] Reward and guardrails are separate.
- [ ] Behavior policy probabilities (propensities) are addressed, including whether they were logged.
- [ ] Support gaps are handled conservatively.
- [ ] Offline evaluation is not treated as final proof.
- [ ] Rollout has stop criteria.
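
The checklist items on fallback actions and support gaps can be sketched as a simple count-based rule: if the logs contain too few examples of a proposed action in a given context, use the fallback action instead. The function names, the `"fallback"` action label, the context bucketing, and the `min_count` threshold of 30 are all hypothetical choices for illustration, not prescribed values.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def build_support(logs: Iterable[Tuple[Hashable, str]]) -> Counter:
    """Count logged (context_bucket, action) pairs."""
    return Counter(logs)


def conservative_action(
    context_bucket: Hashable,
    proposed_action: str,
    support: Counter,
    fallback: str = "fallback",
    min_count: int = 30,
) -> str:
    """Keep the proposed action only if the logs support it in this context."""
    # Counter returns 0 for unseen pairs, so unsupported actions fall back.
    if support[(context_bucket, proposed_action)] >= min_count:
        return proposed_action
    return fallback


# Usage with made-up data: action "A" is well supported, "B" is not.
logs = [("new_user", "A")] * 40 + [("new_user", "B")] * 3
support = build_support(logs)
print(conservative_action("new_user", "A", support))
print(conservative_action("new_user", "B", support))
```

A count threshold is a deliberately crude support check. A plan could justify a different rule, such as a minimum behavior propensity, as long as unsupported actions are not deployed on the strength of an offline estimate alone.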

## Certificate Language

Completed Reinforcement Learning by producing an offline policy improvement plan with support checks, off-policy evaluation, guardrails, and staged deployment criteria.