# Answer Key: Logged Support Triage Policy

Use this as a calibration guide, not a single correct answer.

## Core Readout

The logs can support limited policy learning for state-action pairs that the behavior policy took often enough to yield reliable estimates. They should not justify unconstrained automation for high-urgency or high-risk tickets unless data support for those cells is strong and escalation guardrails are explicit.

## What A Strong Answer Should Say

- Summarize the behavior policy before evaluating any candidate policy.
- Identify weakly supported state-action pairs using action counts and low behavior probabilities.
- Treat resolved-in-24-hours as an incomplete reward: a policy can improve it while escalation rate, CSAT, fairness, or customer-tier risk gets worse.
- Recommend constraints: human fallback for high-urgency tickets, no learned routing where behavior probabilities are too low, and staged rollout with monitoring.
- Name guardrails: escalation rate, CSAT, time to resolution, customer-tier disparities, and incident severity.
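
The first two points above can be sketched in a few lines. This is a minimal illustration, not the exercise's reference solution: the state and action names, the toy log, and the `MIN_COUNT` and `MIN_PROB` thresholds are all assumptions chosen for the example.

```python
from collections import Counter, defaultdict

# Hypothetical logged (state, action) pairs; names and volumes are illustrative only.
LOGS = (
    [("low_urgency", "auto_reply")] * 6
    + [("low_urgency", "tier1")] * 3
    + [("low_urgency", "specialist")] * 1
    + [("high_urgency", "specialist")] * 4
    + [("high_urgency", "auto_reply")] * 1
)

MIN_COUNT = 3   # assumed minimum number of logged tries per state-action pair
MIN_PROB = 0.2  # assumed minimum empirical behavior probability


def summarize_behavior(logs):
    """Return {state: {action: (count, empirical behavior probability)}}."""
    counts = defaultdict(Counter)
    for state, action in logs:
        counts[state][action] += 1
    summary = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        summary[state] = {a: (n, n / total) for a, n in action_counts.items()}
    return summary


def weak_cells(summary, min_count=MIN_COUNT, min_prob=MIN_PROB):
    """List state-action pairs with too few tries or too low a behavior probability."""
    weak = []
    for state, actions in summary.items():
        for action, (n, p) in actions.items():
            if n < min_count or p < min_prob:
                weak.append((state, action))
    return sorted(weak)


summary = summarize_behavior(LOGS)
flagged = weak_cells(summary)
print(flagged)
```

Any pair that appears in `flagged` is a candidate for "no learned routing" under the constraints listed above. Real thresholds would depend on log volume and the variance of the reward.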

## Common Mistakes

- Choosing the action with the highest observed reward without checking support.
- Ignoring behavior-policy probabilities.
- Optimizing resolution speed without checking whether escalations increase.
- Deploying learned recommendations for high-risk states before offline evaluation is credible.
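
The first mistake in this list is easy to show with a toy cell. The sketch below contrasts a naive argmax over observed reward with a choice restricted to supported actions. The per-action statistics and the `min_count` threshold are hypothetical values, and returning `None` is just one possible way to signal "defer to the existing policy".

```python
# Hypothetical per-action statistics for one state; the numbers are made up.
CELL_STATS = {
    "auto_reply": {"count": 2, "mean_reward": 1.00},   # looks best, but only 2 tries
    "tier1": {"count": 140, "mean_reward": 0.71},
    "specialist": {"count": 55, "mean_reward": 0.78},
}


def naive_choice(cell_stats):
    """Pick the action with the highest observed mean reward, ignoring support."""
    return max(cell_stats, key=lambda a: cell_stats[a]["mean_reward"])


def supported_choice(cell_stats, min_count=30):
    """Pick the best action among those tried at least min_count times.

    Returns None when no action has enough support, meaning the caller
    should keep the existing behavior policy for this state.
    """
    supported = {a: s for a, s in cell_stats.items() if s["count"] >= min_count}
    if not supported:
        return None
    return max(supported, key=lambda a: supported[a]["mean_reward"])


naive = naive_choice(CELL_STATS)
careful = supported_choice(CELL_STATS)
print(naive, careful)
```

Here the naive rule selects an action observed only twice, while the support-aware rule falls back to the best well-observed action. A count threshold alone does not address the other mistakes in the list; guardrail metrics still need to be checked separately.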

## Instructor Notes

Ask: "Where would you force the old policy to remain in control?" Strong answers usually protect high-urgency, enterprise, data-loss, outage, and low-support cells.
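
One way to make the instructor question concrete is a routing guard that hands control back to the old policy in protected cells. This is a sketch under stated assumptions: the ticket fields (`urgency`, `tier`, `category`), the protected value sets, and the `min_count` threshold are hypothetical and not taken from the exercise data.

```python
# Assumed protected values, mirroring the cells named in the instructor note.
PROTECTED_URGENCY = {"high"}
PROTECTED_TIERS = {"enterprise"}
PROTECTED_CATEGORIES = {"data_loss", "outage"}


def choose_action(ticket, learned_action, legacy_action, cell_count, min_count=50):
    """Return (action, source), keeping the legacy policy in protected cells.

    cell_count is the number of logged examples for this ticket's state and
    the learned action; min_count is an assumed support threshold.
    """
    protected = (
        ticket["urgency"] in PROTECTED_URGENCY
        or ticket["tier"] in PROTECTED_TIERS
        or ticket["category"] in PROTECTED_CATEGORIES
        or cell_count < min_count
    )
    if protected:
        return legacy_action, "legacy"
    return learned_action, "learned"


routine = {"urgency": "low", "tier": "standard", "category": "billing"}
outage = {"urgency": "low", "tier": "standard", "category": "outage"}

routine_result = choose_action(routine, "auto_reply", "tier1", cell_count=200)
outage_result = choose_action(outage, "auto_reply", "specialist", cell_count=200)
sparse_result = choose_action(routine, "auto_reply", "tier1", cell_count=5)
print(routine_result, outage_result, sparse_result)
```

Returning the source alongside the action makes it straightforward to monitor how often the learned policy is actually in control during a staged rollout.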
