A/B Testing & Canary Deployment for Models
Model changes (retraining, architecture changes, prompt updates, provider model upgrades) carry risk. Controlled experimentation and canary deployment reduce blast radius and support evidence-based release decisions.
Experimentation Framework
Every model change SHOULD be evaluated through controlled experimentation when feasible.
Define before execution:
- Hypothesis — what improvement is expected and why
- Primary metric — the single metric that determines success
- Secondary metrics — safety, latency, cost, and user experience measures
- Minimum detectable effect (MDE) — the smallest improvement worth detecting
- Required sample size — calculated from MDE, baseline metric, and desired power
- Experiment duration — minimum runtime to achieve required sample size
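The required sample size above can be computed from the baseline rate, the MDE, the significance level, and the desired power. Below is a minimal stdlib-only sketch using the standard normal approximation for a two-proportion test; the function name and defaults are illustrative, not a mandated API.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples needed per arm to detect an absolute lift of `mde_abs`
    over `baseline_rate`, two-sided, via the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde_abs ** 2) + 1

# Detecting a +1 percentage-point lift on a 10% baseline at
# alpha=0.05 and 80% power requires roughly 15k users per arm.
n = sample_size_per_variant(0.10, 0.01)
```

Dividing the required sample size by expected daily eligible traffic gives the minimum experiment duration.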
Distinguish between:
- A/B tests — two variants (control vs. treatment)
- Multi-arm tests — multiple variants compared simultaneously
- Holdout tests — new model vs. no model (measures incremental value)
Traffic Splitting Standards
- Randomization SHOULD be at the user or session level (not the request level) so a given user does not see inconsistent behavior across requests.
- Segment isolation — experiment participants SHOULD NOT overlap across concurrent experiments on the same feature.
- Holdout groups — maintain a persistent holdout (5-10%) not exposed to model changes for long-term impact measurement.
- Traffic splitting SHOULD be configurable without code deployment (feature flags, configuration service).
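User-level randomization is commonly implemented by hashing the user ID with a per-experiment salt, which makes assignment deterministic (consistent experience) and independent across experiments (segment isolation). A minimal sketch, with illustrative names:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   weights: dict[str, float]) -> str:
    """Deterministically map user_id to a variant. The per-experiment
    salt decorrelates assignments across concurrent experiments."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish value in [0, 1]
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return list(weights)[-1]  # guard against floating-point rounding

variant = assign_variant("user-123", "exp-2024-ranker",
                         {"control": 0.5, "treatment": 0.5})
```

Because the same inputs always hash to the same bucket, the split percentages can live in a configuration service and be changed without redeploying code.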
Canary Deployment Procedures
Staged rollout percentages (adjust per risk tier):
| Stage | Traffic | Minimum Bake Time |
|---|---|---|
| Canary start | 1% | 24 hours (Tier 1), 72 hours (Tier 2/3) |
| Early rollout | 5% | 24 hours |
| Mid rollout | 25% | 48 hours |
| Late rollout | 50% | 48 hours |
| Full rollout | 100% | Ongoing monitoring |
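The stage table above can be encoded as data so the rollout controller enforces bake times mechanically. The sketch below assumes the tier-dependent canary bake time from the table; stage names and fields are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    traffic_pct: int
    min_bake_hours: int

def stages_for_tier(tier: int) -> list[Stage]:
    """Staged rollout schedule; Tier 1 canaries bake 24h, Tier 2/3 bake 72h."""
    canary_bake = 24 if tier == 1 else 72
    return [
        Stage("canary", 1, canary_bake),
        Stage("early", 5, 24),
        Stage("mid", 25, 48),
        Stage("late", 50, 48),
        Stage("full", 100, 0),  # full rollout: ongoing monitoring, no further gate
    ]

def may_progress(stage: Stage, hours_at_stage: float, healthy: bool) -> bool:
    """Advance only after the minimum bake time, and only if metrics are healthy."""
    return healthy and hours_at_stage >= stage.min_bake_hours
```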
Auto-halt criteria — define metric thresholds that automatically stop canary progression:
- Error rate increase > X% above baseline
- Latency P95 degradation > Y ms
- Safety violation rate increase
- User-reported quality decrease
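The auto-halt criteria above reduce to a threshold check per metric. A minimal sketch, in which the threshold defaults stand in for the X% and Y ms placeholders and would be configured per deployment:

```python
def should_halt(baseline: dict, canary: dict,
                max_error_rate_lift_pct: float = 20.0,   # placeholder for X%
                max_p95_latency_delta_ms: float = 200.0  # placeholder for Y ms
                ) -> list[str]:
    """Return the list of tripped halt criteria; any non-empty
    result stops canary progression."""
    reasons = []
    if baseline["error_rate"] > 0:
        lift = ((canary["error_rate"] - baseline["error_rate"])
                / baseline["error_rate"] * 100)
        if lift > max_error_rate_lift_pct:
            reasons.append(f"error rate +{lift:.1f}% over baseline")
    if canary["p95_latency_ms"] - baseline["p95_latency_ms"] > max_p95_latency_delta_ms:
        reasons.append("p95 latency degradation")
    if canary["safety_violation_rate"] > baseline["safety_violation_rate"]:
        reasons.append("safety violation rate increased")
    return reasons
```

Returning the full list of tripped criteria, rather than a bare boolean, gives operators the context needed for the halt decision record.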
Manual approval gates — Tier 2 and Tier 3 model changes SHOULD require human approval before progressing past 25%.
Statistical Rigor
- Significance threshold: p < 0.05 minimum (p < 0.01 for high-risk features)
- Statistical power: 80% minimum
- Multiple comparison correction: apply Bonferroni or equivalent when testing multiple metrics
- Sequential testing: when using sequential analysis for early stopping, use appropriate methods (e.g., always-valid p-values, spending functions)
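For a conversion-style primary metric, the significance check with Bonferroni correction can be sketched as follows (stdlib only; a pooled two-proportion z-test, with illustrative names):

```python
from statistics import NormalDist

def two_proportion_p_value(successes_a: int, n_a: int,
                           successes_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: the two conversion rates are equal
    (pooled two-proportion z-test, normal approximation)."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def significant(p_value: float, alpha: float = 0.05, n_metrics: int = 1) -> bool:
    """Bonferroni correction: divide alpha by the number of metrics tested."""
    return p_value < alpha / n_metrics
```

Note that a result significant for a single metric can fail the Bonferroni-adjusted threshold once several metrics are tested, which is exactly the cherry-picking failure mode the correction guards against.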
Guard against:
- Peeking — acting on interim results before reaching the required sample size
- Underpowered experiments — running too short or with too little traffic
- Cherry-picking metrics — declaring success based on whichever metric improved
- Simpson's paradox — verify that segment-level results are consistent with the aggregate
Experiment Lifecycle
Design → Review → Execute → Analyze → Decision → Document
- Design: Define hypothesis, metrics, sample size, duration
- Review: Peer review of experiment design (required for Tier 2/3)
- Execute: Deploy with traffic splitting and monitoring
- Analyze: Statistical analysis at experiment conclusion
- Decision: Go / No-Go / Extend based on results
- Document: Record what was decided, why, metric results, and approver
Archive experiment results for organizational learning.
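The Document and archive steps above might be captured with a small structured record; the schema below is an illustrative sketch, not a mandated format, and all field values are hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExperimentRecord:
    """One archived experiment decision: what was decided, why,
    the metric results, and the approver."""
    experiment_id: str
    hypothesis: str
    decision: str          # "go" | "no-go" | "extend"
    rationale: str
    approver: str
    metric_results: dict[str, float] = field(default_factory=dict)

record = ExperimentRecord(
    experiment_id="exp-2024-ranker",          # hypothetical experiment
    hypothesis="New ranker lifts CTR by >= 1pp",
    decision="go",
    rationale="Primary metric +1.3pp, p < 0.01; no guardrail regressions",
    approver="j.doe",
    metric_results={"ctr_lift_pp": 1.3, "p95_latency_delta_ms": 12.0},
)
archived = json.dumps(asdict(record))  # write to the experiment archive
```

Keeping these records queryable lets future experiment designs start from measured baselines rather than guesses.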