A/B Testing & Canary Deployment for Models
Model changes (retraining, architecture changes, prompt updates, provider model upgrades) carry risk. Controlled experimentation and canary deployment reduce blast radius and support evidence-based release decisions.
Experimentation Framework
Every model change SHOULD be evaluated through controlled experimentation when feasible.
Define before execution:
- Hypothesis — what improvement is expected and why
- Primary metric — the single metric that determines success
- Secondary metrics — safety, latency, cost, and user experience measures
- Minimum detectable effect (MDE) — the smallest improvement worth detecting
- Required sample size — calculated from MDE, baseline metric, and desired power
- Experiment duration — minimum runtime to achieve required sample size
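The required sample size above can be computed from the baseline rate, the MDE, the significance level, and the desired power. Below is a minimal stdlib-only sketch using the standard normal approximation for a two-proportion test; the function name and defaults are illustrative, not a mandated API.

```python
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, mde_abs: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Samples needed per arm to detect an absolute lift of `mde_abs`
    over `baseline_rate`, two-sided, via the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / mde_abs ** 2) + 1

# Detecting a +1 percentage-point lift on a 10% baseline at
# alpha=0.05 and 80% power requires roughly 15k users per arm.
n = sample_size_per_variant(0.10, 0.01)
```

Dividing the required sample size by expected daily eligible traffic gives the minimum experiment duration.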
Distinguish between:
- A/B tests — two variants (control vs. treatment)
- Multi-arm tests — multiple variants compared simultaneously
- Holdout tests — new model vs. no model (measures incremental value)
Traffic Splitting Standards
- Randomization SHOULD be at the user or session level (not the request level) so a given user does not see inconsistent behavior across requests.
- Segment isolation — experiment participants SHOULD NOT overlap across concurrent experiments on the same feature.
- Holdout groups — maintain a persistent holdout (5-10%) not exposed to model changes for long-term impact measurement.
- Traffic splitting SHOULD be configurable without code deployment (feature flags, configuration service).
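User-level randomization is commonly implemented by hashing the user ID with a per-experiment salt, which makes assignment deterministic (consistent experience) and independent across experiments (segment isolation). A minimal sketch, with illustrative names:

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str,
                   weights: dict[str, float]) -> str:
    """Deterministically map user_id to a variant. The per-experiment
    salt decorrelates assignments across concurrent experiments."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform-ish value in [0, 1]
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return list(weights)[-1]  # guard against floating-point rounding

variant = assign_variant("user-123", "exp-2024-ranker",
                         {"control": 0.5, "treatment": 0.5})
```

Because the same inputs always hash to the same bucket, the split percentages can live in a configuration service and be changed without redeploying code.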
Canary Deployment Procedures
Staged rollout percentages (adjust per risk tier):
| Stage | Traffic | Minimum Bake Time |
|---|---|---|
| Canary start | 1% | 24 hours (Tier 1), 72 hours (Tier 2/3) |
| Early rollout | 5% | 24 hours |
| Mid rollout | 25% | 48 hours |
| Late rollout | 50% | 48 hours |
| Full rollout | 100% | Ongoing monitoring |
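The stage table above can be encoded as data so the rollout controller enforces bake times mechanically. The sketch below assumes the tier-dependent canary bake time from the table; stage names and fields are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    traffic_pct: int
    min_bake_hours: int

def stages_for_tier(tier: int) -> list[Stage]:
    """Staged rollout schedule; Tier 1 canaries bake 24h, Tier 2/3 bake 72h."""
    canary_bake = 24 if tier == 1 else 72
    return [
        Stage("canary", 1, canary_bake),
        Stage("early", 5, 24),
        Stage("mid", 25, 48),
        Stage("late", 50, 48),
        Stage("full", 100, 0),  # full rollout: ongoing monitoring, no further gate
    ]

def may_progress(stage: Stage, hours_at_stage: float, healthy: bool) -> bool:
    """Advance only after the minimum bake time, and only if metrics are healthy."""
    return healthy and hours_at_stage >= stage.min_bake_hours
```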
Auto-halt criteria — define metric thresholds that automatically stop canary progression:
- Error rate increase > X% above baseline
- Latency P95 degradation > Y ms
- Safety violation rate increase
- User-reported quality decrease
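The auto-halt criteria above reduce to a threshold check per metric. A minimal sketch, in which the threshold defaults stand in for the X% and Y ms placeholders and would be configured per deployment:

```python
def should_halt(baseline: dict, canary: dict,
                max_error_rate_lift_pct: float = 20.0,   # placeholder for X%
                max_p95_latency_delta_ms: float = 200.0  # placeholder for Y ms
                ) -> list[str]:
    """Return the list of tripped halt criteria; any non-empty
    result stops canary progression."""
    reasons = []
    if baseline["error_rate"] > 0:
        lift = ((canary["error_rate"] - baseline["error_rate"])
                / baseline["error_rate"] * 100)
        if lift > max_error_rate_lift_pct:
            reasons.append(f"error rate +{lift:.1f}% over baseline")
    if canary["p95_latency_ms"] - baseline["p95_latency_ms"] > max_p95_latency_delta_ms:
        reasons.append("p95 latency degradation")
    if canary["safety_violation_rate"] > baseline["safety_violation_rate"]:
        reasons.append("safety violation rate increased")
    return reasons
```

Returning the full list of tripped criteria, rather than a bare boolean, gives operators the context needed for the halt decision record.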
Manual approval gates — Tier 2 and Tier 3 model changes SHOULD require human approval before progressing past 25%.
Statistical Rigor
- Significance threshold: p < 0.05 minimum (p < 0.01 for high-risk features)
- Statistical power: 80% minimum
- Multiple comparison correction: apply Bonferroni or equivalent when testing multiple metrics
- Sequential testing: when using sequential analysis for early stopping, use appropriate methods (e.g., always-valid p-values, spending functions)
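For a conversion-style primary metric, the significance check with Bonferroni correction can be sketched as follows (stdlib only; a pooled two-proportion z-test, with illustrative names):

```python
from statistics import NormalDist

def two_proportion_p_value(successes_a: int, n_a: int,
                           successes_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: the two conversion rates are equal
    (pooled two-proportion z-test, normal approximation)."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (successes_a / n_a - successes_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def significant(p_value: float, alpha: float = 0.05, n_metrics: int = 1) -> bool:
    """Bonferroni correction: divide alpha by the number of metrics tested."""
    return p_value < alpha / n_metrics
```

Note that a result significant for a single metric can fail the Bonferroni-adjusted threshold once several metrics are tested, which is exactly the cherry-picking failure mode the correction guards against.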
Guard against:
- Peeking — acting on interim results before reaching the required sample size
- Underpowered experiments — running too short or with too little traffic
- Cherry-picking metrics — declaring success based on whichever metric improved
- Simpson's paradox — verify that segment-level results are consistent with the aggregate
Experiment Lifecycle
Design → Review → Execute → Analyze → Decision → Document
- Design: Define hypothesis, metrics, sample size, duration
- Review: Peer review of experiment design (required for Tier 2/3)
- Execute: Deploy with traffic splitting and monitoring
- Analyze: Statistical analysis at experiment conclusion
- Decision: Go / No-Go / Extend based on results
- Document: Record what was decided, why, metric results, and approver
Archive experiment results for organizational learning.
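The Document and archive steps above might be captured with a small structured record; the schema below is an illustrative sketch, not a mandated format, and all field values are hypothetical.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExperimentRecord:
    """One archived experiment decision: what was decided, why,
    the metric results, and the approver."""
    experiment_id: str
    hypothesis: str
    decision: str          # "go" | "no-go" | "extend"
    rationale: str
    approver: str
    metric_results: dict[str, float] = field(default_factory=dict)

record = ExperimentRecord(
    experiment_id="exp-2024-ranker",          # hypothetical experiment
    hypothesis="New ranker lifts CTR by >= 1pp",
    decision="go",
    rationale="Primary metric +1.3pp, p < 0.01; no guardrail regressions",
    approver="j.doe",
    metric_results={"ctr_lift_pp": 1.3, "p95_latency_delta_ms": 12.0},
)
archived = json.dumps(asdict(record))  # write to the experiment archive
```

Keeping these records queryable lets future experiment designs start from measured baselines rather than guesses.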