A/B Testing & Canary Deployment for Models

Model changes (retraining, architecture changes, prompt updates, provider model upgrades) carry risk. Controlled experimentation and canary deployment reduce the blast radius of a bad change and support evidence-based release decisions.

Experimentation Framework

Every model change SHOULD be evaluated through controlled experimentation when feasible.

Define before execution:

  • Hypothesis — what improvement is expected and why
  • Primary metric — the single metric that determines success
  • Secondary metrics — safety, latency, cost, and user experience measures
  • Minimum detectable effect (MDE) — the smallest improvement worth detecting
  • Required sample size — calculated from MDE, baseline metric, and desired power
  • Experiment duration — minimum runtime to achieve required sample size

Distinguish between:

  • A/B tests — two variants (control vs. treatment)
  • Multi-arm tests — multiple variants compared simultaneously
  • Holdout tests — new model vs. no model (measures incremental value)

Traffic Splitting Standards

  • Randomization SHOULD be at the user or session level (not the request level), so a single user does not see inconsistent behavior across requests.
  • Segment isolation — experiment participants SHOULD NOT overlap across concurrent experiments on the same feature.
  • Holdout groups — maintain a persistent holdout (5-10%) not exposed to model changes for long-term impact measurement.
  • Traffic splitting SHOULD be configurable without code deployment (feature flags, configuration service).
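
User-level randomization is commonly implemented with deterministic hashing, which keeps assignment stable across requests without storing state. The sketch below assumes a `(experiment, user_id)` hash key so that assignments are independent across experiments (supporting segment isolation); all names are illustrative.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   weights: dict[str, float]) -> str:
    """Deterministically assign a user to a variant.

    Hashing the (experiment, user_id) pair makes assignment stable for a
    given user and uncorrelated across different experiments. Weights
    should sum to 1.0 (e.g. {"control": 0.95, "treatment": 0.05}).
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return variant
    return variant  # float-rounding fallback: last variant
```

Because the split is pure configuration (the `weights` dict), ramping a variant up or down requires no code deployment, consistent with the feature-flag requirement above.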

Canary Deployment Procedures

Staged rollout percentages (adjust per risk tier):

| Stage | Traffic | Minimum Bake Time |
| --- | --- | --- |
| Canary start | 1% | 24 hours (Tier 1), 72 hours (Tier 2/3) |
| Early rollout | 5% | 24 hours |
| Mid rollout | 25% | 48 hours |
| Late rollout | 50% | 48 hours |
| Full rollout | 100% | Ongoing monitoring |
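
The staged schedule can be represented as plain data, with progression gated on elapsed bake time. This is a minimal sketch of the Tier 1 schedule from the table above (Tier 2/3 would use a 72-hour canary start); stage names and the helper are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CanaryStage:
    name: str
    traffic_pct: int
    min_bake_hours: int

# Tier 1 schedule; full rollout has no further stage, only ongoing monitoring.
TIER1_STAGES = [
    CanaryStage("canary_start", 1, 24),
    CanaryStage("early_rollout", 5, 24),
    CanaryStage("mid_rollout", 25, 48),
    CanaryStage("late_rollout", 50, 48),
    CanaryStage("full_rollout", 100, 0),
]

def next_stage(current: CanaryStage, hours_at_stage: float) -> CanaryStage:
    """Advance to the next stage only after the minimum bake time elapses."""
    if hours_at_stage < current.min_bake_hours:
        return current
    idx = TIER1_STAGES.index(current)
    return TIER1_STAGES[min(idx + 1, len(TIER1_STAGES) - 1)]
```

In a real deployment controller this advancement check would also consult the auto-halt criteria and manual approval gates described below before promoting a stage.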

Auto-halt criteria — define metric thresholds that automatically stop canary progression:

  • Error rate increase > X% above baseline
  • Latency P95 degradation > Y ms
  • Safety violation rate increase
  • User-reported quality decrease
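
An auto-halt check is typically a pure function comparing canary metrics against the baseline. The thresholds below are illustrative placeholders standing in for the X/Y values each team must define; metric names are assumptions.

```python
def should_halt(baseline: dict, canary: dict,
                max_error_increase_pct: float = 10.0,
                max_p95_degradation_ms: float = 100.0) -> bool:
    """Return True if any auto-halt criterion is breached.

    Thresholds are placeholders; set them per service and risk tier.
    """
    error_increase = ((canary["error_rate"] - baseline["error_rate"])
                      / max(baseline["error_rate"], 1e-9)) * 100
    p95_degradation = canary["latency_p95_ms"] - baseline["latency_p95_ms"]
    return (error_increase > max_error_increase_pct
            or p95_degradation > max_p95_degradation_ms
            or canary["safety_violation_rate"] > baseline["safety_violation_rate"])
```

The safety criterion is deliberately strict (any increase halts progression), whereas error and latency tolerate bounded regressions; halting should page the owning team rather than silently rolling back.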

Manual approval gates — Tier 2 and Tier 3 model changes SHOULD require human approval before progressing past 25%.

Statistical Rigor

  • Significance threshold: p < 0.05 minimum (p < 0.01 for high-risk features)
  • Statistical power: 80% minimum
  • Multiple comparison correction: apply Bonferroni or equivalent when testing multiple metrics
  • Sequential testing: when using sequential analysis for early stopping, use appropriate methods (e.g., always-valid p-values, spending functions)
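
Bonferroni correction for multiple metrics is simple to apply: test each of the m metrics at alpha / m, which bounds the family-wise error rate at alpha. A minimal sketch:

```python
def bonferroni_significant(p_values: dict[str, float],
                           alpha: float = 0.05) -> dict[str, bool]:
    """Flag which metrics remain significant after Bonferroni correction.

    With m metrics, each is tested at alpha / m, holding the probability
    of any false positive across the family at or below alpha.
    """
    corrected_alpha = alpha / len(p_values)
    return {metric: p < corrected_alpha for metric, p in p_values.items()}
```

Bonferroni is conservative; less strict alternatives (e.g. Holm or Benjamini-Hochberg) trade some family-wise protection for power, which may be appropriate for secondary metrics.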

Guard against:

  • Peeking — acting on interim results before the required sample size is reached, which inflates the false-positive rate unless a sequential method is used
  • Underpowered experiments — running too short or with too little traffic
  • Cherry-picking metrics — declaring success based on whichever metric improved
  • Simpson's paradox — verify that segment-level results agree in direction with the aggregate result
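
The Simpson's paradox check can be automated by comparing the direction of the lift in each segment against the pooled lift. This sketch assumes per-segment success counts and sample sizes for control and treatment; the function name and tuple layout are illustrative.

```python
def segment_consistency(segments: list[tuple[int, int, int, int]]) -> bool:
    """Check that every segment's lift direction matches the pooled lift.

    Each segment is (control_successes, control_n, treat_successes, treat_n).
    Returns False when any segment's lift flips sign relative to the pooled
    result, signalling a possible Simpson's paradox worth investigating.
    """
    cs = sum(s[0] for s in segments)
    cn = sum(s[1] for s in segments)
    ts = sum(s[2] for s in segments)
    tn = sum(s[3] for s in segments)
    pooled_lift = ts / tn - cs / cn
    for c_s, c_n, t_s, t_n in segments:
        seg_lift = t_s / t_n - c_s / c_n
        if seg_lift * pooled_lift < 0:
            return False  # direction flips: do not trust the aggregate alone
    return True
```

A failed check does not automatically invalidate the experiment, but it does mean the aggregate number should not be reported without a segment-level explanation.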

Experiment Lifecycle

Design → Review → Execute → Analyze → Decision → Document
  • Design: Define hypothesis, metrics, sample size, duration
  • Review: Peer review of experiment design (required for Tier 2/3)
  • Execute: Deploy with traffic splitting and monitoring
  • Analyze: Statistical analysis at experiment conclusion
  • Decision: Go / No-Go / Extend based on results
  • Document: Record what was decided, why, metric results, and approver

Archive experiment results for organizational learning.
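
The Document and archive steps benefit from a fixed, machine-readable record schema. The field names below are illustrative, not a mandated format, but they cover the items the Document step requires (decision, rationale, metric results, approver).

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ExperimentRecord:
    """Minimal archival record for a concluded experiment."""
    name: str
    hypothesis: str
    decision: str               # "go" | "no-go" | "extend"
    rationale: str
    primary_metric_result: str
    approver: str
    concluded_on: str           # ISO date

    def to_json(self) -> str:
        """Serialize for the experiment archive."""
        return json.dumps(asdict(self), indent=2)
```

Storing these records in a searchable archive lets later teams discover which hypotheses were already tested and why a given decision was made.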

Cross-References