Canary Releases
Route a small percentage of production traffic to the new version. Monitor key metrics. Gradually increase traffic if metrics are healthy. Automatically roll back if thresholds are breached.
Concept
A canary release exposes the new version to a small subset of users before rolling it out to the entire fleet. The name comes from the "canary in a coal mine" -- if the canary (small traffic slice) shows problems, the full rollout is halted before it affects all users.
Unlike blue-green (which is all-or-nothing), canary deployments are progressive. Traffic is shifted incrementally -- typically starting at 1-5% and increasing through defined steps until reaching 100%. At each step, automated analysis determines whether to proceed, pause, or roll back.
Traffic Splitting Percentages
A typical progressive delivery schedule is shown below. Each step has a soak time during which metrics are evaluated before traffic is increased further.
| Step | Canary Traffic | Soak Time | Purpose |
|---|---|---|---|
| 1 | 1% | 5 minutes | Smoke test -- catch startup crashes and critical failures |
| 2 | 5% | 10 minutes | Validate error rate and latency under light load |
| 3 | 25% | 15 minutes | Stress test with meaningful traffic volume |
| 4 | 50% | 30 minutes | Full comparison between old and new versions |
| 5 | 100% | -- | Full rollout complete, old version scaled down |
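The schedule above can be sketched as a small driver loop. This is a minimal illustration, not a real controller: `Step`, `run_canary`, `set_weight`, and `analyze` are hypothetical names, and the actual soak wait is assumed to happen inside `analyze`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    weight: int        # percent of traffic routed to the canary
    soak_minutes: int  # how long metrics are evaluated at this weight


# Schedule taken from the table above; the final 100% step has no soak time.
SCHEDULE: List[Step] = [Step(1, 5), Step(5, 10), Step(25, 15), Step(50, 30), Step(100, 0)]


def run_canary(
    schedule: List[Step],
    set_weight: Callable[[int], None],
    analyze: Callable[[Step], str],
) -> str:
    """Walk the schedule and return 'complete', 'paused', or 'rolled_back'.

    set_weight and analyze are hypothetical callbacks: the first applies a
    traffic split, the second soaks and returns 'proceed', 'pause', or 'rollback'.
    """
    for step in schedule:
        set_weight(step.weight)
        if step.soak_minutes == 0:
            continue  # nothing to analyze at the final step
        verdict = analyze(step)
        if verdict == "rollback":
            set_weight(0)  # send all traffic back to the stable version
            return "rolled_back"
        if verdict == "pause":
            return "paused"  # hold at the current weight for a human decision
    return "complete"
```

Passing the traffic-shifting and analysis functions in as callbacks keeps the loop independent of any particular mesh or metrics backend.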
Metric Monitoring
The following metrics should be compared between canary and baseline at each step. Use a statistical comparison (not just raw values) to account for variance.
| Metric | Threshold | Action on Breach |
|---|---|---|
| Error rate (5xx) | >1 percentage point above baseline | Automatic rollback |
| Latency p99 | >20% increase over baseline | Automatic rollback |
| Latency p50 | >10% increase over baseline | Pause and alert |
| CPU / Memory | >30% increase over baseline | Pause and alert |
| Business metrics (conversion, throughput) | >5% decrease from baseline | Pause and alert |
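The threshold table can be expressed as a single evaluation function. The sketch below makes several assumptions: the metric keys (`error_rate`, `latency_p99`, `latency_p50`, `cpu`, `conversion`) are hypothetical, the error-rate threshold is read as an absolute difference in percentage points, the other thresholds are read as relative changes, and all baseline values are non-zero. It compares raw values only; a production analysis would add the statistical comparison described above.

```python
from typing import Dict

ROLLBACK, PAUSE, PROCEED = "rollback", "pause", "proceed"


def relative_change(canary: float, baseline: float) -> float:
    """Fractional change of canary versus baseline (baseline assumed non-zero)."""
    return (canary - baseline) / baseline


def evaluate(canary: Dict[str, float], baseline: Dict[str, float]) -> str:
    """Return the most severe action triggered by the thresholds in the table."""
    # Rollback-level breaches are checked first so they take priority.
    if canary["error_rate"] - baseline["error_rate"] > 0.01:
        return ROLLBACK
    if relative_change(canary["latency_p99"], baseline["latency_p99"]) > 0.20:
        return ROLLBACK
    # Pause-and-alert breaches.
    if relative_change(canary["latency_p50"], baseline["latency_p50"]) > 0.10:
        return PAUSE
    if relative_change(canary["cpu"], baseline["cpu"]) > 0.30:
        return PAUSE
    if relative_change(canary["conversion"], baseline["conversion"]) < -0.05:
        return PAUSE
    return PROCEED
```

For example, a canary whose p99 latency is 130 ms against a 100 ms baseline yields `"rollback"`, while one whose only regression is a 15% rise in p50 latency yields `"pause"`.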
Automated Rollback Triggers
Automated canary analysis compares canary metrics against the baseline population. When a rollback threshold is breached, the system rolls back automatically, without human intervention; lesser breaches pause the rollout and alert an operator, as listed in the table above.
- Immediate rollback: The 5xx error rate exceeds an absolute threshold (e.g., >5%) at any step, or a health check fails consistently.
- Statistical rollback: Canary metrics are statistically worse than baseline with high confidence (p-value < 0.05) after the soak period.
- Timeout rollback: If the canary analysis does not reach a "pass" verdict within the maximum allowed time, roll back as a safety measure.
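The three triggers above can be sketched for the error-rate metric as follows. The statistical check here is a one-sided two-proportion z-test, which is one common choice and not necessarily what a given canary analysis tool uses. The function names and parameters are hypothetical, and health-check failures are left out for brevity.

```python
import math
from typing import Optional


def error_rate_p_value(c_err: int, c_total: int, b_err: int, b_total: int) -> float:
    """One-sided p-value for the hypothesis that the canary error rate is higher."""
    pooled = (c_err + b_err) / (c_total + b_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / c_total + 1 / b_total))
    if se == 0:
        return 1.0  # identical all-success or all-failure groups: no evidence of a difference
    z = (c_err / c_total - b_err / b_total) / se
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def rollback_reason(
    c_err: int, c_total: int, b_err: int, b_total: int,
    soak_done: bool, elapsed_min: float, max_min: float,
) -> Optional[str]:
    """Return which trigger fired, or None if the canary may continue."""
    if c_err / c_total > 0.05:  # absolute threshold, checked at any time
        return "immediate"
    if soak_done and error_rate_p_value(c_err, c_total, b_err, b_total) < 0.05:
        return "statistical"
    if elapsed_min > max_min:  # no "pass" verdict within the allowed time
        return "timeout"
    return None
```

With 30 errors in 1,000 canary requests against 100 errors in 10,000 baseline requests, the canary's 3% rate stays under the absolute limit but is significantly worse than the 1% baseline, so the statistical trigger fires once the soak period ends.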
Kubernetes Canary: Istio and Argo Rollouts
Istio VirtualService
Istio provides fine-grained traffic splitting via VirtualService weight configuration. You define two subsets (stable and canary) in a DestinationRule, then set traffic weights in the VirtualService. Adjusting the weights shifts traffic between versions.
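As an illustration, the two resources can be generated programmatically and applied as JSON, which `kubectl` accepts alongside YAML. This is a sketch under assumptions: the service name `my-service`, the `version: stable` and `version: canary` pod labels, and the helper function names are all hypothetical, and the `apiVersion` may need adjusting for the installed Istio release.

```python
import json
from typing import Any, Dict


def destination_rule(host: str) -> Dict[str, Any]:
    """DestinationRule defining the stable and canary subsets by pod label."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": host},
        "spec": {
            "host": host,
            "subsets": [
                {"name": "stable", "labels": {"version": "stable"}},  # assumed label
                {"name": "canary", "labels": {"version": "canary"}},  # assumed label
            ],
        },
    }


def virtual_service(host: str, canary_weight: int) -> Dict[str, Any]:
    """VirtualService splitting traffic between the two subsets."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host},
        "spec": {
            "hosts": [host],
            "http": [{
                "route": [
                    {"destination": {"host": host, "subset": "stable"},
                     "weight": 100 - canary_weight},
                    {"destination": {"host": host, "subset": "canary"},
                     "weight": canary_weight},
                ],
            }],
        },
    }


if __name__ == "__main__":
    # Print a 5% canary split for the hypothetical service.
    print(json.dumps(virtual_service("my-service", 5), indent=2))
```

Deriving the stable weight as `100 - canary_weight` guarantees that the two route weights always sum to 100.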
Argo Rollouts
Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that natively supports canary steps. The steps list alternates setWeight entries, which set the canary traffic percentage, with pause entries, which hold for a fixed duration or until manual promotion. Argo integrates with Prometheus, Datadog, and other metric providers for automated analysis.
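To illustrate, the schedule from the traffic-splitting table can be translated into the `steps` list of a Rollout's canary strategy. The helper names below are hypothetical, and the result is only the strategy fragment: a complete Rollout also needs `replicas`, `selector`, and a pod `template`, which are omitted here.

```python
from typing import Any, Dict, List, Tuple

# (canary weight in percent, soak time in minutes), from the schedule table.
SCHEDULE: List[Tuple[int, int]] = [(1, 5), (5, 10), (25, 15), (50, 30), (100, 0)]


def canary_steps(schedule: List[Tuple[int, int]]) -> List[Dict[str, Any]]:
    """Build alternating setWeight / pause steps for a Rollout canary strategy."""
    steps: List[Dict[str, Any]] = []
    for weight, soak_minutes in schedule:
        steps.append({"setWeight": weight})
        if soak_minutes:
            steps.append({"pause": {"duration": f"{soak_minutes}m"}})
    return steps


def rollout_strategy(schedule: List[Tuple[int, int]]) -> Dict[str, Any]:
    """Fragment of a Rollout manifest; replicas, selector, and template are omitted."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Rollout",
        "spec": {"strategy": {"canary": {"steps": canary_steps(schedule)}}},
    }
```

Generating the steps from one schedule definition keeps the Rollout manifest consistent with the documented soak times.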
Progressive Delivery Timeline
End-to-end timeline for a typical canary release, from deployment to full rollout. Each promotion time is the cumulative sum of the preceding soak times in the schedule above.
| Time | Action | Canary % | Status |
|---|---|---|---|
| T+0 | Deploy canary pod(s) | 1% | Analyzing |
| T+5m | Pass step 1, increase traffic | 5% | Analyzing |
| T+15m | Pass step 2, increase traffic | 25% | Analyzing |
| T+30m | Pass step 3, increase traffic | 50% | Analyzing |
| T+60m | Pass step 4, full rollout | 100% | Complete |
| T+65m | Scale down old ReplicaSet | -- | Finalized |
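The promotion times in this table follow directly from the soak times, which a few lines of arithmetic can confirm. The `timeline` helper is a hypothetical name. The final T+65m scale-down row is a separate delay after full rollout and is not derived from the schedule.

```python
from itertools import accumulate
from typing import List, Tuple

# (canary weight in percent, soak time in minutes), from the schedule table.
SCHEDULE: List[Tuple[int, int]] = [(1, 5), (5, 10), (25, 15), (50, 30), (100, 0)]


def timeline(schedule: List[Tuple[int, int]]) -> List[Tuple[str, int]]:
    """Return (time label, canary weight) pairs; each step starts when prior soaks end."""
    soaks = [soak for _, soak in schedule]
    # Step i starts after the soak times of steps 0..i-1 have elapsed.
    starts = [0] + list(accumulate(soaks))[:-1]
    return [
        (f"T+{start}m" if start else "T+0", weight)
        for start, (weight, _) in zip(starts, schedule)
    ]


if __name__ == "__main__":
    for label, weight in timeline(SCHEDULE):
        print(f"{label:>6}  {weight}%")
```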