Rollback Strategies
Every deployment needs a rollback plan. The right rollback strategy depends on your deployment method, database state, and the severity of the issue. This guide covers the major rollback approaches and when to use each.
Immediate Rollback (Kubernetes)
Kubernetes retains previous ReplicaSets, up to the number set by revisionHistoryLimit (10 by default). Rolling back is a single command, kubectl rollout undo, that switches the Deployment back to the previous ReplicaSet's pod template.
How it works
The rollback command updates the Deployment's pod template spec to match a previous revision. Kubernetes then performs a rolling update from the current (bad) version to the previous (known-good) version using the same maxSurge/maxUnavailable settings.
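As a minimal sketch of how the rollback could be scripted, the helper below builds and runs the `kubectl rollout undo` and `kubectl rollout status` commands. The function names, the `default` namespace, and the use of `subprocess` are assumptions for illustration, not part of any particular pipeline.

```python
import subprocess
from typing import List, Optional


def rollback_command(deployment: str, namespace: str = "default",
                     to_revision: Optional[int] = None) -> List[str]:
    """Build the kubectl argument list that rolls a Deployment back."""
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment}",
           "--namespace", namespace]
    if to_revision is not None:
        # Without --to-revision, kubectl targets the previous revision.
        cmd.append(f"--to-revision={to_revision}")
    return cmd


def status_command(deployment: str, namespace: str = "default") -> List[str]:
    """Build the command that waits for the rolling update to finish."""
    return ["kubectl", "rollout", "status", f"deployment/{deployment}",
            "--namespace", namespace]


def roll_back(deployment: str, namespace: str = "default",
              to_revision: Optional[int] = None) -> None:
    """Trigger the rollback, then block until the rollout completes."""
    subprocess.run(rollback_command(deployment, namespace, to_revision), check=True)
    subprocess.run(status_command(deployment, namespace), check=True)
```

Calling `roll_back("web", namespace="prod")` would require `kubectl` on the PATH and credentials for the target cluster; the deployment name `web` is hypothetical.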
Rollback time
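Rollback time depends on replica count, rolling update settings, and how long each pod takes to start and pass its readiness probe. For a 10-replica deployment with default settings (maxSurge and maxUnavailable both 25%), expect roughly 2-5 minutes for a full rollback. If speed is critical, temporarily raise maxUnavailable, accepting reduced serving capacity while the rollback runs.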
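The effect of the rolling update settings can be approximated with a back-of-the-envelope model. The sketch below assumes that each batch replaces at most maxSurge + maxUnavailable pods and that every batch takes the same time to become ready (`seconds_per_batch` is a hypothetical input you would measure yourself). Real rollouts overlap batches, so treat the result as a rough estimate only.

```python
import math
from typing import Union


def resolve(value: Union[int, str], replicas: int, round_up: bool) -> int:
    """Convert an absolute or percentage setting into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def estimate_rollback_seconds(replicas: int, seconds_per_batch: int,
                              max_surge: Union[int, str] = "25%",
                              max_unavailable: Union[int, str] = "25%") -> int:
    """Rough rollback duration under the simplified batch model."""
    surge = resolve(max_surge, replicas, round_up=True)  # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # maxUnavailable rounds down
    batch = max(1, surge + unavailable)
    batches = math.ceil(replicas / batch)
    return batches * seconds_per_batch
```

Under this model, a 10-replica deployment with default settings and 90 seconds per batch needs two batches (180 seconds); raising maxUnavailable to 75% brings it down to a single batch.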
Database Rollback Considerations
Application rollbacks are straightforward. Database rollbacks are not. Data written by the new version may not be compatible with the old version's schema expectations.
| Scenario | Rollback Approach | Risk |
|---|---|---|
| Additive-only migration (new column added) | Roll back application only. Leave new column in place. Old code ignores it. | Low |
| Column removed or renamed | Cannot roll back the application without restoring the column. Requires a reverse migration; data dropped with the column must be restored from backup. | High |
| Data format changed (e.g., JSON structure) | Old code may fail to parse new data. Need data transformation or dual-format support. | High |
| New data written by new version | Old code must handle unexpected data gracefully (unknown enum values, extra fields, new relationships). | Medium |
Key principle: Always design database migrations to be backward-compatible. If the old version of your application cannot function with the new schema, you do not have a safe rollback path.
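To illustrate what "handle unexpected data gracefully" can look like in the old version's code, here is a minimal sketch of a tolerant reader. The `Order` model, the `OrderStatus` enum, and the column names are hypothetical; the point is the pattern of ignoring unknown fields and mapping unknown enum values to a fallback.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class OrderStatus(Enum):
    PENDING = "pending"
    SHIPPED = "shipped"
    UNKNOWN = "unknown"  # fallback for values written by a newer version


@dataclass
class Order:
    id: int
    status: OrderStatus


def parse_order(row: Dict[str, Any]) -> Order:
    """Build an Order from a database row, tolerating newer data."""
    try:
        status = OrderStatus(row["status"])
    except ValueError:
        # A newer release wrote a status this version does not know about.
        status = OrderStatus.UNKNOWN
    # Only known columns are read; extra columns in the row are ignored.
    return Order(id=row["id"], status=status)
```

A row such as `{"id": 1, "status": "refund_requested", "loyalty_tier": "gold"}` parses without error: the extra column is ignored and the unrecognized status becomes `OrderStatus.UNKNOWN`.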
Feature Flag Rollback
Feature flags provide the fastest rollback mechanism. There is no deployment, no pod replacement, and no load balancer change: you just toggle a flag.
- Speed: Propagation time depends on the SDK implementation. Streaming SDKs (SSE/WebSocket) typically propagate a change in 1-5 seconds; polling SDKs take up to one poll interval (typically 10-60 seconds).
- Scope: Can disable a specific feature without affecting the rest of the release. More surgical than a full application rollback.
- Limitation: Only works if the problematic code is behind a flag. Infrastructure issues, dependency failures, and performance regressions in unflagged code cannot be rolled back this way.
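The kill-switch pattern described above can be sketched as follows. `FlagStore` is a hypothetical in-memory stand-in for a real flag SDK client, and the flag name `new-pricing-engine` and both pricing functions are invented for illustration.

```python
from typing import Dict, List, Optional


class FlagStore:
    """In-memory stand-in for a feature flag SDK client (hypothetical)."""

    def __init__(self, flags: Optional[Dict[str, bool]] = None) -> None:
        self._flags = dict(flags or {})

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Defaulting to False means a missing flag falls back to the old path.
        return self._flags.get(name, default)


def legacy_total(prices_cents: List[int]) -> int:
    return sum(prices_cents)


def new_total(prices_cents: List[int]) -> int:
    # Hypothetical new behavior: apply a 10% discount.
    return sum(prices_cents) * 90 // 100


def checkout_total(prices_cents: List[int], flags: FlagStore) -> int:
    """Route to the new code path only while the flag is on."""
    if flags.is_enabled("new-pricing-engine"):
        return new_total(prices_cents)
    return legacy_total(prices_cents)
```

Rolling back is then `flags.set("new-pricing-engine", False)`: the next call to `checkout_total` takes the legacy path without any redeploy. Keeping the legacy path intact until the flag is retired is what makes this rollback possible.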
Blue-Green Instant Switch
In a blue-green deployment, the previous environment remains running during the stabilization window. Rollback is a load balancer switch back to the previous environment.
| Mechanism | Rollback Time | Consideration |
|---|---|---|
| ALB target group swap | <5 seconds | Previous environment must still be running and healthy |
| DNS weighted routing | 60-300 seconds (TTL dependent) | Clients may cache DNS; set a low TTL before deploying |
| K8s Service selector | <2 seconds | Both Deployments must exist with different labels |
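For the Kubernetes Service selector mechanism, the switch amounts to patching one label in the Service's selector. The sketch below builds a `kubectl patch` command for that; the label key `version`, the values `blue` and `green`, and the helper names are assumptions, so adapt them to whatever labels your two Deployments actually carry.

```python
import json
from typing import List


def selector_patch(version: str) -> str:
    """JSON merge patch that points the Service at one environment."""
    # Assumes pods are labeled version=blue or version=green (hypothetical).
    return json.dumps({"spec": {"selector": {"version": version}}})


def switch_command(service: str, version: str,
                   namespace: str = "default") -> List[str]:
    """Build the kubectl command that flips traffic to `version`."""
    return ["kubectl", "patch", "service", service,
            "--namespace", namespace,
            "--type", "merge",
            "-p", selector_patch(version)]
```

With a merge patch, only the `version` key of the selector changes; other selector keys such as an `app` label are left in place. Passing the result of `switch_command("web", "blue")` to `subprocess.run` would perform the rollback, provided the blue Deployment is still running and healthy.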
Immutable Infrastructure Rollback
In an immutable infrastructure model, servers are never modified after creation. Rollback means deploying the previous machine image or container image.
- Container images: Every build produces a tagged, immutable image. Rollback means updating the Deployment to reference the previous image tag. The image must still exist in the registry.
- AMIs / VM images: Launch new instances from the previous AMI and terminate the current instances. This is slower than a container rollback, because instances take minutes to boot whereas containers start in seconds, but it is equally reliable.
- Image retention policy: Keep at least the 5 most recent images to allow rollback to any recent version. Configure lifecycle policies in ECR, GCR, or ACR accordingly.
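As one possible way to express the retention policy for ECR, the sketch below generates a lifecycle policy document that expires everything beyond the most recent N images. The function name and the default of 5 are assumptions taken from the guidance above; GCR and ACR use different mechanisms that are not shown here.

```python
import json


def ecr_keep_last_n_policy(keep: int = 5) -> str:
    """Return an ECR lifecycle policy that retains the newest `keep` images."""
    policy = {
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Keep the {keep} most recent images",
                "selection": {
                    "tagStatus": "any",
                    # Expire images once the repository holds more than `keep`.
                    "countType": "imageCountMoreThan",
                    "countNumber": keep,
                },
                "action": {"type": "expire"},
            }
        ]
    }
    return json.dumps(policy, indent=2)
```

The resulting JSON could be saved to a file and applied with `aws ecr put-lifecycle-policy`. Consider a larger `keep` value if you ship many builds per day, since five images may then cover only a few hours of history.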
Rollback Decision Matrix
Use this matrix to choose the right rollback approach based on the situation.
| Situation | Recommended Rollback | Expected Recovery Time |
|---|---|---|
| Single feature broken, feature is flagged | Toggle feature flag off | 1-60 seconds |
| Blue-green deployment, within stabilization window | Switch load balancer back | 2-10 seconds |
| Rolling update, no DB migration | Kubernetes rollback to previous revision | 2-5 minutes |
| Rolling update with additive DB migration | Roll back application only, leave DB as-is | 2-5 minutes |
| Destructive DB migration applied | Run reverse migration, then roll back application | 10-60 minutes (depends on data volume) |
| Infrastructure-level failure (bad AMI, misconfigured infra) | Deploy previous infrastructure version via IaC | 5-30 minutes |
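The matrix can also be encoded as a small lookup for use in runbook tooling. This is a sketch only: the `Incident` fields are hypothetical, and the order in which conditions are checked (flag first, then destructive migration, then blue-green, then infrastructure) is an assumption, since the matrix itself treats the situations as separate cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    feature_flagged: bool = False         # broken feature sits behind a flag
    destructive_migration: bool = False   # column dropped/renamed, format changed
    blue_green_window_open: bool = False  # previous environment still running
    infrastructure_failure: bool = False  # bad AMI, misconfigured infra


def recommend_rollback(incident: Incident) -> str:
    """Map an incident to the recommended rollback from the matrix."""
    if incident.feature_flagged:
        return "Toggle feature flag off"
    if incident.destructive_migration:
        # Old code cannot run against the new schema until it is reversed.
        return "Run reverse migration, then roll back application"
    if incident.blue_green_window_open:
        return "Switch load balancer back"
    if incident.infrastructure_failure:
        return "Deploy previous infrastructure version via IaC"
    # Covers rolling updates with no migration or an additive-only migration.
    return "Kubernetes rollback to previous revision"
```

For example, `recommend_rollback(Incident(blue_green_window_open=True))` returns the load balancer switch, while an incident with no flags set falls through to the Kubernetes revision rollback.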