Zero-Downtime Deployment Guide
Achieving zero downtime during deployments requires coordinating database migrations, connection management, health checks, and application startup. This guide covers each component with implementation details.
Database Migration Patterns
Database schema changes are the primary source of downtime during deployments. The following patterns allow schema evolution without breaking running application instances.
Expand-Contract (Parallel Change)
This is the safest pattern for most schema changes. It works in three phases:
- Expand: Add new column/table. Both old and new code work. Deploy migration independently of application code.
- Migrate: Backfill data from old structure to new. Deploy application code that writes to both old and new.
- Contract: Remove old column/table after all application instances use the new structure.
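The three phases above can be sketched against an in-memory SQLite database standing in for the production store. The `users` table and its columns are hypothetical, and the rename from `name` to `full_name` is just an illustration:

```python
# Expand-contract sketch for renaming users.name to users.full_name.
# Table and column names are hypothetical; sqlite3 stands in for the
# production database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada'), ('Grace')")

# Phase 1 -- Expand: add the new column; old code keeps using `name`.
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Phase 2 -- Migrate: backfill existing rows; new code writes both columns.
db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Phase 3 -- Contract: run only after every instance reads `full_name`.
# db.execute("ALTER TABLE users DROP COLUMN name")  # needs SQLite >= 3.35

rows = db.execute("SELECT name, full_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada', 'Ada'), ('Grace', 'Grace')]
```

Note that each phase is a separate deploy: the contract step is commented out because it must wait until no running instance references the old column.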
Shadow Columns
Add a new column alongside the existing one. Write to both columns simultaneously during the transition period. Read from the new column in new code, old column in old code. Remove the old column after full rollout. Useful for column renames or type changes.
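A minimal sketch of the dual-write path during the transition period, again with hypothetical names and sqlite3 as a stand-in. The key property is that one write function keeps both columns in sync, so old readers and new readers see consistent data:

```python
# Shadow-column dual write: during the transition, every write touches both
# the old `name` column and its shadow `full_name`. Names are hypothetical;
# sqlite3 stands in for the production database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)

def save_user(conn, user_id, value):
    """Transition-period write path: keep old and new columns in sync."""
    conn.execute(
        "INSERT INTO users (id, name, full_name) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
        "full_name = excluded.full_name",
        (user_id, value, value),
    )

save_user(db, 1, "Ada Lovelace")
save_user(db, 1, "Ada King")  # an update still keeps both columns identical
row = db.execute("SELECT name, full_name FROM users WHERE id = 1").fetchone()
print(row)  # ('Ada King', 'Ada King')
```

Once the rollout is complete, `save_user` drops the old column from its statement and the old column can be removed in a contract migration.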
Avoid These During Zero-Downtime Deployments
- Renaming columns: breaks old code that still reads or writes the old name.
- Dropping columns: breaks old code that still references the column.
- Adding NOT NULL columns without a default: fails outright on populated tables, and adding the column with a default in the same statement can rewrite and lock the table in some databases.
- Changing column types in place: may require a full table rewrite under an exclusive lock.
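The NOT NULL case has a safe phased alternative: add the column as nullable, backfill, then enforce the constraint. A sketch of the phased statements, assuming Postgres syntax and a hypothetical `orders.region` column (each phase is a separate deploy):

```python
# Safe alternative to a blocking NOT NULL addition, as phased statements.
# Postgres syntax and the orders/region names are illustrative assumptions.
PHASES = [
    # Phase 1: add the column as nullable -- a metadata-only change.
    "ALTER TABLE orders ADD COLUMN region TEXT",
    # Phase 2: backfill in bounded batches so no statement holds a long lock
    # (repeat with advancing id ranges until no NULLs remain).
    "UPDATE orders SET region = 'unknown' "
    "WHERE region IS NULL AND id BETWEEN 1 AND 10000",
    # Phase 3: enforce the constraint once every row is populated. On very
    # large tables, a CHECK ... NOT VALID constraint validated separately
    # avoids holding a heavy lock during the scan.
    "ALTER TABLE orders ALTER COLUMN region SET NOT NULL",
]

for sql in PHASES:
    print(sql)
```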
Connection Draining
When removing an instance from the load balancer during a rolling update, in-flight connections must complete before the instance is terminated.
| Component | Configuration | Recommended Value |
|---|---|---|
| AWS ALB | Deregistration delay (target group attribute) | 30-300 seconds (default 300, lower for fast APIs) |
| Kubernetes | terminationGracePeriodSeconds + preStop hook | 30-60 seconds grace period, preStop sleep 5-10s for endpoint propagation |
| Nginx | upstream health checks + graceful reload | worker_shutdown_timeout 30s |
| Envoy / Istio | terminationDrainDuration (ProxyConfig, set via meshConfig.defaultConfig or a proxy.istio.io/config annotation) | 25-45 seconds (must be less than terminationGracePeriodSeconds) |
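Whatever the platform, the application side of draining follows the same contract: stop admitting new work, then wait until in-flight requests reach zero or a grace period expires. A minimal in-process sketch of that logic (class and method names are illustrative):

```python
# Connection-draining sketch: the shutdown path stops admitting work, then
# waits for in-flight requests to finish within a grace period -- the same
# window the load-balancer settings above give an instance before it is
# terminated.
import threading
import time

class InFlightTracker:
    def __init__(self):
        self._count = 0
        self._draining = False
        self._cond = threading.Condition()

    def admit(self):
        """Returns False once draining has begun (new work is rejected)."""
        with self._cond:
            if self._draining:
                return False
            self._count += 1
            return True

    def done(self):
        """Marks one in-flight request as complete."""
        with self._cond:
            self._count -= 1
            self._cond.notify_all()

    def drain(self, grace_seconds):
        """Blocks until in-flight work finishes; False if the grace expires."""
        deadline = time.monotonic() + grace_seconds
        with self._cond:
            self._draining = True
            while self._count > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True

tracker = InFlightTracker()
tracker.admit()                              # one request in flight
threading.Timer(0.1, tracker.done).start()   # it completes shortly
drained = tracker.drain(grace_seconds=5)
print(drained)  # True: the request finished within the grace period
```

In practice `drain` is called from the SIGTERM handler (or a Kubernetes preStop hook triggers SIGTERM after its sleep), and the grace period must be shorter than the platform's hard-kill timeout.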
Health Check Endpoints
Design health check endpoints to give accurate signals about application readiness and liveness.
| Endpoint | Checks | Used By | Failure Means |
|---|---|---|---|
| /healthz | Process is alive, basic memory check | Liveness probe | Pod is restarted |
| /ready | Database connection pool, cache connectivity, downstream service reachable | Readiness probe, load balancer | Traffic removed from pod until it recovers |
| /startup | Application initialization complete (schema loaded, caches warmed) | Startup probe | Liveness and readiness probes are delayed until startup succeeds |
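The core of a `/ready` handler is just running each named dependency check and mapping any failure to 503 so the load balancer pulls the pod. A sketch with stand-in check functions (real implementations would ping the database pool, cache, and downstream services):

```python
# /ready handler core: run named dependency checks, return 503 on failure.
# The check callables below are stand-ins for real connectivity pings.
def run_checks(checks):
    """Returns (http_status, list_of_failed_check_names)."""
    failed = [name for name, check in checks.items() if not _passes(check)]
    return (200, failed) if not failed else (503, failed)

def _passes(check):
    try:
        check()
        return True
    except Exception:
        return False

def failing_cache_ping():
    raise ConnectionError("redis down")  # simulated cache outage

status, failed = run_checks({
    "database": lambda: None,       # pretend the pool check passes
    "cache": failing_cache_ping,
})
print(status, failed)  # 503 ['cache']
```

Returning the failed check names in the response body makes a 503 immediately diagnosable from the load balancer's logs.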
Readiness vs Liveness Probes
A common mistake is making liveness and readiness probes check the same thing. They serve different purposes and should have different implementations.
| Aspect | Readiness Probe | Liveness Probe |
|---|---|---|
| Question answered | Can this pod handle requests right now? | Is this pod stuck and needs a restart? |
| Check dependencies? | Yes -- database, cache, downstream services | No -- only check if the process is responsive |
| Failure action | Remove from Service endpoints (stop traffic) | Kill and restart the pod |
| Danger of bad config | Too strict: healthy pods removed unnecessarily | Too strict: restart loops kill the entire deployment |
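A common way to implement a dependency-free liveness check is a heartbeat: the main work loop bumps a timestamp each iteration, and `/healthz` only verifies the bump is recent. Because no external dependency is touched, a database outage can never trigger a restart loop. The class and threshold here are illustrative:

```python
# Liveness heartbeat sketch: the work loop calls beat(); /healthz returns
# 200 iff alive() is True. No dependency checks -- a stuck process is the
# only thing that can fail this probe.
import threading
import time

class Heartbeat:
    def __init__(self, stale_after_seconds=10.0):
        self._stale_after = stale_after_seconds
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self):
        """Called by the main work loop on each iteration."""
        with self._lock:
            self._last_beat = time.monotonic()

    def alive(self):
        """True while the last beat is within the staleness threshold."""
        with self._lock:
            return (time.monotonic() - self._last_beat) < self._stale_after

hb = Heartbeat(stale_after_seconds=0.2)
print(hb.alive())   # True: freshly constructed
time.sleep(0.3)
print(hb.alive())   # False: the work loop has stalled past the threshold
hb.beat()
print(hb.alive())   # True again once the loop recovers
```

The staleness threshold should comfortably exceed the slowest legitimate loop iteration, otherwise the probe becomes the "too strict" liveness config the table warns about.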