Zero-Downtime Deployment Guide
Achieving zero downtime during deployments requires coordinating database migrations, connection management, health checks, and application startup. This guide covers each component with implementation details.
Database Migration Patterns
Database schema changes are the primary source of downtime during deployments. The following patterns allow schema evolution without breaking running application instances.
Expand-Contract (Parallel Change)
This is the safest pattern for most schema changes. It works in three phases:
- Expand: Add new column/table. Both old and new code work. Deploy migration independently of application code.
- Migrate: Backfill data from old structure to new. Deploy application code that writes to both old and new.
- Contract: Remove old column/table after all application instances use the new structure.
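The three phases above can be sketched against an in-memory SQLite database standing in for the production store. The `users` table and its columns are hypothetical, and the rename from `name` to `full_name` is just an illustration:

```python
# Expand-contract sketch for renaming users.name to users.full_name.
# Table and column names are hypothetical; sqlite3 stands in for the
# production database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO users (name) VALUES ('Ada'), ('Grace')")

# Phase 1 -- Expand: add the new column; old code keeps using `name`.
db.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Phase 2 -- Migrate: backfill existing rows; new code writes both columns.
db.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")

# Phase 3 -- Contract: run only after every instance reads `full_name`.
# db.execute("ALTER TABLE users DROP COLUMN name")  # needs SQLite >= 3.35

rows = db.execute("SELECT name, full_name FROM users ORDER BY id").fetchall()
print(rows)  # [('Ada', 'Ada'), ('Grace', 'Grace')]
```

Note that each phase is a separate deploy: the contract step is commented out because it must wait until no running instance references the old column.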
Shadow Columns
Add a new column alongside the existing one. Write to both columns simultaneously during the transition period. Read from the new column in new code, old column in old code. Remove the old column after full rollout. Useful for column renames or type changes.
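A minimal sketch of the dual-write path during the transition period, again with hypothetical names and sqlite3 as a stand-in. The key property is that one write function keeps both columns in sync, so old readers and new readers see consistent data:

```python
# Shadow-column dual write: during the transition, every write touches both
# the old `name` column and its shadow `full_name`. Names are hypothetical;
# sqlite3 stands in for the production database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, full_name TEXT)"
)

def save_user(conn, user_id, value):
    """Transition-period write path: keep old and new columns in sync."""
    conn.execute(
        "INSERT INTO users (id, name, full_name) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
        "full_name = excluded.full_name",
        (user_id, value, value),
    )

save_user(db, 1, "Ada Lovelace")
save_user(db, 1, "Ada King")  # an update still keeps both columns identical
row = db.execute("SELECT name, full_name FROM users WHERE id = 1").fetchone()
print(row)  # ('Ada King', 'Ada King')
```

Once the rollout is complete, `save_user` drops the old column from its statement and the old column can be removed in a contract migration.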
Avoid These During Zero-Downtime Deployments
- Renaming columns: breaks old code that still reads or writes the old name.
- Dropping columns: breaks old code that still references the column.
- Adding NOT NULL columns without a default: fails outright on populated tables, and adding the column with a default in the same statement can rewrite and lock the table in some databases.
- Changing column types in place: may require a full table rewrite under an exclusive lock.
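The NOT NULL case has a safe phased alternative: add the column as nullable, backfill, then enforce the constraint. A sketch of the phased statements, assuming Postgres syntax and a hypothetical `orders.region` column (each phase is a separate deploy):

```python
# Safe alternative to a blocking NOT NULL addition, as phased statements.
# Postgres syntax and the orders/region names are illustrative assumptions.
PHASES = [
    # Phase 1: add the column as nullable -- a metadata-only change.
    "ALTER TABLE orders ADD COLUMN region TEXT",
    # Phase 2: backfill in bounded batches so no statement holds a long lock
    # (repeat with advancing id ranges until no NULLs remain).
    "UPDATE orders SET region = 'unknown' "
    "WHERE region IS NULL AND id BETWEEN 1 AND 10000",
    # Phase 3: enforce the constraint once every row is populated. On very
    # large tables, a CHECK ... NOT VALID constraint validated separately
    # avoids holding a heavy lock during the scan.
    "ALTER TABLE orders ALTER COLUMN region SET NOT NULL",
]

for sql in PHASES:
    print(sql)
```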
Connection Draining
When removing an instance from the load balancer during a rolling update, in-flight connections must complete before the instance is terminated.
| Component | Configuration | Recommended Value |
|---|---|---|
| AWS ALB | Deregistration delay (target group attribute) | 30-300 seconds (default 300, lower for fast APIs) |
| Kubernetes | terminationGracePeriodSeconds + preStop hook | 30-60 seconds grace period, preStop sleep 5-10s for endpoint propagation |
| Nginx | upstream health checks + graceful reload | worker_shutdown_timeout 30s |
| Envoy / Istio | terminationDrainDuration (ProxyConfig, set via meshConfig.defaultConfig or a proxy.istio.io/config annotation) | 25-45 seconds (must be less than terminationGracePeriodSeconds) |
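Whatever the platform, the application side of draining follows the same contract: stop admitting new work, then wait until in-flight requests reach zero or a grace period expires. A minimal in-process sketch of that logic (class and method names are illustrative):

```python
# Connection-draining sketch: the shutdown path stops admitting work, then
# waits for in-flight requests to finish within a grace period -- the same
# window the load-balancer settings above give an instance before it is
# terminated.
import threading
import time

class InFlightTracker:
    def __init__(self):
        self._count = 0
        self._draining = False
        self._cond = threading.Condition()

    def admit(self):
        """Returns False once draining has begun (new work is rejected)."""
        with self._cond:
            if self._draining:
                return False
            self._count += 1
            return True

    def done(self):
        """Marks one in-flight request as complete."""
        with self._cond:
            self._count -= 1
            self._cond.notify_all()

    def drain(self, grace_seconds):
        """Blocks until in-flight work finishes; False if the grace expires."""
        deadline = time.monotonic() + grace_seconds
        with self._cond:
            self._draining = True
            while self._count > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)
            return True

tracker = InFlightTracker()
tracker.admit()                              # one request in flight
threading.Timer(0.1, tracker.done).start()   # it completes shortly
drained = tracker.drain(grace_seconds=5)
print(drained)  # True: the request finished within the grace period
```

In practice `drain` is called from the SIGTERM handler (or a Kubernetes preStop hook triggers SIGTERM after its sleep), and the grace period must be shorter than the platform's hard-kill timeout.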
Health Check Endpoints
Design health check endpoints to give accurate signals about application readiness and liveness.
| Endpoint | Checks | Used By | Failure Means |
|---|---|---|---|
| /healthz | Process is alive, basic memory check | Liveness probe | Pod is restarted |
| /ready | Database connection pool, cache connectivity, downstream service reachable | Readiness probe, load balancer | Traffic removed from pod until it recovers |
| /startup | Application initialization complete (schema loaded, caches warmed) | Startup probe | Liveness and readiness probes are delayed until startup succeeds |
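The core of a `/ready` handler is just running each named dependency check and mapping any failure to 503 so the load balancer pulls the pod. A sketch with stand-in check functions (real implementations would ping the database pool, cache, and downstream services):

```python
# /ready handler core: run named dependency checks, return 503 on failure.
# The check callables below are stand-ins for real connectivity pings.
def run_checks(checks):
    """Returns (http_status, list_of_failed_check_names)."""
    failed = [name for name, check in checks.items() if not _passes(check)]
    return (200, failed) if not failed else (503, failed)

def _passes(check):
    try:
        check()
        return True
    except Exception:
        return False

def failing_cache_ping():
    raise ConnectionError("redis down")  # simulated cache outage

status, failed = run_checks({
    "database": lambda: None,       # pretend the pool check passes
    "cache": failing_cache_ping,
})
print(status, failed)  # 503 ['cache']
```

Returning the failed check names in the response body makes a 503 immediately diagnosable from the load balancer's logs.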
Readiness vs Liveness Probes
A common mistake is making liveness and readiness probes check the same thing. They serve different purposes and should have different implementations.
| Aspect | Readiness Probe | Liveness Probe |
|---|---|---|
| Question answered | Can this pod handle requests right now? | Is this pod stuck and needs a restart? |
| Check dependencies? | Yes -- database, cache, downstream services | No -- only check if the process is responsive |
| Failure action | Remove from Service endpoints (stop traffic) | Kill and restart the pod |
| Danger of bad config | Too strict: healthy pods removed unnecessarily | Too strict: restart loops kill the entire deployment |
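A common way to implement a dependency-free liveness check is a heartbeat: the main work loop bumps a timestamp each iteration, and `/healthz` only verifies the bump is recent. Because no external dependency is touched, a database outage can never trigger a restart loop. The class and threshold here are illustrative:

```python
# Liveness heartbeat sketch: the work loop calls beat(); /healthz returns
# 200 iff alive() is True. No dependency checks -- a stuck process is the
# only thing that can fail this probe.
import threading
import time

class Heartbeat:
    def __init__(self, stale_after_seconds=10.0):
        self._stale_after = stale_after_seconds
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self):
        """Called by the main work loop on each iteration."""
        with self._lock:
            self._last_beat = time.monotonic()

    def alive(self):
        """True while the last beat is within the staleness threshold."""
        with self._lock:
            return (time.monotonic() - self._last_beat) < self._stale_after

hb = Heartbeat(stale_after_seconds=0.2)
print(hb.alive())   # True: freshly constructed
time.sleep(0.3)
print(hb.alive())   # False: the work loop has stalled past the threshold
hb.beat()
print(hb.alive())   # True again once the loop recovers
```

The staleness threshold should comfortably exceed the slowest legitimate loop iteration, otherwise the probe becomes the "too strict" liveness config the table warns about.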