Zero-Downtime Deployment Guide

Achieving zero downtime during deployments requires coordinating database migrations, connection management, health checks, and application startup. This guide covers each component with implementation details.

Database Migration Patterns

Database schema changes are the primary source of downtime during deployments. The following patterns allow schema evolution without breaking running application instances.

Expand-Contract (Parallel Change)

The safest pattern for most schema changes. Works in three phases:

  1. Expand: Add new column/table. Both old and new code work. Deploy migration independently of application code.
  2. Migrate: Deploy application code that writes to both the old and new structures, then backfill existing data from old to new so no write during the transition is missed.
  3. Contract: Remove old column/table after all application instances use the new structure.
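
The three phases above can be sketched as SQL migrations. The table and column names (`users.name`, `users.given_name`) are illustrative, and exact locking behavior varies by database:

```sql
-- Phase 1, Expand: additive only; running application code is unaffected.
ALTER TABLE users ADD COLUMN given_name TEXT;

-- Phase 2, Migrate: the application now writes both columns;
-- backfill existing rows (in batches on large tables).
UPDATE users SET given_name = name WHERE given_name IS NULL;

-- Phase 3, Contract: run only after every instance reads given_name.
ALTER TABLE users DROP COLUMN name;
```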

Shadow Columns

Add a new column alongside the existing one. Write to both columns simultaneously during the transition period. Read from the new column in new code, old column in old code. Remove the old column after full rollout. Useful for column renames or type changes.

Avoid These During Zero-Downtime Deployments

  - Renaming columns: old code still reads the old name and breaks immediately.
  - Dropping columns: old code that selects or writes the column breaks.
  - Adding NOT NULL columns: without a default the statement fails on non-empty tables; with one, some databases rewrite or lock the table on large datasets.
  - Changing column types in place: may require a full table rewrite under an exclusive lock.
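
As an example of a safer alternative, a NOT NULL constraint can be introduced in stages. This sketch assumes PostgreSQL 12+ (where `SET NOT NULL` can reuse a validated CHECK constraint and skip the full-table scan); the `orders.region` column is illustrative:

```sql
ALTER TABLE orders ADD COLUMN region TEXT;      -- instant: nullable, no default
-- Backfill in batches from the application or a background job.
ALTER TABLE orders ADD CONSTRAINT region_not_null
    CHECK (region IS NOT NULL) NOT VALID;       -- no scan, brief lock
ALTER TABLE orders VALIDATE CONSTRAINT region_not_null;  -- scans without blocking writes
ALTER TABLE orders ALTER COLUMN region SET NOT NULL;     -- reuses the validated check
```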

Connection Draining

When removing an instance from the load balancer during a rolling update, in-flight connections must complete before the instance is terminated.

| Component | Configuration | Recommended Value |
| --- | --- | --- |
| AWS ALB | Deregistration delay (target group attribute) | 30-300 seconds (default 300, lower for fast APIs) |
| Kubernetes | terminationGracePeriodSeconds + preStop hook | 30-60 seconds grace period, preStop sleep 5-10s for endpoint propagation |
| Nginx | Upstream health checks + graceful reload | worker_shutdown_timeout 30s |
| Envoy / Istio | drainDuration (proxy config, e.g. meshConfig.defaultConfig) | 25-45 seconds (must be less than terminationGracePeriodSeconds) |
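
For the Kubernetes row, the two settings work together: the preStop sleep holds the container open while endpoint removal propagates, and the grace period bounds the total shutdown time. A sketch of the relevant Deployment fragment (container name, image, and sleep duration are illustrative):

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: example/api:latest
          lifecycle:
            preStop:
              exec:
                # Sleep long enough for endpoint removal to propagate to
                # kube-proxy and ingress before SIGTERM reaches the app.
                command: ["sleep", "8"]
```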

Health Check Endpoints

Design health check endpoints to give accurate signals about application readiness and liveness.

| Endpoint | Checks | Used By | Failure Means |
| --- | --- | --- | --- |
| /healthz | Process is alive, basic memory check | Liveness probe | Pod is restarted |
| /ready | Database connection pool, cache connectivity, downstream service reachable | Readiness probe, load balancer | Traffic removed from pod until it recovers |
| /startup | Application initialization complete (schema loaded, caches warmed) | Startup probe | Liveness and readiness probes are delayed until startup succeeds |
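
A minimal sketch of the three handlers, using only the standard library; `check_db` and `check_cache` are hypothetical dependency probes standing in for real connection-pool and cache pings:

```python
def check_db() -> bool:      # e.g. SELECT 1 on a pooled connection
    return True

def check_cache() -> bool:   # e.g. PING against the cache server
    return True

STARTED = {"done": False}    # flipped once initialization finishes

def healthz() -> tuple[int, dict]:
    # Liveness: no dependency checks, only "is this process responsive?"
    return 200, {"status": "alive"}

def ready() -> tuple[int, dict]:
    # Readiness: report 503 if any dependency is down so traffic is rerouted.
    checks = {"db": check_db(), "cache": check_cache()}
    ok = all(checks.values())
    return (200 if ok else 503), {"status": "ready" if ok else "degraded", **checks}

def startup() -> tuple[int, dict]:
    # Startup: gate liveness/readiness until initialization completes.
    return (200 if STARTED["done"] else 503), {"started": STARTED["done"]}

STARTED["done"] = True       # set at the end of application initialization
print(startup())             # (200, {'started': True})
```

The key design point is visible in the bodies: `healthz` never touches a dependency, so a database outage cannot trigger a restart loop.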

Readiness vs Liveness Probes

A common mistake is making liveness and readiness probes check the same thing. They serve different purposes and should have different implementations.

| Aspect | Readiness Probe | Liveness Probe |
| --- | --- | --- |
| Question answered | Can this pod handle requests right now? | Is this pod stuck and needs a restart? |
| Check dependencies? | Yes: database, cache, downstream services | No: only whether the process is responsive |
| Failure action | Remove from Service endpoints (stop traffic) | Kill and restart the pod |
| Danger of bad config | Too strict: healthy pods removed unnecessarily | Too strict: restart loops kill the entire deployment |
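
An illustrative Kubernetes probe configuration reflecting these differences; the port, paths, and threshold values are assumptions to be tuned per service:

```yaml
livenessProbe:
  httpGet: {path: /healthz, port: 8080}
  periodSeconds: 10
  failureThreshold: 3        # be lenient: a restart is disruptive
readinessProbe:
  httpGet: {path: /ready, port: 8080}
  periodSeconds: 5
  failureThreshold: 2        # fail fast: removing traffic is cheap and reversible
startupProbe:
  httpGet: {path: /startup, port: 8080}
  periodSeconds: 5
  failureThreshold: 30       # allows up to 150s of initialization
```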

Zero-Downtime Deployment Checklist