Zero Downtime Deployment — Rolling, Blue-Green, and Canary
Software deployment is unavoidable, but stopping the service every time you deploy is unacceptable in modern business. Zero Downtime Deployment is the technique of deploying new versions without impacting users. This chapter covers the three core strategies — Rolling Update, Blue-Green, and Canary — in detail, explaining how to implement each and when they're appropriate.
Why Zero Downtime Deployment Is Necessary
The Cost of Service Interruption
There was a time when briefly stopping the service during deployment was considered normal. But in modern environments with 24-hour global services, microservice architectures, and CI/CD pipelines that trigger dozens of deployments per day, service interruption equals loss.
Annual downtime by deployment frequency (assuming 5 minutes per deployment):
- Weekly: 52/year × 5 min = 260 minutes (4.3 hours) downtime
- Daily: 365/year × 5 min = 1,825 minutes (30.4 hours) downtime
- 10x daily: 3,650/year × 5 min = 18,250 minutes (304 hours) downtime
As deployment frequency increases, traditional disruptive deployments become practically impossible.
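The downtime arithmetic above is easy to reproduce; a quick shell check for the 10x-daily case:

```shell
# Annual downtime = deployments per year x minutes per deployment
deploys_per_year=3650   # 10 deployments a day
minutes_per_deploy=5
downtime_min=$((deploys_per_year * minutes_per_deploy))
downtime_hours=$((downtime_min / 60))
echo "${downtime_min} minutes (~${downtime_hours} hours) of downtime per year"
# -> 18250 minutes (~304 hours) of downtime per year
```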
Core Requirements for Zero Downtime Deployment
- All requests must be handled normally during deployment
- Immediate rollback must be possible on deployment failure
- Data integrity must be maintained during deployment
- Users must not be aware that a deployment is in progress
Rolling Update
Rolling Update replaces instances sequentially. Rather than replacing all servers at once, it replaces one or a few at a time to update the entire fleet.
How It Works
Initial: [v1] [v1] [v1] [v1] ← All 4 running v1
Step 1: [v2] [v1] [v1] [v1] ← Server 1 being replaced (briefly removed from traffic)
Traffic restored after replacement completes
Step 2: [v2] [v2] [v1] [v1] ← Server 2 replaced
Step 3: [v2] [v2] [v2] [v1] ← Server 3 replaced
Step 4: [v2] [v2] [v2] [v2] ← Complete
Pros and Cons
Advantages:
- No additional infrastructure cost
- Relatively simple to implement
- Gradual deployment allows early detection of anomalies
Disadvantages:
- v1 and v2 serve simultaneously during deployment (backward compatibility required)
- Full rollback is slow (must replace all servers back to v1)
- Longer deployment time
Nginx Upstream Configuration (Rolling Update)
# /etc/nginx/conf.d/upstream.conf
upstream app_backend {
# max_fails: consecutive failure count, fail_timeout: duration to exclude server
server 192.168.1.20:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.21:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.22:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.23:8080 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
}
Rolling Update Deployment Script
#!/bin/bash
# rolling-deploy.sh
SERVERS=("192.168.1.20" "192.168.1.21" "192.168.1.22" "192.168.1.23")
APP_PORT=8080
NEW_VERSION=$1
# Health checks are issued per server below as http://<server>:$APP_PORT/health
if [ -z "$NEW_VERSION" ]; then
echo "Usage: $0 <version>"
exit 1
fi
deploy_to_server() {
local SERVER=$1
echo "=== Deploying $NEW_VERSION to $SERVER ==="
# 1. Take the server out of rotation: pause the app so Nginx's passive
#    health checks (max_fails/fail_timeout) mark it down and stop sending traffic
echo "Draining $SERVER..."
ssh root@$SERVER "curl -s -X POST http://localhost:$APP_PORT/actuator/pause || true"
sleep 5 # Wait for in-flight requests to drain
# 2. Deploy new version
echo "Deploying new version..."
ssh root@$SERVER "
cd /opt/app
docker pull myapp:$NEW_VERSION
docker stop app || true
docker rm app || true
docker run -d --name app -p $APP_PORT:8080 myapp:$NEW_VERSION
"
# 3. Health check
echo "Waiting for health check..."
for i in {1..30}; do
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"http://${SERVER}:${APP_PORT}/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
echo "$SERVER is healthy (attempt $i)"
return 0
fi
echo "Attempt $i: HTTP $HTTP_CODE, waiting..."
sleep 3
done
echo "Health check failed for $SERVER"
return 1
}
# Sequential deployment
for SERVER in "${SERVERS[@]}"; do
if ! deploy_to_server "$SERVER"; then
echo "DEPLOYMENT FAILED on $SERVER. Manual intervention required."
exit 1
fi
echo "Successfully deployed to $SERVER"
echo "---"
done
echo "Rolling update completed successfully!"
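The health-check retry loop in the script above can be factored into a reusable function. This sketch exercises it against a stub probe instead of a live server; the stub and the shortened sleep are illustrative only:

```shell
# Retry loop from rolling-deploy.sh, generalized: run a probe command
# up to N times and report when it first succeeds.
wait_healthy() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 0.1   # shortened from the script's 3s for the demo
  done
  return 1
}

# Stub probe: fails twice, then succeeds (simulates an app warming up)
tries=0
probe() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }

result=$(wait_healthy 5 probe)
echo "$result"   # healthy after 3 attempt(s)
```

In the real script the probe would be the curl health check, e.g. `wait_healthy 30 curl -sf "http://${SERVER}:${APP_PORT}/health"`.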
Blue-Green Deployment
Blue-Green deployment maintains two identical environments (Blue and Green) and switches traffic all at once. If Blue is currently serving, deploy the new version to Green and switch traffic from Blue to Green when ready.
How It Works
Before deployment:
Client → Nginx(VIP) → Blue(v1) [Active]
→ Green(v1) [Idle]
New version deployed:
Client → Nginx(VIP) → Blue(v1) [Active]
→ Green(v2) [Ready, under testing]
Traffic switch:
Client → Nginx(VIP) → Blue(v1) [Standby, kept for 30 minutes]
→ Green(v2) [Active]
If rollback needed:
Client → Nginx(VIP) → Blue(v1) [Immediately switched to Active]
→ Green(v2) [Standby]
Pros and Cons
Advantages:
- Immediate rollback (just switch traffic)
- Only one version serves during deployment
- New version can be thoroughly tested in the production environment before switching
Disadvantages:
- Twice the infrastructure cost
- DB schema changes must maintain backward compatibility
- Session handling is complex (in-memory sessions on Blue are lost at the switch unless stored externally)
Nginx Blue-Green Switch Script
#!/bin/bash
# blue-green-switch.sh
NGINX_CONF_DIR="/etc/nginx/conf.d"
BLUE_UPSTREAM="upstream_blue.conf"
GREEN_UPSTREAM="upstream_green.conf"
CURRENT_SYMLINK="$NGINX_CONF_DIR/current_upstream.conf"
# Determine current active environment
get_current_env() {
if [ -L "$CURRENT_SYMLINK" ]; then
readlink "$CURRENT_SYMLINK" | grep -oE 'blue|green' # -E for portability (-P needs GNU grep)
else
echo "blue" # default
fi
}
CURRENT=$(get_current_env)
echo "Current active environment: $CURRENT"
if [ "$CURRENT" = "blue" ]; then
NEW_ENV="green"
NEW_CONF="$NGINX_CONF_DIR/$GREEN_UPSTREAM"
else
NEW_ENV="blue"
NEW_CONF="$NGINX_CONF_DIR/$BLUE_UPSTREAM"
fi
echo "Switching to $NEW_ENV environment..."
# Blue environment config
cat > "$NGINX_CONF_DIR/$BLUE_UPSTREAM" << 'EOF'
upstream app_backend {
server 192.168.1.20:8080;
server 192.168.1.21:8080;
keepalive 32;
}
EOF
# Green environment config
cat > "$NGINX_CONF_DIR/$GREEN_UPSTREAM" << 'EOF'
upstream app_backend {
server 192.168.1.30:8080;
server 192.168.1.31:8080;
keepalive 32;
}
EOF
# Atomic symlink swap
ln -sfn "$NEW_CONF" "$CURRENT_SYMLINK"
# Validate and reload Nginx
if nginx -t 2>/dev/null; then
nginx -s reload
echo "Successfully switched to $NEW_ENV environment"
echo "Previous environment ($CURRENT) remains on standby for 30 minutes"
else
# Rollback
ln -sfn "$NGINX_CONF_DIR/upstream_${CURRENT}.conf" "$CURRENT_SYMLINK" # matches the upstream_blue/green.conf naming
echo "ERROR: Nginx config test failed, rolled back to $CURRENT"
exit 1
fi
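The symlink toggle at the heart of the script can be exercised locally; a temporary directory stands in for /etc/nginx/conf.d here:

```shell
# Exercise the symlink-based environment detection from blue-green-switch.sh
dir=$(mktemp -d)   # stand-in for /etc/nginx/conf.d
touch "$dir/upstream_blue.conf" "$dir/upstream_green.conf"

ln -sfn "$dir/upstream_blue.conf" "$dir/current_upstream.conf"
current=$(readlink "$dir/current_upstream.conf" | grep -oE 'blue|green')
echo "active: $current"   # active: blue

# The switch: -f replaces the existing link, -n avoids descending into it,
# so the swap is effectively atomic from Nginx's point of view
ln -sfn "$dir/upstream_green.conf" "$dir/current_upstream.conf"
current=$(readlink "$dir/current_upstream.conf" | grep -oE 'blue|green')
echo "active: $current"   # active: green

rm -rf "$dir"
```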
Nginx Configuration (Blue-Green)
# /etc/nginx/nginx.conf
# Inside the http { } block:
include /etc/nginx/conf.d/current_upstream.conf; # symlink to the active environment
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://app_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Fallback to next server on health check failure
proxy_next_upstream error timeout http_502 http_503 http_504;
proxy_next_upstream_tries 2;
}
}
Canary Deployment
Canary deployment first applies the new version to a small portion of total traffic (e.g., 1-5%) and gradually increases the traffic ratio if no issues arise. The name comes from miners using canary birds to detect toxic gases.
How It Works
Step 1: v1(95%) + v2(5%) — Initial canary deployment
Step 2: v1(75%) + v2(25%) — Increase ratio if no issues
Step 3: v1(50%) + v2(50%) — Half traffic switched
Step 4: v1(0%) + v2(100%) — Full switch complete
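Getting the weights right is the subtle part: Nginx gives each server weight / sum-of-weights of the traffic, so with two stable servers a naive weight=95/weight=5 split hands the canary only 5/195, about 2.6%. One scheme that yields exactly C% (an assumption of this sketch, not an Nginx convention) is canary weight 2*C with each stable server at 100 - C:

```shell
# Share per server = weight / total weight. With two stable servers,
# canary_weight = 2*C and stable_weight = 100 - C give the canary exactly C%.
C=5
canary_weight=$((2 * C))                       # 10
stable_weight=$((100 - C))                     # 95 per stable server
total=$((2 * stable_weight + canary_weight))   # 200
canary_share=$((100 * canary_weight / total))
echo "canary share: ${canary_share}%"          # canary share: 5%
```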
Nginx Canary Configuration (Using weight)
# /etc/nginx/conf.d/canary.conf
upstream app_backend {
# Share per server = weight / sum of all weights.
# Two stable servers at weight=19 (47.5% each) + canary at weight=2 => 2/40 = 5%
server 192.168.1.20:8080 weight=19; # v1 (47.5%)
server 192.168.1.21:8080 weight=19; # v1 (47.5%)
server 192.168.1.30:8080 weight=2;  # v2 canary (5%)
keepalive 32;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
}
# Separate endpoint for canary monitoring
location /canary-status {
proxy_pass http://192.168.1.30:8080/status; # Direct access to v2
}
}
Canary Gradual Transition Script
#!/bin/bash
# canary-deploy.sh
NGINX_CONF="/etc/nginx/conf.d/canary.conf"
CANARY_STEPS=(5 10 25 50 75 100) # Progressive ratio increase
CANARY_SERVER="192.168.1.30:8080"
STABLE_SERVER1="192.168.1.20:8080"
STABLE_SERVER2="192.168.1.21:8080"
ERROR_THRESHOLD=1 # Error rate threshold (%)
WAIT_TIME=300 # Observation time per step (seconds)
check_error_rate() {
# Query the 5xx ratio from Prometheus (example query; adjust metric names).
# jq's // operator falls back to "0" when the query returns no samples.
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])/rate(http_requests_total[5m])*100" \
| jq -r '.data.result[0].value[1] // "0"' 2>/dev/null)
echo "${ERROR_RATE:-0}"
}
update_nginx_weight() {
local CANARY_WEIGHT=$1
# Share per server = weight / sum of weights. With two stable servers,
# canary weight 2*C and stable weight (100 - C) give the canary exactly C%
# (total = 200). Nginx rejects weight=0, so mark a server "down" instead
# when its share would be zero.
local STABLE_PARAMS="weight=$((100 - CANARY_WEIGHT))"
local CANARY_PARAMS="weight=$((2 * CANARY_WEIGHT))"
[ "$CANARY_WEIGHT" -eq 0 ] && CANARY_PARAMS="down"
[ "$CANARY_WEIGHT" -eq 100 ] && STABLE_PARAMS="down"
cat > "$NGINX_CONF" << EOF
upstream app_backend {
server $STABLE_SERVER1 $STABLE_PARAMS;
server $STABLE_SERVER2 $STABLE_PARAMS;
server $CANARY_SERVER $CANARY_PARAMS;
keepalive 32;
}
EOF
nginx -t && nginx -s reload
}
# Step-by-step canary deployment
for STEP in "${CANARY_STEPS[@]}"; do
echo "=== Setting canary weight to ${STEP}% ==="
update_nginx_weight $STEP
echo "Observing for ${WAIT_TIME} seconds..."
sleep $WAIT_TIME
ERROR_RATE=$(check_error_rate)
echo "Current error rate: ${ERROR_RATE}%"
if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
echo "ERROR: Error rate ${ERROR_RATE}% exceeds threshold ${ERROR_THRESHOLD}%"
echo "Rolling back to 100% stable..."
update_nginx_weight 0
exit 1
fi
echo "Error rate acceptable, proceeding to next step..."
done
echo "Canary deployment completed successfully! 100% traffic on new version."
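One sharp edge when gating on metrics: bc aborts on empty or non-numeric input (for example, jq printing null when a Prometheus query returns no samples), which would crash the comparison above. A defensive normalization sketch:

```shell
# Normalize a possibly-empty or "null" metric value before comparing it;
# bc aborts on non-numeric input, so default such values to 0 first.
normalize_rate() {
  case "$1" in
    ''|null) echo 0 ;;
    *)       echo "$1" ;;
  esac
}

echo "$(normalize_rate null)"   # 0
echo "$(normalize_rate 2.5)"    # 2.5
```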
Strategy Comparison Table
| Item | Rolling Update | Blue-Green | Canary |
|---|---|---|---|
| Deployment speed | Medium | Fast | Slow |
| Rollback ease | Slow | Immediate | Fast |
| Infrastructure cost | Same as existing | 2x | Slight increase |
| Service interruption | None | None | None |
| Mixed versions | Yes | No | Yes |
| Risk | Medium | Low | Very low |
| Suitable environment | Small scale, development | Medium-large services | Large scale, experimental |
| DB change handling | Backward compat required | Backward compat required | Backward compat required |
Tomcat Hot Deploy
Tomcat can redeploy an application without restarting the JVM (hot deploy). Note that the context is still briefly unavailable while it reloads, so for strict zero downtime pair hot deploy with a load balancer that drains the node first.
autoDeploy Configuration
<!-- /opt/tomcat/conf/server.xml -->
<!-- autoDeploy="true": redeploy automatically when a changed WAR is detected
     (XML comments cannot appear between attributes, so the note lives here) -->
<Host name="localhost" appBase="webapps"
      unpackWARs="true"
      autoDeploy="true"
      deployOnStartup="true">
</Host>
Deployment via Tomcat Manager App
# Deploy WAR file (Manager REST API)
curl -u admin:password \
-T /path/to/new-app.war \
"http://localhost:8080/manager/text/deploy?path=/app&update=true"
# Check response
# OK - Deployed application at context path [/app]
# Restart application
curl -u admin:password \
"http://localhost:8080/manager/text/reload?path=/app"
# List deployments
curl -u admin:password \
"http://localhost:8080/manager/text/list"
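The Manager `list` output is line-oriented (`path:state:sessions:dirname` after the `OK` header), which makes it easy to script against. A parsing sketch over a sample line (the response text here is illustrative):

```shell
# Parse one line of Tomcat Manager "list" output: path:state:sessions:dirname
line='/app:running:3:myapp'
IFS=':' read -r ctx_path state sessions dir_name <<< "$line"
echo "$ctx_path is $state with $sessions active session(s)"
# -> /app is running with 3 active session(s)
```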
WAR File Replacement Zero Downtime Script
#!/bin/bash
# tomcat-deploy.sh
TOMCAT_HOME="/opt/tomcat"
WEBAPPS_DIR="$TOMCAT_HOME/webapps"
APP_NAME="myapp"
NEW_WAR=$1
TOMCAT_MANAGER_URL="http://localhost:8080/manager/text"
TOMCAT_USER="admin"
TOMCAT_PASS="secret"
BACKUP_DIR="/opt/tomcat/backup"
if [ -z "$NEW_WAR" ] || [ ! -f "$NEW_WAR" ]; then
echo "Usage: $0 <war_file_path>"
exit 1
fi
mkdir -p "$BACKUP_DIR"
# 1. Backup current WAR
if [ -f "$WEBAPPS_DIR/${APP_NAME}.war" ]; then
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
cp "$WEBAPPS_DIR/${APP_NAME}.war" "$BACKUP_DIR/${APP_NAME}_${TIMESTAMP}.war"
echo "Backed up current WAR to $BACKUP_DIR/${APP_NAME}_${TIMESTAMP}.war"
fi
# 2. Zero-downtime deployment via Tomcat Manager
echo "Deploying new WAR via Tomcat Manager..."
RESULT=$(curl -s -u "$TOMCAT_USER:$TOMCAT_PASS" \
-T "$NEW_WAR" \
"$TOMCAT_MANAGER_URL/deploy?path=/${APP_NAME}&update=true")
echo "Deploy result: $RESULT"
if echo "$RESULT" | grep -q "^OK"; then
echo "Deployment successful!"
# 3. Health check
sleep 5
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"http://localhost:8080/${APP_NAME}/health")
if [ "$HTTP_CODE" = "200" ]; then
echo "Health check passed (HTTP 200)"
else
echo "WARNING: Health check returned HTTP $HTTP_CODE"
echo "Consider rollback if issues persist"
fi
else
echo "ERROR: Deployment failed"
echo "Response: $RESULT"
exit 1
fi
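A rollback companion to the script above only needs to find the newest backup. Since the backup names embed a sortable timestamp, a plain name sort suffices; a temporary directory stands in for /opt/tomcat/backup here:

```shell
# Pick the newest backup WAR by its embedded timestamp (lexicographic sort
# works because the names use YYYYMMDD_HHMMSS).
dir=$(mktemp -d)   # stand-in for /opt/tomcat/backup
touch "$dir/myapp_20240101_120000.war" "$dir/myapp_20240315_093000.war"

latest=$(ls "$dir"/myapp_*.war | sort | tail -n1)
echo "would roll back to: $(basename "$latest")"
# -> would roll back to: myapp_20240315_093000.war

rm -rf "$dir"
```

The actual rollback would then re-upload `$latest` through the same Manager `deploy?update=true` call used for the forward deployment.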
Rolling Update in Docker Environments
# docker-compose.yml
version: '3.8'
services:
app:
image: myapp:${APP_VERSION:-latest}
deploy:
replicas: 3
update_config:
parallelism: 1 # Replace 1 at a time
delay: 10s # Interval between replacements
failure_action: rollback # Auto-rollback on failure
monitor: 30s # New container stabilization monitoring time
max_failure_ratio: 0.1 # Maximum tolerated failure ratio (10%)
rollback_config:
parallelism: 0 # Rollback all simultaneously
delay: 0s
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 3
start_period: 30s
ports:
- "8080:8080"
# The deploy.update_config/rollback_config keys are honored by Docker Swarm,
# not by plain "docker compose up", so deploy as a stack (one-time: docker swarm init)
APP_VERSION=v2.0.0 docker stack deploy -c docker-compose.yml mystack
# Rollback a service to its previous spec ("docker compose rollback" does not exist)
docker service rollback mystack_app
Pro Tips
- The prerequisite for zero downtime deployment is backward-compatible APIs and a DB schema migration strategy: apply backward-compatible schema changes first, then deploy the application afterward.
- To resolve session issues with Blue-Green deployment, use an external session store like Redis or adopt stateless authentication like JWT.
- Canary deployment can also be implemented by deploying first to specific user groups (e.g., internal staff, beta users), combined with Feature Flags.
- Set automatic rollback conditions in the deployment pipeline: error rate thresholds, response time increases, health check failures.
- Make it a rule to monitor metrics (error rate, response time, CPU/memory usage) for at least 30 minutes after each deployment.
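The automatic-rollback tip can be made concrete with a small decision gate; the thresholds and metric inputs below are hypothetical, chosen only to illustrate the shape of such a check:

```shell
# Hypothetical rollback gate: trigger when error rate (%) or p95 latency (ms)
# crosses its threshold. awk handles the floating-point comparison.
should_rollback() {
  local error_rate=$1 p95_ms=$2
  local err_max=1 lat_max=500   # illustrative thresholds
  awk -v e="$error_rate" -v l="$p95_ms" -v em="$err_max" -v lm="$lat_max" \
      'BEGIN { exit !(e+0 > em+0 || l+0 > lm+0) }'
}

if should_rollback 0.4 620; then verdict="rollback"; else verdict="proceed"; fi
echo "$verdict"   # rollback (latency breach, even though error rate is fine)
```

In a pipeline, the metric arguments would come from the monitoring system (as in check_error_rate above) and a `rollback` verdict would invoke the strategy's rollback path: switching the Blue-Green symlink back, or setting the canary weight to zero.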