Zero Downtime Deployment — Rolling, Blue-Green, and Canary
Software deployment is unavoidable, but stopping the service every time you deploy is unacceptable in modern business. Zero Downtime Deployment is the technique of deploying new versions without impacting users. This chapter covers the three core strategies — Rolling Update, Blue-Green, and Canary — in detail, explaining how to implement each and when they're appropriate.
Why Zero Downtime Deployment Is Necessary
The Cost of Service Interruption
There was a time when briefly stopping the service during deployment was considered normal. But in modern environments with 24-hour global services, microservice architectures, and CI/CD pipelines that trigger dozens of deployments per day, service interruption equals loss.
Annual downtime by deployment frequency (assuming 5 minutes per deployment):
- Weekly: 52/year × 5 min = 260 minutes (4.3 hours) downtime
- Daily: 365/year × 5 min = 1,825 minutes (30.4 hours) downtime
- 10x daily: 3,650/year × 5 min = 18,250 minutes (304 hours) downtime
As deployment frequency increases, traditional disruptive deployments become practically impossible.
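The downtime arithmetic above is easy to reproduce; a quick shell check for the 10x-daily case:

```shell
# Annual downtime = deployments per year x minutes per deployment
deploys_per_year=3650   # 10 deployments a day
minutes_per_deploy=5
downtime_min=$((deploys_per_year * minutes_per_deploy))
downtime_hours=$((downtime_min / 60))
echo "${downtime_min} minutes (~${downtime_hours} hours) of downtime per year"
# -> 18250 minutes (~304 hours) of downtime per year
```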
Core Requirements for Zero Downtime Deployment
- All requests must be handled normally during deployment
- Immediate rollback must be possible on deployment failure
- Data integrity must be maintained during deployment
- Users must not be aware that a deployment is in progress
Rolling Update
Rolling Update replaces instances sequentially. Rather than replacing all servers at once, it replaces one or a few at a time to update the entire fleet.
How It Works
Initial: [v1] [v1] [v1] [v1] ← All 4 running v1
Step 1: [v2] [v1] [v1] [v1] ← Server 1 being replaced (briefly removed from traffic)
Traffic restored after replacement completes
Step 2: [v2] [v2] [v1] [v1] ← Server 2 replaced
Step 3: [v2] [v2] [v2] [v1] ← Server 3 replaced
Step 4: [v2] [v2] [v2] [v2] ← Complete
Pros and Cons
Advantages:
- No additional infrastructure cost
- Relatively simple to implement
- Gradual deployment allows early detection of anomalies
Disadvantages:
- v1 and v2 serve simultaneously during deployment (backward compatibility required)
- Full rollback is slow (must replace all servers back to v1)
- Longer deployment time
Nginx Upstream Configuration (Rolling Update)
# /etc/nginx/conf.d/upstream.conf
upstream app_backend {
# max_fails: consecutive failure count, fail_timeout: duration to exclude server
server 192.168.1.20:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.21:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.22:8080 max_fails=3 fail_timeout=30s;
server 192.168.1.23:8080 max_fails=3 fail_timeout=30s;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
}
Rolling Update Deployment Script
#!/bin/bash
# rolling-deploy.sh
SERVERS=("192.168.1.20" "192.168.1.21" "192.168.1.22" "192.168.1.23")
APP_PORT=8080
NEW_VERSION=$1
# Health checks are issued per server below as http://<server>:$APP_PORT/health
if [ -z "$NEW_VERSION" ]; then
echo "Usage: $0 <version>"
exit 1
fi
deploy_to_server() {
local SERVER=$1
echo "=== Deploying $NEW_VERSION to $SERVER ==="
# 1. Take the server out of rotation: pause the app so Nginx's passive
#    health checks (max_fails/fail_timeout) mark it down and stop sending traffic
echo "Draining $SERVER..."
ssh root@$SERVER "curl -s -X POST http://localhost:$APP_PORT/actuator/pause || true"
sleep 5 # Wait for in-flight requests to drain
# 2. Deploy new version
echo "Deploying new version..."
ssh root@$SERVER "
cd /opt/app
docker pull myapp:$NEW_VERSION
docker stop app || true
docker rm app || true
docker run -d --name app -p $APP_PORT:8080 myapp:$NEW_VERSION
"
# 3. Health check
echo "Waiting for health check..."
for i in {1..30}; do
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"http://${SERVER}:${APP_PORT}/health" 2>/dev/null)
if [ "$HTTP_CODE" = "200" ]; then
echo "$SERVER is healthy (attempt $i)"
return 0
fi
echo "Attempt $i: HTTP $HTTP_CODE, waiting..."
sleep 3
done
echo "Health check failed for $SERVER"
return 1
}
# Sequential deployment
for SERVER in "${SERVERS[@]}"; do
if ! deploy_to_server "$SERVER"; then
echo "DEPLOYMENT FAILED on $SERVER. Manual intervention required."
exit 1
fi
echo "Successfully deployed to $SERVER"
echo "---"
done
echo "Rolling update completed successfully!"
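The health-check retry loop in the script above can be factored into a reusable function. This sketch exercises it against a stub probe instead of a live server; the stub and the shortened sleep are illustrative only:

```shell
# Retry loop from rolling-deploy.sh, generalized: run a probe command
# up to N times and report when it first succeeds.
wait_healthy() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 0.1   # shortened from the script's 3s for the demo
  done
  return 1
}

# Stub probe: fails twice, then succeeds (simulates an app warming up)
tries=0
probe() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }

result=$(wait_healthy 5 probe)
echo "$result"   # healthy after 3 attempt(s)
```

In the real script the probe would be the curl health check, e.g. `wait_healthy 30 curl -sf "http://${SERVER}:${APP_PORT}/health"`.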
Blue-Green Deployment
Blue-Green deployment maintains two identical environments (Blue and Green) and switches traffic all at once. If Blue is currently serving, deploy the new version to Green and switch traffic from Blue to Green when ready.
How It Works
Before deployment:
Client → Nginx(VIP) → Blue(v1) [Active]
→ Green(v1) [Idle]
New version deployed:
Client → Nginx(VIP) → Blue(v1) [Active]
→ Green(v2) [Ready, under testing]
Traffic switch:
Client → Nginx(VIP) → Blue(v1) [Standby, kept for 30 minutes]
→ Green(v2) [Active]
If rollback needed:
Client → Nginx(VIP) → Blue(v1) [Immediately switched to Active]
→ Green(v2) [Standby]
Pros and Cons
Advantages:
- Immediate rollback (just switch traffic)
- Only one version serves during deployment
- New version can be thoroughly tested in the production environment before switching
Disadvantages:
- Twice the infrastructure cost
- DB schema changes must maintain backward compatibility
- Session handling is complex (in-memory sessions on Blue are lost at the switch unless stored externally)
Nginx Blue-Green Switch Script
#!/bin/bash
# blue-green-switch.sh
NGINX_CONF_DIR="/etc/nginx/conf.d"
BLUE_UPSTREAM="upstream_blue.conf"
GREEN_UPSTREAM="upstream_green.conf"
CURRENT_SYMLINK="$NGINX_CONF_DIR/current_upstream.conf"
# Determine current active environment
get_current_env() {
if [ -L "$CURRENT_SYMLINK" ]; then
readlink "$CURRENT_SYMLINK" | grep -oE 'blue|green' # -E for portability (-P needs GNU grep)
else
echo "blue" # default
fi
}
CURRENT=$(get_current_env)
echo "Current active environment: $CURRENT"
if [ "$CURRENT" = "blue" ]; then
NEW_ENV="green"
NEW_CONF="$NGINX_CONF_DIR/$GREEN_UPSTREAM"
else
NEW_ENV="blue"
NEW_CONF="$NGINX_CONF_DIR/$BLUE_UPSTREAM"
fi
echo "Switching to $NEW_ENV environment..."
# Blue environment config
cat > "$NGINX_CONF_DIR/$BLUE_UPSTREAM" << 'EOF'
upstream app_backend {
server 192.168.1.20:8080;
server 192.168.1.21:8080;
keepalive 32;
}
EOF
# Green environment config
cat > "$NGINX_CONF_DIR/$GREEN_UPSTREAM" << 'EOF'
upstream app_backend {
server 192.168.1.30:8080;
server 192.168.1.31:8080;
keepalive 32;
}
EOF
# Atomic symlink swap
ln -sfn "$NEW_CONF" "$CURRENT_SYMLINK"
# Validate and reload Nginx
if nginx -t 2>/dev/null; then
nginx -s reload
echo "Successfully switched to $NEW_ENV environment"
echo "Previous environment ($CURRENT) remains on standby for 30 minutes"
else
# Rollback
ln -sfn "$NGINX_CONF_DIR/upstream_${CURRENT}.conf" "$CURRENT_SYMLINK" # matches the upstream_blue/green.conf naming
echo "ERROR: Nginx config test failed, rolled back to $CURRENT"
exit 1
fi
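The symlink toggle at the heart of the script can be exercised locally; a temporary directory stands in for /etc/nginx/conf.d here:

```shell
# Exercise the symlink-based environment detection from blue-green-switch.sh
dir=$(mktemp -d)   # stand-in for /etc/nginx/conf.d
touch "$dir/upstream_blue.conf" "$dir/upstream_green.conf"

ln -sfn "$dir/upstream_blue.conf" "$dir/current_upstream.conf"
current=$(readlink "$dir/current_upstream.conf" | grep -oE 'blue|green')
echo "active: $current"   # active: blue

# The switch: -f replaces the existing link, -n avoids descending into it,
# so the swap is effectively atomic from Nginx's point of view
ln -sfn "$dir/upstream_green.conf" "$dir/current_upstream.conf"
current=$(readlink "$dir/current_upstream.conf" | grep -oE 'blue|green')
echo "active: $current"   # active: green

rm -rf "$dir"
```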
Nginx Configuration (Blue-Green)
# /etc/nginx/nginx.conf
# Inside the http { } block:
include /etc/nginx/conf.d/current_upstream.conf; # symlink to the active environment
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://app_backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# Fallback to next server on health check failure
proxy_next_upstream error timeout http_502 http_503 http_504;
proxy_next_upstream_tries 2;
}
}
Canary Deployment
Canary deployment first applies the new version to a small portion of total traffic (e.g., 1-5%) and gradually increases the traffic ratio if no issues arise. The name comes from miners using canary birds to detect toxic gases.
How It Works
Step 1: v1(95%) + v2(5%) — Initial canary deployment
Step 2: v1(75%) + v2(25%) — Increase ratio if no issues
Step 3: v1(50%) + v2(50%) — Half traffic switched
Step 4: v1(0%) + v2(100%) — Full switch complete
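Getting the weights right is the subtle part: Nginx gives each server weight / sum-of-weights of the traffic, so with two stable servers a naive weight=95/weight=5 split hands the canary only 5/195, about 2.6%. One scheme that yields exactly C% (an assumption of this sketch, not an Nginx convention) is canary weight 2*C with each stable server at 100 - C:

```shell
# Share per server = weight / total weight. With two stable servers,
# canary_weight = 2*C and stable_weight = 100 - C give the canary exactly C%.
C=5
canary_weight=$((2 * C))                       # 10
stable_weight=$((100 - C))                     # 95 per stable server
total=$((2 * stable_weight + canary_weight))   # 200
canary_share=$((100 * canary_weight / total))
echo "canary share: ${canary_share}%"          # canary share: 5%
```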
Nginx Canary Configuration (Using weight)
# /etc/nginx/conf.d/canary.conf
upstream app_backend {
# Share per server = weight / sum of all weights.
# Two stable servers at weight=19 (47.5% each) + canary at weight=2 => 2/40 = 5%
server 192.168.1.20:8080 weight=19; # v1 (47.5%)
server 192.168.1.21:8080 weight=19; # v1 (47.5%)
server 192.168.1.30:8080 weight=2;  # v2 canary (5%)
keepalive 32;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
}
# Separate endpoint for canary monitoring
location /canary-status {
proxy_pass http://192.168.1.30:8080/status; # Direct access to v2
}
}
Canary Gradual Transition Script
#!/bin/bash
# canary-deploy.sh
NGINX_CONF="/etc/nginx/conf.d/canary.conf"
CANARY_STEPS=(5 10 25 50 75 100) # Progressive ratio increase
CANARY_SERVER="192.168.1.30:8080"
STABLE_SERVER1="192.168.1.20:8080"
STABLE_SERVER2="192.168.1.21:8080"
ERROR_THRESHOLD=1 # Error rate threshold (%)
WAIT_TIME=300 # Observation time per step (seconds)
check_error_rate() {
# Query the 5xx ratio from Prometheus (example query; adjust metric names).
# jq's // operator falls back to "0" when the query returns no samples.
ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(http_requests_total{status=~'5..'}[5m])/rate(http_requests_total[5m])*100" \
| jq -r '.data.result[0].value[1] // "0"' 2>/dev/null)
echo "${ERROR_RATE:-0}"
}
update_nginx_weight() {
local CANARY_WEIGHT=$1
# Share per server = weight / sum of weights. With two stable servers,
# canary weight 2*C and stable weight (100 - C) give the canary exactly C%
# (total = 200). Nginx rejects weight=0, so mark a server "down" instead
# when its share would be zero.
local STABLE_PARAMS="weight=$((100 - CANARY_WEIGHT))"
local CANARY_PARAMS="weight=$((2 * CANARY_WEIGHT))"
[ "$CANARY_WEIGHT" -eq 0 ] && CANARY_PARAMS="down"
[ "$CANARY_WEIGHT" -eq 100 ] && STABLE_PARAMS="down"
cat > "$NGINX_CONF" << EOF
upstream app_backend {
server $STABLE_SERVER1 $STABLE_PARAMS;
server $STABLE_SERVER2 $STABLE_PARAMS;
server $CANARY_SERVER $CANARY_PARAMS;
keepalive 32;
}
EOF
nginx -t && nginx -s reload
}
# Step-by-step canary deployment
for STEP in "${CANARY_STEPS[@]}"; do
echo "=== Setting canary weight to ${STEP}% ==="
update_nginx_weight $STEP
echo "Observing for ${WAIT_TIME} seconds..."
sleep $WAIT_TIME
ERROR_RATE=$(check_error_rate)
echo "Current error rate: ${ERROR_RATE}%"
if (( $(echo "$ERROR_RATE > $ERROR_THRESHOLD" | bc -l) )); then
echo "ERROR: Error rate ${ERROR_RATE}% exceeds threshold ${ERROR_THRESHOLD}%"
echo "Rolling back to 100% stable..."
update_nginx_weight 0
exit 1
fi
echo "Error rate acceptable, proceeding to next step..."
done
echo "Canary deployment completed successfully! 100% traffic on new version."
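One sharp edge when gating on metrics: bc aborts on empty or non-numeric input (for example, jq printing null when a Prometheus query returns no samples), which would crash the comparison above. A defensive normalization sketch:

```shell
# Normalize a possibly-empty or "null" metric value before comparing it;
# bc aborts on non-numeric input, so default such values to 0 first.
normalize_rate() {
  case "$1" in
    ''|null) echo 0 ;;
    *)       echo "$1" ;;
  esac
}

echo "$(normalize_rate null)"   # 0
echo "$(normalize_rate 2.5)"    # 2.5
```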
Strategy Comparison Table
| Item | Rolling Update | Blue-Green | Canary |
|---|---|---|---|
| Deployment speed | Medium | Fast | Slow |
| Rollback ease | Slow | Immediate | Fast |
| Infrastructure cost | Same as existing | 2x | Slight increase |
| Service interruption | None | None | None |
| Mixed versions | Yes | No | Yes |
| Risk | Medium | Low | Very low |
| Suitable environment | Small scale, development | Medium-large services | Large scale, experimental |
| DB change handling | Backward compat required | Backward compat required | Backward compat required |
Tomcat Hot Deploy
Tomcat can redeploy an application without restarting the JVM (hot deploy). Note that the context is still briefly unavailable while it reloads, so for strict zero downtime pair hot deploy with a load balancer that drains the node first.
autoDeploy Configuration
<!-- /opt/tomcat/conf/server.xml -->
<!-- autoDeploy="true": redeploy automatically when a changed WAR is detected
     (XML comments cannot appear between attributes, so the note lives here) -->
<Host name="localhost" appBase="webapps"
      unpackWARs="true"
      autoDeploy="true"
      deployOnStartup="true">
</Host>
Deployment via Tomcat Manager App
# Deploy WAR file (Manager REST API)
curl -u admin:password \
-T /path/to/new-app.war \
"http://localhost:8080/manager/text/deploy?path=/app&update=true"
# Check response
# OK - Deployed application at context path [/app]
# Restart application
curl -u admin:password \
"http://localhost:8080/manager/text/reload?path=/app"
# List deployments
curl -u admin:password \
"http://localhost:8080/manager/text/list"
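The Manager `list` output is line-oriented (`path:state:sessions:dirname` after the `OK` header), which makes it easy to script against. A parsing sketch over a sample line (the response text here is illustrative):

```shell
# Parse one line of Tomcat Manager "list" output: path:state:sessions:dirname
line='/app:running:3:myapp'
IFS=':' read -r ctx_path state sessions dir_name <<< "$line"
echo "$ctx_path is $state with $sessions active session(s)"
# -> /app is running with 3 active session(s)
```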
WAR File Replacement Zero Downtime Script
#!/bin/bash
# tomcat-deploy.sh
TOMCAT_HOME="/opt/tomcat"
WEBAPPS_DIR="$TOMCAT_HOME/webapps"
APP_NAME="myapp"
NEW_WAR=$1
TOMCAT_MANAGER_URL="http://localhost:8080/manager/text"
TOMCAT_USER="admin"
TOMCAT_PASS="secret"
BACKUP_DIR="/opt/tomcat/backup"
if [ -z "$NEW_WAR" ] || [ ! -f "$NEW_WAR" ]; then
echo "Usage: $0 <war_file_path>"
exit 1
fi
mkdir -p "$BACKUP_DIR"
# 1. Backup current WAR
if [ -f "$WEBAPPS_DIR/${APP_NAME}.war" ]; then
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
cp "$WEBAPPS_DIR/${APP_NAME}.war" "$BACKUP_DIR/${APP_NAME}_${TIMESTAMP}.war"
echo "Backed up current WAR to $BACKUP_DIR/${APP_NAME}_${TIMESTAMP}.war"
fi
# 2. Zero-downtime deployment via Tomcat Manager
echo "Deploying new WAR via Tomcat Manager..."
RESULT=$(curl -s -u "$TOMCAT_USER:$TOMCAT_PASS" \
-T "$NEW_WAR" \
"$TOMCAT_MANAGER_URL/deploy?path=/${APP_NAME}&update=true")
echo "Deploy result: $RESULT"
if echo "$RESULT" | grep -q "^OK"; then
echo "Deployment successful!"
# 3. Health check
sleep 5
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
"http://localhost:8080/${APP_NAME}/health")
if [ "$HTTP_CODE" = "200" ]; then
echo "Health check passed (HTTP 200)"
else
echo "WARNING: Health check returned HTTP $HTTP_CODE"
echo "Consider rollback if issues persist"
fi
else
echo "ERROR: Deployment failed"
echo "Response: $RESULT"
exit 1
fi
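A rollback companion to the script above only needs to find the newest backup. Since the backup names embed a sortable timestamp, a plain name sort suffices; a temporary directory stands in for /opt/tomcat/backup here:

```shell
# Pick the newest backup WAR by its embedded timestamp (lexicographic sort
# works because the names use YYYYMMDD_HHMMSS).
dir=$(mktemp -d)   # stand-in for /opt/tomcat/backup
touch "$dir/myapp_20240101_120000.war" "$dir/myapp_20240315_093000.war"

latest=$(ls "$dir"/myapp_*.war | sort | tail -n1)
echo "would roll back to: $(basename "$latest")"
# -> would roll back to: myapp_20240315_093000.war

rm -rf "$dir"
```

The actual rollback would then re-upload `$latest` through the same Manager `deploy?update=true` call used for the forward deployment.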
Rolling Update in Docker Environments
# docker-compose.yml
version: '3.8'
services:
app:
image: myapp:${APP_VERSION:-latest}
deploy:
replicas: 3
update_config:
parallelism: 1 # Replace 1 at a time
delay: 10s # Interval between replacements
failure_action: rollback # Auto-rollback on failure
monitor: 30s # New container stabilization monitoring time
max_failure_ratio: 0.1 # Maximum tolerated failure ratio (10%)
rollback_config:
parallelism: 0 # Rollback all simultaneously
delay: 0s
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 10s
timeout: 5s
retries: 3
start_period: 30s
ports:
- "8080:8080"
# The deploy.update_config/rollback_config keys are honored by Docker Swarm,
# not by plain "docker compose up", so deploy as a stack (one-time: docker swarm init)
APP_VERSION=v2.0.0 docker stack deploy -c docker-compose.yml mystack
# Rollback a service to its previous spec ("docker compose rollback" does not exist)
docker service rollback mystack_app
Pro Tips
- The prerequisite for zero downtime deployment is backward-compatible APIs and a DB schema migration strategy: apply backward-compatible schema changes first, then deploy the application afterward.
- To resolve session issues with Blue-Green deployment, use an external session store like Redis or adopt stateless authentication like JWT.
- Canary deployment can also be implemented by deploying first to specific user groups (e.g., internal staff, beta users), combined with Feature Flags.
- Set automatic rollback conditions in the deployment pipeline: error rate thresholds, response time increases, health check failures.
- Make it a rule to monitor metrics (error rate, response time, CPU/memory usage) for at least 30 minutes after each deployment.
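The automatic-rollback tip can be made concrete with a small decision gate; the thresholds and metric inputs below are hypothetical, chosen only to illustrate the shape of such a check:

```shell
# Hypothetical rollback gate: trigger when error rate (%) or p95 latency (ms)
# crosses its threshold. awk handles the floating-point comparison.
should_rollback() {
  local error_rate=$1 p95_ms=$2
  local err_max=1 lat_max=500   # illustrative thresholds
  awk -v e="$error_rate" -v l="$p95_ms" -v em="$err_max" -v lm="$lat_max" \
      'BEGIN { exit !(e+0 > em+0 || l+0 > lm+0) }'
}

if should_rollback 0.4 620; then verdict="rollback"; else verdict="proceed"; fi
echo "$verdict"   # rollback (latency breach, even though error rate is fine)
```

In a pipeline, the metric arguments would come from the monitoring system (as in check_error_rate above) and a `rollback` verdict would invoke the strategy's rollback path: switching the Blue-Green symlink back, or setting the canary weight to zero.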