Load Balancing Production Tips

Getting a load balancer working and running it stably in production are two different things. This page covers field-tested techniques: canary deployment for gradual traffic shifting, automated failure recovery, common mistakes and their fixes, and monitoring.


Canary Deployment

Deploying a new version to all servers at once is risky. Canary deployment deploys the new version to a subset of servers (or a fraction of traffic) first, allowing early detection of problems.

Nginx Weight-based Canary Deployment

# Step 1: Send only ~5% of traffic to the new version server
upstream backend {
    server app-v1-1.example.com weight=19;  # Old version (47.5%)
    server app-v1-2.example.com weight=19;  # Old version (47.5%)
    server app-v2-1.example.com weight=2;   # New version (5%)
}

# Step 2: Shift to 50% if no issues
upstream backend {
    server app-v1-1.example.com weight=2;   # Old version (50%)
    server app-v2-1.example.com weight=1;   # New version (25%)
    server app-v2-2.example.com weight=1;   # New version (25%)
}

# Step 3: Complete full transition
upstream backend {
    server app-v2-1.example.com;
    server app-v2-2.example.com;
    server app-v2-3.example.com;
}

Monitor error rates, response times, and business metrics between each step. Roll back immediately if anomalies are detected.
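Weight arithmetic is worth sanity-checking before each reload: a server's share is its weight divided by the pool's total weight. A quick shell check, using Step 1's 19/19/2 split:

```shell
# Each server receives weight/total of requests; verify the split before
# reloading Nginx. Weights mirror the Step 1 example above.
weights=(19 19 2)
total=0
for w in "${weights[@]}"; do total=$((total + w)); done
for w in "${weights[@]}"; do
    printf '%d/%d -> %d%%\n' "$w" "$total" $((100 * w / total))
done
# 19/40 -> 47%
# 19/40 -> 47%
# 2/40 -> 5%
```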

Header-based Canary (Only Specific Users to New Version)

upstream backend_v1 {
    server app-v1-1.example.com;
    server app-v1-2.example.com;
}

upstream backend_v2 {
    server app-v2-1.example.com;
}

# Route to the new version when the X-Canary: true header is present
# (map must sit at the http level, outside any server block)
map $http_x_canary $upstream_target {
    "true"  "backend_v2";
    default "backend_v1";
}

server {
    location / {
        proxy_pass http://$upstream_target;
    }
}

QA teams and internal users set the X-Canary: true header to test the new version.
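The map is a plain key lookup, so the routing rule can be reasoned about outside Nginx. A shell sketch of the same decision, for illustration only:

```shell
# Mirrors the map above: "true" selects the canary pool; anything else,
# including a missing header, selects the stable pool.
route_for() {
    case "$1" in
        true) echo "backend_v2" ;;
        *)    echo "backend_v1" ;;
    esac
}
route_for true   # -> backend_v2
route_for ""     # -> backend_v1
```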


Gradual Traffic Migration Automation

A script to automatically update Nginx configuration in the deployment pipeline.

#!/bin/bash
# canary-deploy.sh - move traffic to the new server in stages
set -euo pipefail

NEW_SERVER=${1:?usage: canary-deploy.sh <new-server-host>}
CONFIG=/etc/nginx/conf.d/backend.conf
LOG=/var/log/deploy/canary.log

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }

# Nginx rejects weight=0, so a zero weight is written as "down" instead
server_line() {
    local host=$1 weight=$2
    if [ "$weight" -eq 0 ]; then
        echo "    server ${host} down;"
    else
        echo "    server ${host} weight=${weight};"
    fi
}

deploy_canary() {
    local weight_old=$1 weight_new=$2

    {
        echo "upstream backend {"
        server_line app-v1-1.example.com "$weight_old"
        server_line app-v1-2.example.com "$weight_old"
        server_line "$NEW_SERVER" "$weight_new"
        echo "    keepalive 32;"
        echo "}"
    } > "$CONFIG"

    nginx -t && nginx -s reload
    log "Traffic split: old=${weight_old} (x2) new=${weight_new}"
}

check_error_rate() {
    local rate
    rate=$(curl -s "http://prometheus:9090/api/v1/query" \
        --data-urlencode 'query=rate(nginx_http_requests_total{status=~"5.."}[60s])' \
        | jq -r '.data.result[0].value[1] // "0"')
    echo "$rate"
}

# Canary weight per step; the two old servers each keep weight (10 - step),
# so the canary's actual share is step / (step + 2 * (10 - step))
STEPS=(1 3 5)

for step in "${STEPS[@]}"; do
    old=$((10 - step))
    pct=$((100 * step / (step + 2 * old)))
    deploy_canary "$old" "$step"

    log "Waiting 5 minutes at ~${pct}% canary..."
    sleep 300

    error_rate=$(check_error_rate)
    log "Error rate: $error_rate"

    if (( $(echo "$error_rate > 0.01" | bc -l) )); then
        log "ERROR RATE TOO HIGH. Rolling back..."
        deploy_canary 10 0
        exit 1
    fi
done

log "Canary checks passed. Switching to 100%."
deploy_canary 0 10

Automatic Node Removal and Recovery

Simulating Nginx Failure

# 1. Force server 1 down
sudo systemctl stop tomcat1

# 2. Watch the Nginx error log: requests hitting the stopped server show
#    up as connection errors, and after max_fails failures the server is
#    excluded for fail_timeout
tail -f /var/log/nginx/error.log
# [error] connect() failed (111: Connection refused) while connecting to upstream

# 3. Recover server 1
sudo systemctl start tomcat1

# 4. After fail_timeout expires, Nginx retries the server with a live
#    request; once it responds, it receives traffic again (open source
#    Nginx writes no explicit "recovered" log entry)
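During the simulation it helps to quantify how many requests hit the dead server; counting connect() failures in the error log gives a rough picture. The sample lines below are illustrative; point the grep at your real error.log:

```shell
# Two fake error-log lines standing in for /var/log/nginx/error.log
cat > /tmp/error.sample <<'EOF'
2024/01/01 10:00:01 [error] connect() failed (111: Connection refused) while connecting to upstream, upstream: "http://10.0.0.1:8080/"
2024/01/01 10:00:02 [error] connect() failed (111: Connection refused) while connecting to upstream, upstream: "http://10.0.0.1:8080/"
EOF
grep -c 'connect() failed' /tmp/error.sample   # 2
```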

Recommended Failure Detection Settings

upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;

    # Standby server: receives traffic only when all primary servers are unavailable
    server backup.example.com:8080 backup;

    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;

        error_page 502 503 504 @fallback;
    }

    location @fallback {
        # default_type sets the Content-Type for return's body
        default_type application/json;
        return 503 '{"error":"Service temporarily unavailable","retry_after":30}';
    }
}
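Because the fallback body is JSON, a monitor or client can extract the retry hint mechanically. With plain sed (avoiding the assumption that jq is installed):

```shell
# Pull retry_after out of the fallback body with POSIX tools only
body='{"error":"Service temporarily unavailable","retry_after":30}'
echo "$body" | sed -n 's/.*"retry_after":\([0-9]*\).*/\1/p'   # 30
```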

Common Mistakes and Solutions

Mistake 1: Applying IP Hash Globally

Enabling ip_hash everywhere to get sticky sessions instead of implementing session sharing.

Problem: In large NAT environments (corporate networks, mobile carriers), thousands of users share one public IP, so all of their traffic lands on the same server.

Solution: Share sessions via Redis and use Round Robin or Least Connections instead.

# Bad approach
upstream backend {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Good approach: Redis session sharing + Least Connections
upstream backend {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

Mistake 2: Missing Timeout Configuration

# Risky: relying on Nginx defaults (60s connect/read/send) - far too long for most APIs
location / {
    proxy_pass http://backend;
}

# Good: timeouts matched to each service's characteristics
location /api/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;    # Connection timeout (keep short)
    proxy_read_timeout 30s;      # Response wait (match API behavior)
    proxy_send_timeout 30s;
}

location /upload/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;
    proxy_read_timeout 300s;     # File uploads need a longer timeout
    proxy_send_timeout 300s;
    client_max_body_size 100m;
}

Mistake 3: Leaving Health Checks at Their Defaults

# Risky: default passive checks only - a single failed request ejects a
# server for 10s (max_fails=1, fail_timeout=10s), and there is no standby
upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Good: tuned passive health checks + standby server
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 backup;
}

Mistake 4: Not Setting Upstream Keepalive

# Bad: a new TCP connection is established for every request (performance degradation)
upstream backend {
    server 10.0.0.1:8080;
}

# Good: enable connection reuse
upstream backend {
    server 10.0.0.1:8080;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;         # keepalive requires HTTP/1.1...
        proxy_set_header Connection ""; # ...and a cleared Connection header
    }
}

Mistake 5: Load Balancer as Single Point of Failure

# Bad: a single load balancer is itself a SPOF
[Single Nginx LB] → App1, App2

# Good: dual load balancers with Keepalived + VRRP
[VIP: 192.168.1.100]
├── Nginx LB 1 (Active)  ← handles traffic normally
└── Nginx LB 2 (Standby) ← takes over automatically if LB1 fails
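The active/standby pair above can be wired up with a small keepalived.conf on each node. A minimal sketch for the active node (interface name, router ID, and priority are placeholder values; the standby uses state BACKUP and a lower priority):

```
vrrp_instance VI_1 {
    state MASTER
    interface eth0           # NIC that carries the VIP
    virtual_router_id 51     # must match on both nodes
    priority 100             # standby uses a lower value, e.g. 90
    advert_int 1
    virtual_ipaddress {
        192.168.1.100        # the VIP clients connect to
    }
}
```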

Load Balancer Monitoring

Prometheus + nginx-exporter

# docker-compose.yml
services:
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    command:
      - -nginx.scrape-uri=http://nginx:8080/nginx_status
    ports:
      - "9113:9113"
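The exporter's scrape URI assumes Nginx exposes stub_status on port 8080. A matching server block might look like this (port and path must agree with the scrape-uri; the allow range is a placeholder for the exporter's network):

```
server {
    listen 8080;
    location /nginx_status {
        stub_status;             # active connections, accepts, handled, etc.
        allow 172.16.0.0/12;     # restrict to the exporter's network
        deny all;
    }
}
```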

Key metrics to monitor:

Metric                    Threshold (example)   Meaning
Active connections        > 10,000              Sign of server overload
5xx error rate            > 1%                  Backend failure
Upstream response time    > 500 ms              Backend performance degradation
Excluded server count     > 0                   Failed server detected

Log Analysis for Imbalance Detection

# Assumes a log_format whose line ends with "$upstream_response_time $upstream_addr"

# Aggregate request count per upstream server
awk '{print $NF}' /var/log/nginx/access.log \
    | sort | uniq -c | sort -rn

# Average response time per upstream server
awk '{print $(NF-1), $NF}' /var/log/nginx/access.log \
    | awk '{sum[$2]+=$1; cnt[$2]++} END {for (k in sum) print sum[k]/cnt[k], k}' \
    | sort -n

# Find servers generating 5xx errors
grep ' 5[0-9][0-9] ' /var/log/nginx/access.log \
    | awk '{print $NF}' | sort | uniq -c
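The pipelines above can be validated against synthetic data before pointing them at a real log:

```shell
# Three fake access-log lines ending in "response_time upstream_addr"
cat > /tmp/access.sample <<'EOF'
GET /a 200 0.120 10.0.0.1:8080
GET /b 200 0.080 10.0.0.1:8080
GET /c 200 0.300 10.0.0.2:8080
EOF
# Per-server request counts, busiest first
awk '{print $NF}' /tmp/access.sample | sort | uniq -c | sort -rn
```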

Automated Rollback

#!/bin/bash
# rollback.sh - auto rollback when the error rate exceeds a threshold
# deploy_new_version, get_error_rate_percent, and send_alert are
# placeholders for your pipeline's own functions.

THRESHOLD=5   # Roll back if the 5xx error rate exceeds 5%
CONFIG=/etc/nginx/conf.d/backend.conf
BACKUP=/etc/nginx/conf.d/backend.conf.bak

cp "$CONFIG" "$BACKUP"

deploy_new_version

# Watch for 3 minutes (6 checks, 30s apart)
for i in {1..6}; do
    sleep 30
    error_rate=$(get_error_rate_percent)

    if (( $(echo "$error_rate > $THRESHOLD" | bc -l) )); then
        echo "Error rate ${error_rate}% exceeds threshold. Rolling back..."
        cp "$BACKUP" "$CONFIG"
        nginx -s reload
        send_alert "Auto rollback triggered: error rate ${error_rate}%"
        exit 1
    fi
done

echo "Deploy successful."
rm "$BACKUP"

Production Load Balancer Checklist

  • max_fails and fail_timeout configured in upstream
  • Standby server (backup) or custom 503 page configured
  • keepalive configured for connection reuse
  • Per-service timeouts appropriately set
  • Health check endpoint implemented in the application
  • Retry configured with proxy_next_upstream
  • Load balancer itself is redundant (SPOF eliminated)
  • Metrics collected via stub_status or Prometheus exporter
  • Upstream response time logged ($upstream_response_time in log format)
  • Deployment automation includes rollback logic