Load Balancing Production Tips

Getting a load balancer working and running it stably in production are two different things. This page covers field-tested techniques: canary deployment for gradual traffic shifting, automated failure recovery, common mistakes and their fixes, and monitoring.


Canary Deployment

Deploying a new version to all servers at once is risky. Canary deployment deploys the new version to a subset of servers (or a fraction of traffic) first, allowing early detection of problems.

Nginx Weight-based Canary Deployment

# Step 1: Send only ~5% of traffic to the new version server
upstream backend {
    server app-v1-1.example.com weight=19;  # Old version (47.5%)
    server app-v1-2.example.com weight=19;  # Old version (47.5%)
    server app-v2-1.example.com weight=2;   # New version (5%)
}

# Step 2: Shift to 50% if no issues
upstream backend {
    server app-v1-1.example.com weight=2;   # Old version (50%)
    server app-v2-1.example.com weight=1;   # New version (25%)
    server app-v2-2.example.com weight=1;   # New version (25%)
}

# Step 3: Complete full transition
upstream backend {
    server app-v2-1.example.com;
    server app-v2-2.example.com;
    server app-v2-3.example.com;
}

Monitor error rates, response times, and business metrics between each step. Roll back immediately if anomalies are detected.
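Weight arithmetic is worth sanity-checking before each reload: a server's share is its weight divided by the pool's total weight. A quick shell check, using Step 1's 19/19/2 split:

```shell
# Each server receives weight/total of requests; verify the split before
# reloading Nginx. Weights mirror the Step 1 example above.
weights=(19 19 2)
total=0
for w in "${weights[@]}"; do total=$((total + w)); done
for w in "${weights[@]}"; do
    printf '%d/%d -> %d%%\n' "$w" "$total" $((100 * w / total))
done
# 19/40 -> 47%
# 19/40 -> 47%
# 2/40 -> 5%
```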

Header-based Canary (Only Specific Users to New Version)

upstream backend_v1 {
    server app-v1-1.example.com;
    server app-v1-2.example.com;
}

upstream backend_v2 {
    server app-v2-1.example.com;
}

# Route to the new version when the X-Canary: true header is present
# (map must sit at the http level, outside any server block)
map $http_x_canary $upstream_target {
    "true"  "backend_v2";
    default "backend_v1";
}

server {
    location / {
        proxy_pass http://$upstream_target;
    }
}

QA teams and internal users set the X-Canary: true header to test the new version.
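The map is a plain key lookup, so the routing rule can be reasoned about outside Nginx. A shell sketch of the same decision, for illustration only:

```shell
# Mirrors the map above: "true" selects the canary pool; anything else,
# including a missing header, selects the stable pool.
route_for() {
    case "$1" in
        true) echo "backend_v2" ;;
        *)    echo "backend_v1" ;;
    esac
}
route_for true   # -> backend_v2
route_for ""     # -> backend_v1
```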


Gradual Traffic Migration Automation

A script to automatically update Nginx configuration in the deployment pipeline.

#!/bin/bash
# canary-deploy.sh - move traffic to the new server in stages
set -euo pipefail

NEW_SERVER=${1:?usage: canary-deploy.sh <new-server-host>}
CONFIG=/etc/nginx/conf.d/backend.conf
LOG=/var/log/deploy/canary.log

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }

# Nginx rejects weight=0, so a zero weight is written as "down" instead
server_line() {
    local host=$1 weight=$2
    if [ "$weight" -eq 0 ]; then
        echo "    server ${host} down;"
    else
        echo "    server ${host} weight=${weight};"
    fi
}

deploy_canary() {
    local weight_old=$1 weight_new=$2

    {
        echo "upstream backend {"
        server_line app-v1-1.example.com "$weight_old"
        server_line app-v1-2.example.com "$weight_old"
        server_line "$NEW_SERVER" "$weight_new"
        echo "    keepalive 32;"
        echo "}"
    } > "$CONFIG"

    nginx -t && nginx -s reload
    log "Traffic split: old=${weight_old} (x2) new=${weight_new}"
}

check_error_rate() {
    local rate
    rate=$(curl -s "http://prometheus:9090/api/v1/query" \
        --data-urlencode 'query=rate(nginx_http_requests_total{status=~"5.."}[60s])' \
        | jq -r '.data.result[0].value[1] // "0"')
    echo "$rate"
}

# Canary weight per step; the two old servers each keep weight (10 - step),
# so the canary's actual share is step / (step + 2 * (10 - step))
STEPS=(1 3 5)

for step in "${STEPS[@]}"; do
    old=$((10 - step))
    pct=$((100 * step / (step + 2 * old)))
    deploy_canary "$old" "$step"

    log "Waiting 5 minutes at ~${pct}% canary..."
    sleep 300

    error_rate=$(check_error_rate)
    log "Error rate: $error_rate"

    if (( $(echo "$error_rate > 0.01" | bc -l) )); then
        log "ERROR RATE TOO HIGH. Rolling back..."
        deploy_canary 10 0
        exit 1
    fi
done

log "Canary checks passed. Switching to 100%."
deploy_canary 0 10

Automatic Node Removal and Recovery

Simulating Nginx Failure

# 1. Force server 1 down
sudo systemctl stop tomcat1

# 2. Watch the Nginx error log: requests hitting the stopped server show
#    up as connection errors, and after max_fails failures the server is
#    excluded for fail_timeout
tail -f /var/log/nginx/error.log
# [error] connect() failed (111: Connection refused) while connecting to upstream

# 3. Recover server 1
sudo systemctl start tomcat1

# 4. After fail_timeout expires, Nginx retries the server with a live
#    request; once it responds, it receives traffic again (open source
#    Nginx writes no explicit "recovered" log entry)
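During the simulation it helps to quantify how many requests hit the dead server; counting connect() failures in the error log gives a rough picture. The sample lines below are illustrative; point the grep at your real error.log:

```shell
# Two fake error-log lines standing in for /var/log/nginx/error.log
cat > /tmp/error.sample <<'EOF'
2024/01/01 10:00:01 [error] connect() failed (111: Connection refused) while connecting to upstream, upstream: "http://10.0.0.1:8080/"
2024/01/01 10:00:02 [error] connect() failed (111: Connection refused) while connecting to upstream, upstream: "http://10.0.0.1:8080/"
EOF
grep -c 'connect() failed' /tmp/error.sample   # 2
```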

Recommended Failure Detection Settings

upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;

    # Standby server: receives traffic only when all primary servers are unavailable
    server backup.example.com:8080 backup;

    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;

        error_page 502 503 504 @fallback;
    }

    location @fallback {
        # default_type sets the Content-Type for return's body
        default_type application/json;
        return 503 '{"error":"Service temporarily unavailable","retry_after":30}';
    }
}
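Because the fallback body is JSON, a monitor or client can extract the retry hint mechanically. With plain sed (avoiding the assumption that jq is installed):

```shell
# Pull retry_after out of the fallback body with POSIX tools only
body='{"error":"Service temporarily unavailable","retry_after":30}'
echo "$body" | sed -n 's/.*"retry_after":\([0-9]*\).*/\1/p'   # 30
```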

Common Mistakes and Solutions

Mistake 1: Applying IP Hash Globally

Enabling ip_hash everywhere to get sticky sessions instead of implementing session sharing.

Problem: In large NAT environments (corporate networks, mobile carriers), thousands of users share one public IP, so all of their traffic lands on the same server.

Solution: Share sessions via Redis and use Round Robin or Least Connections instead.

# Bad approach
upstream backend {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Good approach: Redis session sharing + Least Connections
upstream backend {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

Mistake 2: Missing Timeout Configuration

# Risky: relying on Nginx defaults (60s connect/read/send) - far too long for most APIs
location / {
    proxy_pass http://backend;
}

# Good: timeouts matched to each service's characteristics
location /api/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;    # Connection timeout (keep short)
    proxy_read_timeout 30s;      # Response wait (match API behavior)
    proxy_send_timeout 30s;
}

location /upload/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;
    proxy_read_timeout 300s;     # File uploads need a longer timeout
    proxy_send_timeout 300s;
    client_max_body_size 100m;
}

Mistake 3: Leaving Health Checks at Their Defaults

# Risky: default passive checks only - a single failed request ejects a
# server for 10s (max_fails=1, fail_timeout=10s), and there is no standby
upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Good: tuned passive health checks + standby server
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 backup;
}

Mistake 4: Not Setting Upstream Keepalive

# Bad: a new TCP connection is established for every request (performance degradation)
upstream backend {
    server 10.0.0.1:8080;
}

# Good: enable connection reuse
upstream backend {
    server 10.0.0.1:8080;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;         # keepalive requires HTTP/1.1...
        proxy_set_header Connection ""; # ...and a cleared Connection header
    }
}

Mistake 5: Load Balancer as Single Point of Failure

# Bad: a single load balancer is itself a SPOF
[Single Nginx LB] → App1, App2

# Good: dual load balancers with Keepalived + VRRP
[VIP: 192.168.1.100]
├── Nginx LB 1 (Active)  ← handles traffic normally
└── Nginx LB 2 (Standby) ← takes over automatically if LB1 fails
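The active/standby pair above can be wired up with a small keepalived.conf on each node. A minimal sketch for the active node (interface name, router ID, and priority are placeholder values; the standby uses state BACKUP and a lower priority):

```
vrrp_instance VI_1 {
    state MASTER
    interface eth0           # NIC that carries the VIP
    virtual_router_id 51     # must match on both nodes
    priority 100             # standby uses a lower value, e.g. 90
    advert_int 1
    virtual_ipaddress {
        192.168.1.100        # the VIP clients connect to
    }
}
```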

Load Balancer Monitoring

Prometheus + nginx-exporter

# docker-compose.yml
services:
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    command:
      - -nginx.scrape-uri=http://nginx:8080/nginx_status
    ports:
      - "9113:9113"
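The exporter's scrape URI assumes Nginx exposes stub_status on port 8080. A matching server block might look like this (port and path must agree with the scrape-uri; the allow range is a placeholder for the exporter's network):

```
server {
    listen 8080;
    location /nginx_status {
        stub_status;             # active connections, accepts, handled, etc.
        allow 172.16.0.0/12;     # restrict to the exporter's network
        deny all;
    }
}
```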

Key metrics to monitor:

Metric                    Threshold (example)   Meaning
Active connections        > 10,000              Sign of server overload
5xx error rate            > 1%                  Backend failure
Upstream response time    > 500 ms              Backend performance degradation
Excluded server count     > 0                   Failed server detected

Log Analysis for Imbalance Detection

# Assumes a log_format whose line ends with "$upstream_response_time $upstream_addr"

# Aggregate request count per upstream server
awk '{print $NF}' /var/log/nginx/access.log \
    | sort | uniq -c | sort -rn

# Average response time per upstream server
awk '{print $(NF-1), $NF}' /var/log/nginx/access.log \
    | awk '{sum[$2]+=$1; cnt[$2]++} END {for (k in sum) print sum[k]/cnt[k], k}' \
    | sort -n

# Find servers generating 5xx errors
grep ' 5[0-9][0-9] ' /var/log/nginx/access.log \
    | awk '{print $NF}' | sort | uniq -c
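The pipelines above can be validated against synthetic data before pointing them at a real log:

```shell
# Three fake access-log lines ending in "response_time upstream_addr"
cat > /tmp/access.sample <<'EOF'
GET /a 200 0.120 10.0.0.1:8080
GET /b 200 0.080 10.0.0.1:8080
GET /c 200 0.300 10.0.0.2:8080
EOF
# Per-server request counts, busiest first
awk '{print $NF}' /tmp/access.sample | sort | uniq -c | sort -rn
```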

Automated Rollback

#!/bin/bash
# rollback.sh - auto rollback when the error rate exceeds a threshold
# deploy_new_version, get_error_rate_percent, and send_alert are
# placeholders for your pipeline's own functions.

THRESHOLD=5   # Roll back if the 5xx error rate exceeds 5%
CONFIG=/etc/nginx/conf.d/backend.conf
BACKUP=/etc/nginx/conf.d/backend.conf.bak

cp "$CONFIG" "$BACKUP"

deploy_new_version

# Watch for 3 minutes (6 checks, 30s apart)
for i in {1..6}; do
    sleep 30
    error_rate=$(get_error_rate_percent)

    if (( $(echo "$error_rate > $THRESHOLD" | bc -l) )); then
        echo "Error rate ${error_rate}% exceeds threshold. Rolling back..."
        cp "$BACKUP" "$CONFIG"
        nginx -s reload
        send_alert "Auto rollback triggered: error rate ${error_rate}%"
        exit 1
    fi
done

echo "Deploy successful."
rm "$BACKUP"

Production Load Balancer Checklist

  • max_fails and fail_timeout configured in upstream
  • Standby server (backup) or custom 503 page configured
  • keepalive configured for connection reuse
  • Per-service timeouts appropriately set
  • Health check endpoint implemented in the application
  • Retry configured with proxy_next_upstream
  • Load balancer itself is redundant (SPOF eliminated)
  • Metrics collected via stub_status or Prometheus exporter
  • Upstream response time logged ($upstream_response_time in log format)
  • Deployment automation includes rollback logic