# Load Balancing Production Tips
Making a load balancer work and running it stably in production are two different things. This page covers field-tested know-how, common mistakes, canary deployment for gradual traffic shifting, and automated failure recovery.
## Canary Deployment
Deploying a new version to all servers at once is risky. Canary deployment deploys the new version to a subset of servers (or a fraction of traffic) first, allowing early detection of problems.
### Nginx Weight-based Canary Deployment

```nginx
# Step 1: Send only 5% of traffic to the new-version server
upstream backend {
    server app-v1-1.example.com weight=19;  # Old version (38/40 = 95%)
    server app-v1-2.example.com weight=19;
    server app-v2-1.example.com weight=2;   # New version (2/40 = 5%)
}

# Step 2: Expand the new version if no issues (v2 now receives 2/3 of traffic)
upstream backend {
    server app-v1-1.example.com weight=1;
    server app-v2-1.example.com weight=1;
    server app-v2-2.example.com weight=1;
}

# Step 3: Complete the transition
upstream backend {
    server app-v2-1.example.com;
    server app-v2-2.example.com;
    server app-v2-3.example.com;
}
```
Monitor error rates, response times, and business metrics between each step. Roll back immediately if anomalies are detected.
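When planning each step, remember that a server's effective share is its weight divided by the sum of all weights in the upstream. A tiny helper (hypothetical, just for the arithmetic) makes this easy to sanity-check before reloading Nginx:

```shell
# Effective traffic share of the canary, given its weight and the total weight
share() { awk -v new="$1" -v total="$2" 'BEGIN { printf "%.1f%%\n", 100 * new / total }'; }

share 2 40   # Step 1: weight 2 out of 19+19+2  → 5.0%
share 2 3    # Step 2: the two v2 servers hold 2 of 3 total weight → 66.7%
```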
### Header-based Canary (Only Specific Users to New Version)

```nginx
upstream backend_v1 {
    server app-v1-1.example.com;
    server app-v1-2.example.com;
}

upstream backend_v2 {
    server app-v2-1.example.com;
}

# Route to the new version if the "X-Canary: true" header is present
map $http_x_canary $upstream_target {
    "true"  "backend_v2";
    default "backend_v1";
}

server {
    location / {
        proxy_pass http://$upstream_target;
    }
}
```
QA teams and internal users set the X-Canary: true header to test the new version.
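For browser-based testing where injecting a header is awkward, the same `map` pattern can key off a cookie instead. A sketch, assuming your application sets a `canary` cookie for opted-in users:

```nginx
# Route requests carrying "canary=true" cookie to the new version
map $cookie_canary $upstream_target {
    "true"  "backend_v2";
    default "backend_v1";
}
```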
### Gradual Traffic Migration Automation
A script to automatically update Nginx configuration in the deployment pipeline.
```bash
#!/bin/bash
# canary-deploy.sh — move traffic to the new version in stages
NEW_SERVER=$1
CONFIG=/etc/nginx/conf.d/backend.conf
LOG=/var/log/deploy/canary.log

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }

# Nginx rejects weight=0, so emit "down" to drain a server instead
weight_or_down() { if [ "$1" -eq 0 ]; then echo "down"; else echo "weight=$1"; fi; }

deploy_canary() {
    local weight_old=$1 weight_new=$2
    cat > "$CONFIG" <<EOF
upstream backend {
    server app-v1-1.example.com $(weight_or_down "$weight_old");
    server app-v1-2.example.com $(weight_or_down "$weight_old");
    server ${NEW_SERVER} $(weight_or_down "$weight_new");
    keepalive 32;
}
EOF
    nginx -t && nginx -s reload
    log "Traffic split: old=${weight_old} new=${weight_new}"
}

check_error_rate() {
    local rate
    rate=$(curl -s "http://prometheus:9090/api/v1/query" \
        --data-urlencode 'query=rate(nginx_http_requests_total{status=~"5.."}[60s])' \
        | jq -r '.data.result[0].value[1] // "0"')
    echo "$rate"
}

STEPS=(1 3 5 10)
for step in "${STEPS[@]}"; do
    old=$((10 - step))
    deploy_canary "$old" "$step"
    log "Waiting 5 minutes at canary weight ${step} (old servers at ${old} each)..."
    sleep 300
    error_rate=$(check_error_rate)
    log "Error rate: $error_rate"
    if (( $(echo "$error_rate > 0.01" | bc -l) )); then
        log "ERROR RATE TOO HIGH. Rolling back..."
        deploy_canary 10 0
        exit 1
    fi
done

log "Canary deploy successful. Switching to 100%."
deploy_canary 0 10
```
## Automatic Node Removal and Recovery

### Simulating an Nginx Failover
```bash
# 1. Force backend server 1 down
sudo systemctl stop tomcat1

# 2. Watch Nginx logs: after max_fails failures the server is excluded
tail -f /var/log/nginx/error.log
# [error] connect() failed (111: Connection refused) while connecting to upstream
# [warn]  upstream server temporarily disabled while connecting to upstream

# 3. Recover server 1
sudo systemctl start tomcat1

# 4. After fail_timeout elapses, Nginx silently retries the server and,
#    on success, returns it to rotation (no explicit log entry by default)
```
### Optimal Failure Detection Settings

```nginx
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    # Standby server: receives traffic only when all primary servers are down
    server backup.example.com:8080 backup;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;
        error_page 502 503 504 @fallback;
    }

    location @fallback {
        default_type application/json;
        return 503 '{"error":"Service temporarily unavailable","retry_after":30}';
    }
}
```
## Common Mistakes and Solutions

### Mistake 1: Applying IP Hash Globally

Using `ip_hash` everywhere because sticky sessions are needed but session sharing is not in place.

**Problem:** In large NAT environments (corporate networks, mobile carriers), thousands of users share the same public IP, so their traffic concentrates on a single server.

**Solution:** Share sessions via Redis and use Round Robin or Least Connections instead.
```nginx
# Bad approach
upstream backend {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Good approach: Redis session sharing + Least Connections
upstream backend {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}
```
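What "Redis session sharing" looks like depends on the application stack. As one assumed example, a Spring Boot 2.x backend with `spring-session-data-redis` on the classpath can externalize sessions through configuration alone (hostnames are placeholders):

```properties
# application.properties — assumed Spring Boot 2.x backend using Spring Session
spring.session.store-type=redis
spring.redis.host=redis.example.com
spring.redis.port=6379
```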
### Mistake 2: Missing Timeout Configuration

```nginx
# Bad: no timeout settings (Nginx defaults: 60s connect/read/send)
location / {
    proxy_pass http://backend;
}

# Good: timeouts matched to each service's characteristics
location /api/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;   # Connection timeout (keep short)
    proxy_read_timeout 30s;     # Response wait (match API behavior)
    proxy_send_timeout 30s;
}

location /upload/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;
    proxy_read_timeout 300s;    # File uploads need a longer timeout
    proxy_send_timeout 300s;
    client_max_body_size 100m;
}
```
### Mistake 3: Running Without Health Checks

```nginx
# Bad: no health checks → requests keep going to dead servers
upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# Good: passive health check + standby server
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 backup;
}
```
### Mistake 4: Not Setting Upstream Keepalive

```nginx
# Bad: a new TCP connection is established for every request (performance degradation)
upstream backend {
    server 10.0.0.1:8080;
}

# Good: enable connection reuse
upstream backend {
    server 10.0.0.1:8080;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;          # keepalive requires HTTP/1.1 to the upstream
        proxy_set_header Connection "";  # clear the Connection header
    }
}
```
### Mistake 5: Load Balancer as a Single Point of Failure

```
# Bad: a single load balancer is itself a SPOF
[Single Nginx LB] → App1, App2

# Good: dual load balancers with Keepalived + VRRP
[VIP: 192.168.1.100]
 ├── Nginx LB 1 (Active)  → handles traffic normally
 └── Nginx LB 2 (Standby) → takes over automatically if LB1 fails
```
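The Active/Standby pair above is typically implemented with Keepalived. A minimal sketch of the MASTER side (interface name, router ID, and password are placeholders; the standby node uses `state BACKUP` and a lower priority, e.g. 90):

```
# /etc/keepalived/keepalived.conf on Nginx LB 1
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass s3cret
    }
    virtual_ipaddress {
        192.168.1.100
    }
}
```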
## Load Balancer Monitoring

### Prometheus + nginx-exporter

```yaml
# docker-compose.yml
services:
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    command:
      - -nginx.scrape-uri=http://nginx:8080/nginx_status
    ports:
      - "9113:9113"
```
Key metrics to monitor:
| Metric | Threshold (example) | Meaning |
|---|---|---|
| Active connections | > 10,000 | Server overload sign |
| 5xx error rate | > 1% | Backend failure |
| Upstream response time | > 500ms | Backend performance degradation |
| Excluded server count | > 0 | Failed server detected |
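The thresholds above can be codified as Prometheus alerting rules. A sketch: `nginx_connections_active` is exported by nginx-prometheus-exporter, but `stub_status` carries no per-status counts, so the 5xx-rate metric name below is a placeholder that assumes status codes are exported separately (e.g. from access logs):

```yaml
groups:
  - name: load-balancer
    rules:
      - alert: NginxHighActiveConnections
        expr: nginx_connections_active > 10000
        for: 5m
        labels: { severity: warning }
      - alert: NginxHigh5xxRate
        # placeholder metric — requires log-based status-code export
        expr: |
          sum(rate(nginx_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(nginx_http_requests_total[5m])) > 0.01
        for: 5m
        labels: { severity: critical }
```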
### Log Analysis for Imbalance Detection

```bash
# Request count per upstream server
# (assumes $upstream_addr is the last field of the access-log format)
awk '{print $NF}' /var/log/nginx/access.log \
    | sort | uniq -c | sort -rn

# Average response time per upstream server
# (assumes $upstream_response_time and $upstream_addr are the last two fields)
awk '{print $(NF-1), $NF}' /var/log/nginx/access.log \
    | awk '{sum[$2]+=$1; cnt[$2]++} END {for (k in sum) print sum[k]/cnt[k], k}' \
    | sort -n

# Find servers generating 5xx errors
grep ' 5[0-9][0-9] ' /var/log/nginx/access.log \
    | awk '{print $NF}' | sort | uniq -c
```
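These one-liners assume an access-log format whose last two fields are the upstream response time and the upstream address. A matching `log_format` (the name `upstream_log` is an arbitrary choice):

```nginx
log_format upstream_log '$remote_addr - [$time_local] "$request" $status '
                        '$upstream_response_time $upstream_addr';
access_log /var/log/nginx/access.log upstream_log;
```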
### Automated Rollback

```bash
#!/bin/bash
# rollback.sh — automatic rollback when the error rate exceeds a threshold
THRESHOLD=5  # Roll back if the 5xx error rate exceeds 5%
CONFIG=/etc/nginx/conf.d/backend.conf
BACKUP=/etc/nginx/conf.d/backend.conf.bak

cp "$CONFIG" "$BACKUP"
deploy_new_version  # environment-specific deploy hook

# Watch the error rate for 3 minutes (6 checks, 30s apart)
for i in {1..6}; do
    sleep 30
    error_rate=$(get_error_rate_percent)  # environment-specific metric hook
    if (( $(echo "$error_rate > $THRESHOLD" | bc -l) )); then
        echo "Error rate ${error_rate}% exceeds threshold. Rolling back..."
        cp "$BACKUP" "$CONFIG"
        nginx -s reload
        send_alert "Auto rollback triggered: error rate ${error_rate}%"
        exit 1
    fi
done

echo "Deploy successful."
rm "$BACKUP"
```
Production Load Balancer Checklistβ
-
max_failsandfail_timeoutconfigured in upstream - Standby server (backup) or custom 503 page configured
-
keepaliveconfigured for connection reuse - Per-service timeouts appropriately set
- Health check endpoint implemented in the application
- Retry configured with
proxy_next_upstream - Load balancer itself is redundant (SPOF eliminated)
- Metrics collected via
stub_statusor Prometheus exporter - Upstream response time logged (
$upstream_response_timein log format) - Deployment automation includes rollback logic