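Load Balancing Pro Tips for Production

Getting a load balancer to work is different from running it reliably in production. This page collects practical tips: lessons learned in the field, common mistakes, gradual traffic shifting (canary deployment), and automated failure recovery.

Canary Deployment

Deploying a new version to every server at once is risky. A canary deployment rolls the new version out to a subset of servers (or a fraction of traffic) first so that problems surface early.

Canary Deployment with Nginx Weights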

# Stage 1: send about 5% of traffic to the new version
upstream backend {
    server app-v1-1.example.com weight=19;  # old version (38/40 = 95% across both v1 servers)
    server app-v1-2.example.com weight=19;
    server app-v2-1.example.com weight=2;   # new version (2/40 = 5%)
}

# Stage 2: if no problems appear, move to 50%
upstream backend {
    server app-v1-1.example.com weight=2;   # old version (2/4 = 50%)
    server app-v2-1.example.com weight=1;   # new version (2/4 = 50% across both v2 servers)
    server app-v2-2.example.com weight=1;
}

# Stage 3: full cutover
upstream backend {
    server app-v2-1.example.com;
    server app-v2-2.example.com;
    server app-v2-3.example.com;
}
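
The percentages in an upstream block follow directly from the weight ratios. As a quick sanity check before reloading, a small helper can compute the share one server receives. The function name canary_share is made up for this sketch, and the result is the long-run average that weighted round robin converges to, not a guarantee for any short window.

```shell
# canary_share TARGET_WEIGHT WEIGHT...
# Prints the percentage of requests a server with TARGET_WEIGHT receives when the
# upstream group consists of servers with the listed weights (target included).
canary_share() {
    target=$1
    shift
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    awk -v t="$target" -v s="$total" 'BEGIN { printf "%.1f\n", 100 * t / s }'
}

# Weights 19, 19 and 2: the server with weight 2 gets 2/40 of the traffic.
canary_share 2 19 19 2   # prints 5.0
```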

Between stages, monitor the error rate, response time, and business metrics. Roll back immediately if anything looks abnormal.

Header-Based Canary (New Version for Specific Users Only)

upstream backend_v1 {
    server app-v1-1.example.com;
    server app-v1-2.example.com;
}

upstream backend_v2 {
    server app-v2-1.example.com;
}

# Route to the new version when the X-Canary: true header is present (the map block goes in the http context)
map $http_x_canary $upstream_target {
    "true"  "backend_v2";
    default "backend_v1";
}

server {
    location / {
        proxy_pass http://$upstream_target;
    }
}

Only the QA team or internal staff set the X-Canary: true header, so only they reach the new version.
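
One way to confirm the routing is to send the same request with and without the header and compare which version answers. The sketch below assumes the backends identify themselves in a response header called X-App-Version. That header, the function name served_version, and the host lb.example.com are all hypothetical and not part of the configuration above.

```shell
# served_version URL [EXTRA_HEADER]
# Prints the value of the (hypothetical) X-App-Version response header.
served_version() {
    url=$1
    if [ -n "${2:-}" ]; then
        curl -s -o /dev/null -D - -H "$2" "$url"
    else
        curl -s -o /dev/null -D - "$url"
    fi | tr -d '\r' | awk -F': ' 'tolower($1) == "x-app-version" { print $2 }'
}

# Usage (placeholder host):
#   served_version http://lb.example.com/                    # should name the old version
#   served_version http://lb.example.com/ 'X-Canary: true'   # should name the new version
```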

Automating Gradual Traffic Shifting

The following script lets a deployment pipeline change the Nginx configuration automatically.

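#!/bin/bash
# canary-deploy.sh: shift traffic to the new server in stages (10% -> 30% -> 50% -> 100%)
set -uo pipefail

NEW_SERVER=${1:?usage: canary-deploy.sh <new-server>}   # e.g. app-v2-1.example.com
CONFIG=/etc/nginx/conf.d/backend.conf
LOG=/var/log/deploy/canary.log
MAX_ERROR_RATIO=0.01   # roll back when more than 1% of requests return 5xx

mkdir -p "$(dirname "$LOG")"
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" | tee -a "$LOG"; }

# Nginx rejects weight=0, so a server that should receive no traffic is marked "down".
server_line() {
    if [ "$2" -gt 0 ]; then
        echo "    server $1 weight=$2;"
    else
        echo "    server $1 down;"
    fi
}

deploy_canary() {
    local weight_old=$1 weight_new=$2
    cp "$CONFIG" "$CONFIG.prev" 2>/dev/null || true

    {
        echo "upstream backend {"
        server_line app-v1-1.example.com "$weight_old"
        server_line app-v1-2.example.com "$weight_old"
        server_line "$NEW_SERVER" "$weight_new"
        echo "    keepalive 32;"
        echo "}"
    } > "$CONFIG"

    # Reload only if the generated config passes the syntax check
    if nginx -t && nginx -s reload; then
        log "Traffic split: old=${weight_old} (x2 servers) new=${weight_new}"
    else
        log "Nginx rejected the new config. Restoring the previous file."
        [ -f "$CONFIG.prev" ] && cp "$CONFIG.prev" "$CONFIG"
        return 1
    fi
}

check_error_rate() {
    # Share of 5xx responses over the last minute (0.01 = 1%).
    # Assumes the exporter in use publishes nginx_http_requests_total with a "status" label.
    local query='sum(rate(nginx_http_requests_total{status=~"5.."}[1m])) / sum(rate(nginx_http_requests_total[1m]))'
    curl -sf "http://prometheus:9090/api/v1/query" --data-urlencode "query=${query}" \
        | jq -r '.data.result[0].value[1] // "0"'   # an empty result (no 5xx series) counts as 0
}

# Canary stages in tenths of total traffic. With two old servers at weight (10 - n)
# and one new server at weight 2n, the new server receives n/10 of the requests.
STEPS=(1 3 5)

for step in "${STEPS[@]}"; do
    deploy_canary $((10 - step)) $((2 * step)) || exit 1

    log "Waiting 5 minutes at ${step}0% canary..."
    sleep 300

    if ! error_rate=$(check_error_rate); then
        log "Could not read the error rate. Rolling back..."
        deploy_canary 10 0
        exit 1
    fi
    log "Error rate: $error_rate"

    # awk handles the exponent notation Prometheus may return (bc does not)
    if awk -v r="$error_rate" -v max="$MAX_ERROR_RATIO" 'BEGIN { exit !(r + 0 > max + 0) }'; then
        log "ERROR RATE TOO HIGH. Rolling back..."
        deploy_canary 10 0
        exit 1
    fi
done

log "Canary deploy successful. Switching to 100%."
deploy_canary 0 20 || exit 1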

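Automatic Removal and Recovery of Failed Nodes

Simulating a Failure with Nginx

# 1. Force server 1 down
sudo systemctl stop tomcat1

# 2. Watch the Nginx error log to confirm the server is taken out of rotation
tail -f /var/log/nginx/error.log
# [error] ... connect() failed (111: Connection refused) while connecting to upstream ...
# [warn] ... upstream server temporarily disabled while connecting to upstream ...

# 3. Bring server 1 back
sudo systemctl start tomcat1

# 4. After fail_timeout, confirm that requests reach 10.0.0.1:8080 again.
#    Open-source Nginx does not log a recovery message for passive checks, so watch
#    the access log instead (this assumes the log_format includes $upstream_addr).
tail -f /var/log/nginx/access.log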
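
While the backend is down it also helps to see what clients actually receive. The sketch below sends a fixed number of requests through the load balancer and tallies the status codes. The function name status_tally and the host lb.example.com are placeholders for this example.

```shell
# status_tally URL [COUNT]
# Sends COUNT requests (default 20) and prints each HTTP status with how often it came back.
status_tally() {
    url=$1
    count=${2:-20}
    i=0
    while [ "$i" -lt "$count" ]; do
        # curl prints 000 when no HTTP response was received at all
        curl -s -o /dev/null --max-time 5 -w '%{http_code}\n' "$url"
        i=$((i + 1))
    done | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (placeholder host): status_tally http://lb.example.com/ 50
```

If failover works as intended, the tally should stay at 200 while one backend is stopped, because failed attempts are retried on the next server.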

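Recommended Failure Detection Settings in Nginx

upstream backend {
    # 3 failures within 30 seconds -> excluded for 30 seconds
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;

    # Backup server: used only when all primary servers are unavailable
    server backup.example.com:8080 backup;

    # Keep up to 32 idle upstream connections open per worker process
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;          # required for upstream keepalive
        proxy_set_header Connection "";
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 5s;

        # Custom error response when no upstream can be reached
        error_page 502 503 504 @fallback;
    }

    location @fallback {
        # add_header is skipped on 503 unless "always" is given, so set the type here
        default_type application/json;
        return 503 '{"error":"Service temporarily unavailable","retry_after":30}';
    }
}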

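Common Mistakes and Fixes

Mistake 1: Applying IP Hash to Every Request

Some teams apply ip_hash everywhere because they need sticky sessions and have no shared session store.

Problem: behind large NATs (corporate networks, mobile carriers), thousands of users share one IP address, so their load concentrates on a single server.

Fix: share sessions through Redis and use Round Robin or Least Connection.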

# ❌ Wrong approach
upstream backend {
    ip_hash;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# ✅ Right approach: Redis session sharing + Least Connection
upstream backend {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}
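
Whether NAT concentration affects a particular site can be estimated from the access log. For IPv4 clients, ip_hash keys on the first three octets of the address, so the sketch below groups requests by /24 prefix. It assumes the client address is the first field of each log line, as in the default combined format. The function name top_client_prefixes is made up for this example.

```shell
# top_client_prefixes LOGFILE [N]
# Groups requests by the first three octets of the client IPv4 address and prints
# the N busiest prefixes as "hits share prefix".
top_client_prefixes() {
    logfile=$1
    n=${2:-5}
    awk '
        $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ {
            split($1, o, ".")
            hits[o[1] "." o[2] "." o[3]]++
            total++
        }
        END {
            for (p in hits)
                printf "%d %.1f%% %s.0/24\n", hits[p], 100 * hits[p] / total, p
        }
    ' "$logfile" | sort -rn | head -n "$n"
}

# Usage: top_client_prefixes /var/log/nginx/access.log 10
```

A single prefix holding a large share of requests is a sign that ip_hash would pin that whole group to one backend.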

Mistake 2: Missing Timeout Settings

# ❌ No timeouts set: Nginx defaults apply (60s connect, 60s read, 60s send)
location / {
    proxy_pass http://backend;
}

# ✅ Timeouts chosen to fit the service
location /api/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;    # connection timeout (keep it short)
    proxy_read_timeout 30s;      # wait for the response (tune to the API)
    proxy_send_timeout 30s;
}

location /upload/ {
    proxy_pass http://backend;
    proxy_connect_timeout 5s;
    proxy_read_timeout 300s;     # file uploads need longer
    proxy_send_timeout 300s;
    client_max_body_size 100m;
}
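
A timeout that fits the service can be grounded in measured response times, for example by choosing a read timeout comfortably above the observed 99th percentile. The helper below is a minimal sketch of a nearest-rank percentile over numbers read from standard input. The name percentile is made up for this example, and P is expected to be a whole number.

```shell
# percentile P
# Reads one number per line on stdin and prints the P-th percentile (nearest-rank method).
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for whole-number p
            if (rank < 1) rank = 1
            print v[rank]
        }
    '
}

# Usage, assuming $upstream_response_time is the second-to-last field of the log format:
#   awk '$(NF-1) ~ /^[0-9.]+$/ { print $(NF-1) }' /var/log/nginx/access.log | percentile 99
```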

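Mistake 3: Running Without Health Checks

# ❌ No explicit failure handling: Nginx falls back to its defaults
#    (max_fails=1, fail_timeout=10s), so a dead server is retried every 10 seconds
#    and there is no backup when all servers fail
upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

# ✅ Tuned passive health checks + a backup server
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 backup;
}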

Mistake 4: No Upstream Keepalive

# ❌ A new TCP connection is opened for every request (slower)
upstream backend {
    server 10.0.0.1:8080;
}

# ✅ Reuse connections
upstream backend {
    server 10.0.0.1:8080;
    keepalive 32;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;         # required for keepalive
        proxy_set_header Connection ""; # clear the Connection: close header
    }
}
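
Whether connections are really being reused can be checked on the Nginx host by looking at the TCP states of connections to the backend port. The sketch below summarizes `ss -tan` output. The function name conn_states is made up, and it assumes the usual column layout in which the peer address is the fifth field.

```shell
# conn_states PORT
# Reads `ss -tan` output on stdin and counts TCP states for connections whose
# peer port is PORT (the backend port as seen from the Nginx host).
conn_states() {
    awk -v port="$1" '
        NR > 1 && $5 ~ (":" port "$") { n[$1]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Usage on the Nginx host: ss -tan | conn_states 8080
```

With keepalive in effect one would expect a small, fairly stable number of ESTAB connections. Without it, connections churn constantly, and closed ones tend to accumulate in TIME-WAIT on whichever side closes first.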

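Mistake 5: The Load Balancer Itself as a SPOF

# ❌ A single load balancer is itself a single point of failure
[Single Nginx LB] → App1, App2

# ✅ Make the load balancer redundant with Keepalived + VRRP
[VIP: 192.168.1.100]
  ├── Nginx LB 1 (Active) ← handles traffic in normal operation
  └── Nginx LB 2 (Standby) ← takes over automatically when LB 1 fails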
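
Keepalived can base its failover decision on a check script that it runs at intervals (configured in a vrrp_script block), treating a non-zero exit code as a failure. A minimal sketch of such a check follows. It assumes a stub_status endpoint at http://localhost:8080/nginx_status, and the function name check_nginx is made up for this example.

```shell
# check_nginx [URL]
# Returns 0 when the local Nginx answers the status URL within 2 seconds,
# and curl's non-zero exit code otherwise.
check_nginx() {
    url=${1:-http://localhost:8080/nginx_status}
    curl -fsS --max-time 2 -o /dev/null "$url"
}

# When saved as a standalone check script, end the file with:
#   check_nginx
#   exit $?
```

Checking an HTTP endpoint rather than the process list also catches the case where Nginx is running but no longer accepting connections.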

Load Balancer Monitoring

Essential Metrics

# Collect basic metrics with Nginx stub_status
curl http://localhost:8080/nginx_status
# Active connections: 291
# server accepts handled requests
#  16630948 16630948 31070465
# Reading: 6 Writing: 179 Waiting: 106
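
For ad hoc checks or simple scripts, the stub_status text can be turned into name/value pairs. The parser below is a sketch that assumes the standard four-line stub_status layout. The function name parse_stub_status is made up for this example.

```shell
# parse_stub_status
# Reads stub_status output on stdin and prints one "name value" pair per line.
parse_stub_status() {
    awk '
        /^Active connections:/ { print "active", $3 }
        /^ *[0-9]+ +[0-9]+ +[0-9]+/ {
            print "accepts", $1
            print "handled", $2
            print "requests", $3
        }
        /^Reading:/ { print "reading", $2; print "writing", $4; print "waiting", $6 }
    '
}

# Usage: curl -s http://localhost:8080/nginx_status | parse_stub_status
```

A gap between accepts and handled is worth alerting on, since the two counters normally stay equal unless a resource limit such as worker_connections has been reached.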

Automatic collection with Prometheus + nginx-exporter:

# docker-compose.yml
services:
  nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    command:
      # exporter 1.0 and later expects double-dash flags
      - --nginx.scrape-uri=http://nginx:8080/nginx_status
    ports:
      - "9113:9113"

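Key metrics to monitor:

Metric	Threshold (example)	Meaning
Active connections	> 10,000	Sign of server overload
5xx error rate	> 1%	Backend failure
Upstream response time	> 500ms	Backend performance degradation
Excluded servers	> 0	A server has failed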

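Detecting Imbalance Through Log Analysis

# The commands below assume a log_format whose last two fields are
# $upstream_response_time and $upstream_addr, in that order.

# Count requests per upstream server
awk '{print $NF}' /var/log/nginx/access.log \
    | sort | uniq -c | sort -rn

# Average response time per upstream server
awk '{print $(NF-1), $NF}' /var/log/nginx/access.log \
    | awk '{sum[$2]+=$1; cnt[$2]++} END {for(k in sum) print sum[k]/cnt[k], k}' \
    | sort -n

# Find the servers returning 5xx errors
grep ' 5[0-9][0-9] ' /var/log/nginx/access.log \
    | awk '{print $NF}' | sort | uniq -c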
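
Raw counts are harder to compare than percentages, especially across log files of different sizes. The sketch below turns a stream of upstream addresses into each server's share of requests. The function name upstream_share is made up for this example, and it relies on the same assumption that $upstream_addr is the last log field.

```shell
# upstream_share
# Reads one upstream address per line on stdin and prints "share address",
# busiest server first.
upstream_share() {
    awk '
        NF { hits[$1]++; total++ }
        END { for (u in hits) printf "%.1f%% %s\n", 100 * hits[u] / total, u }
    ' | sort -rn
}

# Usage: awk '{print $NF}' /var/log/nginx/access.log | upstream_share
```

With equal weights and round robin, the shares should be close to each other. A persistent skew points at ip_hash concentration, uneven weights, or a server that keeps dropping out of rotation.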

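Automating Deployment Rollback

#!/bin/bash
# rollback.sh: roll back automatically when the error rate exceeds a threshold
set -uo pipefail

THRESHOLD=5     # roll back when the 5xx error rate exceeds 5%
CONFIG=/etc/nginx/conf.d/backend.conf
BACKUP=/etc/nginx/conf.d/backend.conf.bak

# Placeholder hooks (not defined on this page): replace them with your own implementation.
deploy_new_version()     { echo "TODO: write the new upstream config and reload Nginx" >&2; return 1; }
get_error_rate_percent() { echo "TODO: print the current 5xx error rate in percent" >&2; return 1; }
send_alert()             { echo "ALERT: $*" >&2; }

# Restore the backed-up config, reload Nginx, alert, and stop
rollback() {
    cp "$BACKUP" "$CONFIG"
    nginx -t && nginx -s reload
    send_alert "Auto rollback triggered: $1"
    exit 1
}

# Back up the current config before deploying
cp "$CONFIG" "$BACKUP" || exit 1

# Apply the new config
deploy_new_version || rollback "deploy step failed"

# Monitor the error rate for 3 minutes (6 checks, 30 seconds apart)
for i in {1..6}; do
    sleep 30
    error_rate=$(get_error_rate_percent) || rollback "could not read the error rate"

    if (( $(echo "$error_rate > $THRESHOLD" | bc -l) )); then
        echo "Error rate ${error_rate}% exceeds threshold. Rolling back..."
        rollback "error rate ${error_rate}%"
    fi
done

echo "Deploy successful. Cleaning up backup."
rm "$BACKUP"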
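
The error-rate measurement that the rollback script compares against THRESHOLD is left open above. One possible way to compute it, sketched here, is to count 5xx responses among the most recent access log lines. The function name error_rate_percent is made up, and the sketch assumes the default combined log format, where the status code is the ninth whitespace-separated field.

```shell
# error_rate_percent LOGFILE [LINES]
# Prints the percentage of 5xx responses among the last LINES requests (default 1000).
error_rate_percent() {
    logfile=$1
    lines=${2:-1000}
    tail -n "$lines" "$logfile" | awk '
        { total++ }
        $9 ~ /^5[0-9][0-9]$/ { errors++ }
        END { printf "%.2f\n", (total ? 100 * errors / total : 0) }
    '
}

# Usage: error_rate_percent /var/log/nginx/access.log 500
```

A fixed line count is a rough stand-in for a time window. On a busy server the last 1000 lines may cover only a few seconds, so a metrics backend such as Prometheus is usually a better source when one is available.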
