Pro Tips — Chaos Engineering for HA Validation and Writing Failure Runbooks

What Is Chaos Engineering?

Chaos Engineering is the practice of proactively experimenting on a system to build confidence in its ability to withstand turbulent production conditions. Netflix's Chaos Monkey, developed in 2011 as the company migrated its infrastructure to AWS, is the defining example: it randomly terminates server instances in production to verify that the overall system tolerates the loss.

Core Principles

  • Experiment in production: True resilience can only be validated against real traffic environments
  • Expand gradually: Start with small-scope experiments and progressively widen the blast radius
  • Abort conditions: Immediately halt an experiment if predefined SLOs (Service Level Objectives) are violated
  • Run continuously: Treat it as an ongoing validation process, not a one-time event
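The abort-condition principle can be sketched as a small guard that runs between experiment steps. The 1% error budget and the way the error rate is obtained here are illustrative assumptions:

```shell
#!/bin/bash
# abort-guard.sh — minimal sketch of an SLO abort condition for a chaos
# experiment; the error budget value is an illustrative assumption.

ERROR_BUDGET_PCT="1"   # abort when the 5xx error rate exceeds 1%

# should_abort RATE_PCT -> true (exit 0) when the rate exceeds the budget
should_abort() {
  # awk handles the floating-point comparison that bash [ ] cannot
  awk -v r="$1" -v b="$ERROR_BUDGET_PCT" 'BEGIN { exit !(r > b) }'
}

# In a real run the rate would come from monitoring; sample value here:
if should_abort "2.5"; then
  echo "ABORT: SLO violated, halting experiment"
else
  echo "CONTINUE: within error budget"
fi
```

In a full experiment loop this guard would run after every injection step, so a violated SLO stops the blast radius from widening.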

Netflix GameDays

A GameDay is a planned failure drill involving the entire team. Various failure scenarios are intentionally triggered in a real production environment, and the team measures how quickly it can respond and recover.

Chaos Test Scripts

Random Container Kill Test

#!/bin/bash
# chaos-container-kill.sh — Validate HA by stopping a random container

NAMESPACE="production"
EXCLUDE_CONTAINERS="nginx|database|redis"
LOG_FILE="/var/log/chaos/chaos-$(date '+%Y%m%d_%H%M%S').log"

mkdir -p /var/log/chaos

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"; }

# Precondition check (validate minimum service health)
pre_check() {
  local HEALTH_URL="http://localhost/health"
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")
  if [ "$HTTP_STATUS" != "200" ]; then
    log "Precondition failed: service is already unhealthy (HTTP $HTTP_STATUS)"
    log "Chaos experiment canceled"
    exit 1
  fi
  log "Precondition met: service healthy (HTTP $HTTP_STATUS)"
}

# Select and stop a random container
chaos_kill_container() {
  # Running containers, excluding the exclusion list
  CONTAINERS=$(docker ps --format "{{.Names}}" | grep -vE "$EXCLUDE_CONTAINERS")

  # Note: test the string, not `wc -l` — echo of an empty string still counts as 1 line
  if [ -z "$CONTAINERS" ]; then
    log "No containers available to stop"
    exit 0
  fi
  CONTAINER_COUNT=$(echo "$CONTAINERS" | wc -l)

  # Random selection (sed -n "Np" prints the Nth line)
  RANDOM_INDEX=$((RANDOM % CONTAINER_COUNT + 1))
  TARGET_CONTAINER=$(echo "$CONTAINERS" | sed -n "${RANDOM_INDEX}p")

  log "Chaos experiment start: stopping container '${TARGET_CONTAINER}'"
  docker stop "$TARGET_CONTAINER"

  log "Container stopped. Starting recovery monitoring..."
}

# Monitor service recovery
monitor_recovery() {
  local START_TIME
  START_TIME=$(date +%s)
  local HEALTH_URL="http://localhost/health"
  local MAX_WAIT=120

  while true; do
    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL" || echo "000")
    ELAPSED=$(($(date +%s) - START_TIME))

    if [ "$HTTP_STATUS" == "200" ]; then
      log "Service recovered! (${ELAPSED}s elapsed)"
      break
    fi

    if [ "$ELAPSED" -ge "$MAX_WAIT" ]; then
      log "WARNING: service did not recover within ${MAX_WAIT}s (HTTP $HTTP_STATUS)"
      log "Immediate manual intervention required!"
      exit 1
    fi

    log "Waiting for recovery... (${ELAPSED}s, HTTP $HTTP_STATUS)"
    sleep 5
  done
}

# Main execution
pre_check
chaos_kill_container
monitor_recovery
log "Chaos experiment complete. Log: $LOG_FILE"
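The random-selection technique used in chaos_kill_container (count the lines, roll a 1-based index, extract it with sed) can be exercised in isolation, which is useful before pointing it at real containers. The candidate names below are hypothetical:

```shell
#!/bin/bash
# pick_random_line — same selection technique as chaos_kill_container:
# count the lines, pick a 1-based random index, print that line with sed.
pick_random_line() {
  local list="$1"
  local count idx
  count=$(echo "$list" | wc -l)
  idx=$((RANDOM % count + 1))
  echo "$list" | sed -n "${idx}p"
}

# Hypothetical container names for a dry run:
CANDIDATES=$'app-1\napp-2\napp-3'
echo "Selected: $(pick_random_line "$CANDIDATES")"
```

Every run prints one of the three candidates, each with equal probability, which is exactly the behavior the kill script relies on.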

Network Delay Simulation with tc

#!/bin/bash
# chaos-network-delay.sh — Simulate network latency and packet loss

INTERFACE="eth0"    # Target network interface
TARGET_PORT="8080"  # Port of the health-check URL (netem below affects the whole interface)
DELAY_MS="200"      # Delay in milliseconds
JITTER_MS="50"      # Jitter (random variation)
PACKET_LOSS="5"     # Packet loss percentage
DURATION="60"       # Experiment duration (seconds)

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }

# Apply network chaos
apply_network_chaos() {
  log "Starting network delay simulation"
  log "Interface: $INTERFACE, Delay: ${DELAY_MS}ms+/-${JITTER_MS}ms, Packet loss: ${PACKET_LOSS}%"

  # Add qdisc rule (network delay + jitter + packet loss)
  tc qdisc add dev "$INTERFACE" root netem \
    delay "${DELAY_MS}ms" "${JITTER_MS}ms" \
    loss "${PACKET_LOSS}%"

  log "Network chaos applied"
}

# Check current network state
check_network_status() {
  log "Current tc rules:"
  tc qdisc show dev "$INTERFACE"
}

# Clean up network rules
cleanup_network_chaos() {
  log "Removing network chaos..."
  tc qdisc del dev "$INTERFACE" root 2>/dev/null || true
  log "Network normalized"
}

# Measure service response time during experiment
measure_response_time() {
  local HEALTH_URL="http://localhost:${TARGET_PORT}/health"
  local RESULT
  RESULT=$(curl -s -o /dev/null \
    -w "HTTP: %{http_code} | Total: %{time_total}s | DNS: %{time_namelookup}s | Connect: %{time_connect}s" \
    "$HEALTH_URL" 2>/dev/null)
  log "Service response: $RESULT"
}

# Always clean up even if experiment is interrupted
trap cleanup_network_chaos EXIT

apply_network_chaos
check_network_status

log "Experiment started. Monitoring service response for ${DURATION}s..."
END_TIME=$(($(date +%s) + DURATION))
while [ "$(date +%s)" -lt "$END_TIME" ]; do
  measure_response_time
  sleep 5
done

cleanup_network_chaos
trap - EXIT   # cleanup already done; prevent the EXIT trap from running it a second time
log "Network chaos experiment complete"
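While the experiment runs, a quick way to confirm the injected delay is actually visible end-to-end is to compare curl's time_total against a known-good baseline. The baseline and multiplier below are illustrative assumptions:

```shell
#!/bin/bash
# latency-check.sh — flag response times that exceed a multiple of the
# baseline; BASELINE_S and MAX_FACTOR are illustrative values.
BASELINE_S="0.050"   # assumed normal response time (50ms)
MAX_FACTOR="4"       # flag anything slower than 4x the baseline

# is_degraded SECONDS -> true when the measurement exceeds baseline * factor
is_degraded() {
  awk -v m="$1" -v b="$BASELINE_S" -v f="$MAX_FACTOR" 'BEGIN { exit !(m > b * f) }'
}

# In the real experiment, feed in curl's %{time_total}; sample values here:
for t in 0.045 0.120 0.310; do
  if is_degraded "$t"; then
    echo "${t}s: DEGRADED (consistent with the injected netem delay)"
  else
    echo "${t}s: OK"
  fi
done
```

Seeing DEGRADED while the qdisc is applied, and OK again after cleanup, confirms the chaos actually reached the application path rather than only the interface.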

Failure Runbook Template

A runbook is a standard procedure guide that on-call responders follow when an incident occurs. A well-written runbook enables fast and accurate responses even when an incident strikes at 3 AM.

Sample Runbook: Tomcat Not Responding

# Runbook: Tomcat Server Not Responding (P1)

## Symptoms
- Large number of Nginx 502 Bad Gateway responses
- Tomcat health check endpoint not responding
- Alert: "tomcat_health_check CRITICAL"

## Immediate Actions (within 5 minutes)

### Step 1: Assess the Situation
[ ] Check number of affected users (monitoring dashboard)
[ ] Check error start time (correlated with a recent deployment?)
[ ] Check Tomcat process status: systemctl status tomcat
[ ] Check port response: curl -v http://localhost:8080/health

### Step 2: Quick Recovery Attempt
[ ] Restart Tomcat: systemctl restart tomcat
[ ] Wait 60 seconds after restart
[ ] Re-check health: curl http://localhost:8080/health

### Step 3: If Restart Doesn't Resolve the Issue
[ ] Review logs: tail -n 200 /opt/tomcat/logs/catalina.out
[ ] Check for OOM: grep -i "OutOfMemoryError" /opt/tomcat/logs/catalina.out
[ ] Check disk space: df -h
[ ] Check JVM heap usage: jmap -heap $(pgrep -f catalina)  (JDK 8; on JDK 9+ use: jhsdb jmap --heap --pid $(pgrep -f catalina))

## Escalation
- Failure to recover in 5 minutes: page team lead
- Failure to recover in 15 minutes: page CTO + declare formal incident
- Continuing beyond 30 minutes: broadcast full outage notification
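The Step 1 checks lend themselves to a small collection script the on-call engineer can run once instead of typing four commands at 3 AM. The service name and log path follow this runbook and are assumptions about the host:

```shell
#!/bin/bash
# runbook-step1.sh — gather the Step 1 diagnostics in one pass.
# Service name and paths follow the runbook above; adjust per host.
collect() {
  local title="$1"; shift
  echo "=== $title ==="
  "$@" 2>&1 || echo "(command failed or unavailable)"
}

collect "Tomcat process status"  systemctl status tomcat --no-pager
collect "Port 8080 health check" curl -sv --max-time 5 http://localhost:8080/health
collect "Recent catalina.out"    tail -n 50 /opt/tomcat/logs/catalina.out
collect "Disk space"             df -h
```

Piping the output to `tee step1-$(date +%s).log` also preserves the evidence for the postmortem, since a restart in Step 2 destroys the live process state.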

Pre-Deployment Checklist

#!/bin/bash
# pre-deploy-check.sh — Required checks before deployment

PASS=0
FAIL=0

check() {
  local DESCRIPTION="$1"
  local COMMAND="$2"
  local EXPECTED="$3"

  ACTUAL=$(eval "$COMMAND" 2>/dev/null || echo "ERROR")
  if [ "$ACTUAL" == "$EXPECTED" ]; then
    echo "PASS: $DESCRIPTION"
    PASS=$((PASS + 1))
  else
    echo "FAIL: $DESCRIPTION (expected: $EXPECTED, actual: $ACTUAL)"
    FAIL=$((FAIL + 1))
  fi
}

echo "=== Pre-Deployment Checklist ==="

# 1. Service health
check "Tomcat service is running" \
  "systemctl is-active tomcat" "active"

check "Nginx service is running" \
  "systemctl is-active nginx" "active"

# 2. Disk space (less than 80%)
DISK_USAGE=$(df /opt/tomcat | awk 'NR==2{print $5}' | tr -d '%')
if [ "$DISK_USAGE" -lt 80 ]; then
  echo "PASS: Disk usage normal (${DISK_USAGE}%)"
  PASS=$((PASS + 1))
else
  echo "FAIL: Disk usage too high (${DISK_USAGE}%)"
  FAIL=$((FAIL + 1))
fi

# 3. Memory usage (less than 90%)
MEM_USAGE=$(free | awk '/Mem/{printf "%.0f", $3/$2*100}')
if [ "$MEM_USAGE" -lt 90 ]; then
  echo "PASS: Memory usage normal (${MEM_USAGE}%)"
  PASS=$((PASS + 1))
else
  echo "FAIL: Memory usage too high (${MEM_USAGE}%)"
  FAIL=$((FAIL + 1))
fi

# 4. Health check response
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health || echo "000")
check "Health check returns 200" "echo $HTTP_STATUS" "200"

echo ""
echo "=== Result: PASS $PASS / FAIL $FAIL ==="

if [ "$FAIL" -gt 0 ]; then
  echo "Pre-deployment checks failed. Do NOT proceed with deployment."
  exit 1
else
  echo "All checks passed. Safe to proceed with deployment."
  exit 0
fi

Post-Deployment Monitoring Points (Golden Signals)

Monitor these four signals after every deployment, based on the Four Golden Signals defined by Google SRE.

| Signal | Description | Normal Threshold | Alert Threshold |
|---|---|---|---|
| Latency | p99 response time | < 500ms | > 2000ms |
| Traffic | Requests per second (RPS) | Within ±20% of pre-deploy baseline | > ±50% deviation |
| Errors | 5xx error rate | < 0.1% | > 1% |
| Saturation | CPU/memory/thread pool usage | < 70% | > 90% |
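For the Errors signal, a rough 5xx rate can be computed straight from an Nginx access log with awk. The combined log format (status code in field 9) and the sample lines below are assumptions:

```shell
#!/bin/bash
# error-rate.sh — percentage of 5xx responses in access-log lines on stdin.
# Assumes Nginx's default combined format, where field 9 is the status code.
error_rate() {
  awk '{ total++; if ($9 ~ /^5/) errors++ }
       END { if (total) printf "%.2f\n", errors * 100 / total; else print "0.00" }'
}

# Hypothetical sample: 1 error out of 4 requests -> 25.00
printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512' \
  '10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET /api HTTP/1.1" 502 0' \
  '10.0.0.1 - - [01/Jan/2025:00:00:02 +0000] "GET /img HTTP/1.1" 200 128' \
  '10.0.0.1 - - [01/Jan/2025:00:00:03 +0000] "GET /css HTTP/1.1" 304 0' \
  | error_rate
```

Against the live log this might look like `tail -n 10000 /var/log/nginx/access.log | error_rate` (path assumed); compare the result against the 0.1% / 1% thresholds in the table.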

Zero-Downtime Deployment Anti-Patterns

The following behaviors are the most common anti-patterns that cause zero-downtime deployments to fail.

Things you must never do:

  1. Restoring a server to the load balancer before its health check passes: the most common mistake, sending traffic before the server is ready
  2. Setting drain time to zero: terminating Tomcat immediately after removing it from the Nginx upstream cuts off in-flight requests
  3. Deploying again immediately after a deployment: stacking releases before the previous one stabilizes makes root-cause analysis impossible
  4. Deploying without a rollback plan: proceeding without deciding in advance what to do if the post-deploy health check fails
  5. A staging environment that differs from production: the root cause of "works in staging, fails in production"
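Anti-patterns 1 and 2 share one cure: drain first, verify health, and only then restore traffic. A minimal sketch of that ordering, where every helper is a placeholder for the real upstream-toggle, deploy, and health-check mechanisms:

```shell
#!/bin/bash
# safe-order.sh — sketch of the correct zero-downtime ordering. All four
# helpers are placeholders; DRAIN_SECONDS is shortened for illustration.
DRAIN_SECONDS=2   # never zero in production (anti-pattern 2)

remove_from_lb() { echo "1. remove instance from Nginx upstream"; }
deploy_version() { echo "3. deploy new version"; }
health_gate()    { echo "4. health check passed (placeholder gate)"; }
add_to_lb()      { echo "5. restore instance to upstream"; }

remove_from_lb
echo "2. drain ${DRAIN_SECONDS}s so in-flight requests can finish"
sleep "$DRAIN_SECONDS"
deploy_version
if health_gate; then
  add_to_lb   # traffic only AFTER the gate passes (avoids anti-pattern 1)
else
  echo "health gate failed: roll back, do NOT restore traffic"
fi
```

The key property is that `add_to_lb` is unreachable unless the health gate succeeds; a real gate would be a curl loop against the instance's health endpoint.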

Regular Failover Drill Schedule

| Frequency | Experiment | Owner | Expected Impact |
|---|---|---|---|
| Monthly | Force-stop a single Tomcat instance | DevOps team | Traffic auto-redirects to other instances |
| Quarterly | Nginx server restart | DevOps + Dev team | < 5 seconds interruption allowed |
| Semi-annually | DB failover simulation | DBA + DevOps | Enhanced monitoring during drill |
| Annually | Full datacenter failure simulation | All teams (GameDay) | Announce in advance |
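The monthly drill reduces to a two-step check: stop one instance, then confirm the service still answers through the load balancer. The VIP URL, hostname, and stop command below are placeholders:

```shell
#!/bin/bash
# monthly-drill.sh — sketch of the monthly single-instance drill.
# The VIP URL and the instance-stop command are placeholders/assumptions.
VIP_HEALTH="http://lb.example.internal/health"

stop_one_instance() {
  # placeholder for e.g.: ssh app-1 'sudo systemctl stop tomcat'
  echo "drill: stopping tomcat on app-1 (placeholder)"
}

check_via_lb() {
  curl -s -o /dev/null --max-time 3 -w "%{http_code}" "$VIP_HEALTH" 2>/dev/null
}

stop_one_instance
code=$(check_via_lb)
if [ "$code" = "200" ]; then
  echo "PASS: traffic auto-redirected to the remaining instances"
else
  echo "CHECK: LB returned HTTP ${code:-000} during the drill"
fi
```

Recording the result each month (pass/fail plus recovery time) turns the drill into a trend line rather than a box-ticking exercise.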

Final High-Availability Checklist

A final checklist to verify that the HA configuration is correct after a deployment.

# Final HA Checklist

## Infrastructure Configuration
[ ] Two or more Tomcat instances running
[ ] All instances registered in Nginx upstream
[ ] Health check endpoint returning 200 on all instances
[ ] Load balancer failover test completed

## Deployment Pipeline
[ ] Blue-Green or Rolling deployment confirmed
[ ] Automatic rollback script tested
[ ] Health check timeout values set appropriately (not too short)
[ ] CI/CD pipeline staging -> production branching confirmed

## Monitoring
[ ] Prometheus/Grafana dashboards operating normally
[ ] Alert rules categorized by P1/P2/P3 priority
[ ] Slack/PagerDuty notification test completed
[ ] Log aggregation (ELK/Loki) operating normally

## Incident Response
[ ] Runbooks updated to latest version
[ ] On-call rotation schedule confirmed
[ ] Escalation contact list is up to date
[ ] Failover drill conducted within the last 3 months

Chaos engineering and runbooks are not merely tools — they are the practices that build a team's resilience culture. The true essence of high-availability operations is finding and fixing system weaknesses before incidents happen, not after.