Pro Tips — Chaos Engineering for HA Validation and Writing Failure Runbooks

What Is Chaos Engineering?

Chaos Engineering is the practice of proactively experimenting on a system to build confidence in its ability to withstand turbulent production conditions. Netflix's Chaos Monkey, developed in 2011 as the company migrated its infrastructure to AWS, is the defining example: it randomly terminates server instances in production to verify that the overall system tolerates the loss.

Core Principles

  • Experiment in production: True resilience can only be validated against real traffic environments
  • Expand gradually: Start with small-scope experiments and progressively widen the blast radius
  • Abort conditions: Immediately halt an experiment if predefined SLOs (Service Level Objectives) are violated
  • Run continuously: Treat it as an ongoing validation process, not a one-time event
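The abort-condition principle can be sketched as a small guard that runs between experiment steps. The 1% error budget and the way the error rate is obtained here are illustrative assumptions:

```shell
#!/bin/bash
# abort-guard.sh — minimal sketch of an SLO abort condition for a chaos
# experiment; the error budget value is an illustrative assumption.

ERROR_BUDGET_PCT="1"   # abort when the 5xx error rate exceeds 1%

# should_abort RATE_PCT -> true (exit 0) when the rate exceeds the budget
should_abort() {
  # awk handles the floating-point comparison that bash [ ] cannot
  awk -v r="$1" -v b="$ERROR_BUDGET_PCT" 'BEGIN { exit !(r > b) }'
}

# In a real run the rate would come from monitoring; sample value here:
if should_abort "2.5"; then
  echo "ABORT: SLO violated, halting experiment"
else
  echo "CONTINUE: within error budget"
fi
```

In a full experiment loop this guard would run after every injection step, so a violated SLO stops the blast radius from widening.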

Netflix GameDays

A GameDay is a planned failure drill involving the entire team. Various failure scenarios are intentionally triggered in a real production environment, and the team measures how quickly it can respond and recover.

Chaos Test Scripts

Random Container Kill Test

#!/bin/bash
# chaos-container-kill.sh — Validate HA by stopping a random container

NAMESPACE="production"
EXCLUDE_CONTAINERS="nginx|database|redis"
LOG_FILE="/var/log/chaos/chaos-$(date '+%Y%m%d_%H%M%S').log"

mkdir -p /var/log/chaos

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"; }

# Precondition check (validate minimum service health)
pre_check() {
  local HEALTH_URL="http://localhost/health"
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL")
  if [ "$HTTP_STATUS" != "200" ]; then
    log "Precondition failed: service is already unhealthy (HTTP $HTTP_STATUS)"
    log "Chaos experiment canceled"
    exit 1
  fi
  log "Precondition met: service healthy (HTTP $HTTP_STATUS)"
}

# Select and stop a random container
chaos_kill_container() {
  # Running containers, excluding the exclusion list
  CONTAINERS=$(docker ps --format "{{.Names}}" | grep -vE "$EXCLUDE_CONTAINERS")

  # Note: test the string, not `wc -l` — echo of an empty string still counts as 1 line
  if [ -z "$CONTAINERS" ]; then
    log "No containers available to stop"
    exit 0
  fi
  CONTAINER_COUNT=$(echo "$CONTAINERS" | wc -l)

  # Random selection (sed -n "Np" prints the Nth line)
  RANDOM_INDEX=$((RANDOM % CONTAINER_COUNT + 1))
  TARGET_CONTAINER=$(echo "$CONTAINERS" | sed -n "${RANDOM_INDEX}p")

  log "Chaos experiment start: stopping container '${TARGET_CONTAINER}'"
  docker stop "$TARGET_CONTAINER"

  log "Container stopped. Starting recovery monitoring..."
}

# Monitor service recovery
monitor_recovery() {
  local START_TIME
  START_TIME=$(date +%s)
  local HEALTH_URL="http://localhost/health"
  local MAX_WAIT=120

  while true; do
    HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$HEALTH_URL" || echo "000")
    ELAPSED=$(($(date +%s) - START_TIME))

    if [ "$HTTP_STATUS" == "200" ]; then
      log "Service recovered! (${ELAPSED}s elapsed)"
      break
    fi

    if [ "$ELAPSED" -ge "$MAX_WAIT" ]; then
      log "WARNING: service did not recover within ${MAX_WAIT}s (HTTP $HTTP_STATUS)"
      log "Immediate manual intervention required!"
      exit 1
    fi

    log "Waiting for recovery... (${ELAPSED}s, HTTP $HTTP_STATUS)"
    sleep 5
  done
}

# Main execution
pre_check
chaos_kill_container
monitor_recovery
log "Chaos experiment complete. Log: $LOG_FILE"
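The random-selection technique used in chaos_kill_container (count the lines, roll a 1-based index, extract it with sed) can be exercised in isolation, which is useful before pointing it at real containers. The candidate names below are hypothetical:

```shell
#!/bin/bash
# pick_random_line — same selection technique as chaos_kill_container:
# count the lines, pick a 1-based random index, print that line with sed.
pick_random_line() {
  local list="$1"
  local count idx
  count=$(echo "$list" | wc -l)
  idx=$((RANDOM % count + 1))
  echo "$list" | sed -n "${idx}p"
}

# Hypothetical container names for a dry run:
CANDIDATES=$'app-1\napp-2\napp-3'
echo "Selected: $(pick_random_line "$CANDIDATES")"
```

Every run prints one of the three candidates, each with equal probability, which is exactly the behavior the kill script relies on.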

Network Delay Simulation with tc

#!/bin/bash
# chaos-network-delay.sh — Simulate network latency and packet loss

INTERFACE="eth0"    # Target network interface
TARGET_PORT="8080"  # Port of the health-check URL (netem below affects the whole interface)
DELAY_MS="200"      # Delay in milliseconds
JITTER_MS="50"      # Jitter (random variation)
PACKET_LOSS="5"     # Packet loss percentage
DURATION="60"       # Experiment duration (seconds)

log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }

# Apply network chaos
apply_network_chaos() {
  log "Starting network delay simulation"
  log "Interface: $INTERFACE, Delay: ${DELAY_MS}ms+/-${JITTER_MS}ms, Packet loss: ${PACKET_LOSS}%"

  # Add qdisc rule (network delay + jitter + packet loss)
  tc qdisc add dev "$INTERFACE" root netem \
    delay "${DELAY_MS}ms" "${JITTER_MS}ms" \
    loss "${PACKET_LOSS}%"

  log "Network chaos applied"
}

# Check current network state
check_network_status() {
  log "Current tc rules:"
  tc qdisc show dev "$INTERFACE"
}

# Clean up network rules
cleanup_network_chaos() {
  log "Removing network chaos..."
  tc qdisc del dev "$INTERFACE" root 2>/dev/null || true
  log "Network normalized"
}

# Measure service response time during experiment
measure_response_time() {
  local HEALTH_URL="http://localhost:${TARGET_PORT}/health"
  local RESULT
  RESULT=$(curl -s -o /dev/null \
    -w "HTTP: %{http_code} | Total: %{time_total}s | DNS: %{time_namelookup}s | Connect: %{time_connect}s" \
    "$HEALTH_URL" 2>/dev/null)
  log "Service response: $RESULT"
}

# Always clean up even if experiment is interrupted
trap cleanup_network_chaos EXIT

apply_network_chaos
check_network_status

log "Experiment started. Monitoring service response for ${DURATION}s..."
END_TIME=$(($(date +%s) + DURATION))
while [ "$(date +%s)" -lt "$END_TIME" ]; do
  measure_response_time
  sleep 5
done

cleanup_network_chaos
trap - EXIT   # cleanup already done; prevent the EXIT trap from running it a second time
log "Network chaos experiment complete"
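While the experiment runs, a quick way to confirm the injected delay is actually visible end-to-end is to compare curl's time_total against a known-good baseline. The baseline and multiplier below are illustrative assumptions:

```shell
#!/bin/bash
# latency-check.sh — flag response times that exceed a multiple of the
# baseline; BASELINE_S and MAX_FACTOR are illustrative values.
BASELINE_S="0.050"   # assumed normal response time (50ms)
MAX_FACTOR="4"       # flag anything slower than 4x the baseline

# is_degraded SECONDS -> true when the measurement exceeds baseline * factor
is_degraded() {
  awk -v m="$1" -v b="$BASELINE_S" -v f="$MAX_FACTOR" 'BEGIN { exit !(m > b * f) }'
}

# In the real experiment, feed in curl's %{time_total}; sample values here:
for t in 0.045 0.120 0.310; do
  if is_degraded "$t"; then
    echo "${t}s: DEGRADED (consistent with the injected netem delay)"
  else
    echo "${t}s: OK"
  fi
done
```

Seeing DEGRADED while the qdisc is applied, and OK again after cleanup, confirms the chaos actually reached the application path rather than only the interface.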

Failure Runbook Template

A runbook is a standard procedure guide that on-call responders follow when an incident occurs. A well-written runbook enables fast and accurate responses even when an incident strikes at 3 AM.

Sample Runbook: Tomcat Not Responding

# Runbook: Tomcat Server Not Responding (P1)

## Symptoms
- Large number of Nginx 502 Bad Gateway responses
- Tomcat health check endpoint not responding
- Alert: "tomcat_health_check CRITICAL"

## Immediate Actions (within 5 minutes)

### Step 1: Assess the Situation
[ ] Check number of affected users (monitoring dashboard)
[ ] Check error start time (correlated with a recent deployment?)
[ ] Check Tomcat process status: systemctl status tomcat
[ ] Check port response: curl -v http://localhost:8080/health

### Step 2: Quick Recovery Attempt
[ ] Restart Tomcat: systemctl restart tomcat
[ ] Wait 60 seconds after restart
[ ] Re-check health: curl http://localhost:8080/health

### Step 3: If Restart Doesn't Resolve the Issue
[ ] Review logs: tail -n 200 /opt/tomcat/logs/catalina.out
[ ] Check for OOM: grep -i "OutOfMemoryError" /opt/tomcat/logs/catalina.out
[ ] Check disk space: df -h
[ ] Check JVM heap usage: jmap -heap $(pgrep -f catalina)  (JDK 8; on JDK 9+ use: jhsdb jmap --heap --pid $(pgrep -f catalina))

## Escalation
- Failure to recover in 5 minutes: page team lead
- Failure to recover in 15 minutes: page CTO + declare formal incident
- Continuing beyond 30 minutes: broadcast full outage notification
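The Step 1 checks lend themselves to a small collection script the on-call engineer can run once instead of typing four commands at 3 AM. The service name and log path follow this runbook and are assumptions about the host:

```shell
#!/bin/bash
# runbook-step1.sh — gather the Step 1 diagnostics in one pass.
# Service name and paths follow the runbook above; adjust per host.
collect() {
  local title="$1"; shift
  echo "=== $title ==="
  "$@" 2>&1 || echo "(command failed or unavailable)"
}

collect "Tomcat process status"  systemctl status tomcat --no-pager
collect "Port 8080 health check" curl -sv --max-time 5 http://localhost:8080/health
collect "Recent catalina.out"    tail -n 50 /opt/tomcat/logs/catalina.out
collect "Disk space"             df -h
```

Piping the output to `tee step1-$(date +%s).log` also preserves the evidence for the postmortem, since a restart in Step 2 destroys the live process state.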

Pre-Deployment Checklist

#!/bin/bash
# pre-deploy-check.sh — Required checks before deployment

PASS=0
FAIL=0

check() {
  local DESCRIPTION="$1"
  local COMMAND="$2"
  local EXPECTED="$3"

  ACTUAL=$(eval "$COMMAND" 2>/dev/null || echo "ERROR")
  if [ "$ACTUAL" == "$EXPECTED" ]; then
    echo "PASS: $DESCRIPTION"
    PASS=$((PASS + 1))
  else
    echo "FAIL: $DESCRIPTION (expected: $EXPECTED, actual: $ACTUAL)"
    FAIL=$((FAIL + 1))
  fi
}

echo "=== Pre-Deployment Checklist ==="

# 1. Service health
check "Tomcat service is running" \
  "systemctl is-active tomcat" "active"

check "Nginx service is running" \
  "systemctl is-active nginx" "active"

# 2. Disk space (less than 80%)
DISK_USAGE=$(df /opt/tomcat | awk 'NR==2{print $5}' | tr -d '%')
if [ "$DISK_USAGE" -lt 80 ]; then
  echo "PASS: Disk usage normal (${DISK_USAGE}%)"
  PASS=$((PASS + 1))
else
  echo "FAIL: Disk usage too high (${DISK_USAGE}%)"
  FAIL=$((FAIL + 1))
fi

# 3. Memory usage (less than 90%)
MEM_USAGE=$(free | awk '/Mem/{printf "%.0f", $3/$2*100}')
if [ "$MEM_USAGE" -lt 90 ]; then
  echo "PASS: Memory usage normal (${MEM_USAGE}%)"
  PASS=$((PASS + 1))
else
  echo "FAIL: Memory usage too high (${MEM_USAGE}%)"
  FAIL=$((FAIL + 1))
fi

# 4. Health check response
HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health || echo "000")
check "Health check returns 200" "echo $HTTP_STATUS" "200"

echo ""
echo "=== Result: PASS $PASS / FAIL $FAIL ==="

if [ "$FAIL" -gt 0 ]; then
  echo "Pre-deployment checks failed. Do NOT proceed with deployment."
  exit 1
else
  echo "All checks passed. Safe to proceed with deployment."
  exit 0
fi

Post-Deployment Monitoring Points (Golden Signals)

Monitor these four signals after every deployment, based on the Four Golden Signals defined by Google SRE.

| Signal | Description | Normal Threshold | Alert Threshold |
|---|---|---|---|
| Latency | p99 response time | < 500ms | > 2000ms |
| Traffic | Requests per second (RPS) | Within ±20% of pre-deploy baseline | > ±50% deviation |
| Errors | 5xx error rate | < 0.1% | > 1% |
| Saturation | CPU/memory/thread pool usage | < 70% | > 90% |
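For the Errors signal, a rough 5xx rate can be computed straight from an Nginx access log with awk. The combined log format (status code in field 9) and the sample lines below are assumptions:

```shell
#!/bin/bash
# error-rate.sh — percentage of 5xx responses in access-log lines on stdin.
# Assumes Nginx's default combined format, where field 9 is the status code.
error_rate() {
  awk '{ total++; if ($9 ~ /^5/) errors++ }
       END { if (total) printf "%.2f\n", errors * 100 / total; else print "0.00" }'
}

# Hypothetical sample: 1 error out of 4 requests -> 25.00
printf '%s\n' \
  '10.0.0.1 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 512' \
  '10.0.0.1 - - [01/Jan/2025:00:00:01 +0000] "GET /api HTTP/1.1" 502 0' \
  '10.0.0.1 - - [01/Jan/2025:00:00:02 +0000] "GET /img HTTP/1.1" 200 128' \
  '10.0.0.1 - - [01/Jan/2025:00:00:03 +0000] "GET /css HTTP/1.1" 304 0' \
  | error_rate
```

Against the live log this might look like `tail -n 10000 /var/log/nginx/access.log | error_rate` (path assumed); compare the result against the 0.1% / 1% thresholds in the table.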

Zero-Downtime Deployment Anti-Patterns

The following behaviors are the most common anti-patterns that cause zero-downtime deployments to fail.

Things you must never do:

  1. Restoring a server to the load balancer before its health check passes: the most common mistake, sending traffic before the server is ready
  2. Setting drain time to zero: terminating Tomcat immediately after removing it from the Nginx upstream cuts off in-flight requests
  3. Deploying again immediately after a deployment: stacking releases before the previous one stabilizes makes root-cause analysis impossible
  4. Deploying without a rollback plan: proceeding without deciding in advance what to do if the post-deploy health check fails
  5. A staging environment that differs from production: the root cause of "works in staging, fails in production"
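Anti-patterns 1 and 2 share one cure: drain first, verify health, and only then restore traffic. A minimal sketch of that ordering, where every helper is a placeholder for the real upstream-toggle, deploy, and health-check mechanisms:

```shell
#!/bin/bash
# safe-order.sh — sketch of the correct zero-downtime ordering. All four
# helpers are placeholders; DRAIN_SECONDS is shortened for illustration.
DRAIN_SECONDS=2   # never zero in production (anti-pattern 2)

remove_from_lb() { echo "1. remove instance from Nginx upstream"; }
deploy_version() { echo "3. deploy new version"; }
health_gate()    { echo "4. health check passed (placeholder gate)"; }
add_to_lb()      { echo "5. restore instance to upstream"; }

remove_from_lb
echo "2. drain ${DRAIN_SECONDS}s so in-flight requests can finish"
sleep "$DRAIN_SECONDS"
deploy_version
if health_gate; then
  add_to_lb   # traffic only AFTER the gate passes (avoids anti-pattern 1)
else
  echo "health gate failed: roll back, do NOT restore traffic"
fi
```

The key property is that `add_to_lb` is unreachable unless the health gate succeeds; a real gate would be a curl loop against the instance's health endpoint.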

Regular Failover Drill Schedule

| Frequency | Experiment | Owner | Expected Impact |
|---|---|---|---|
| Monthly | Force-stop a single Tomcat instance | DevOps team | Traffic auto-redirects to other instances |
| Quarterly | Nginx server restart | DevOps + Dev team | < 5 seconds interruption allowed |
| Semi-annually | DB failover simulation | DBA + DevOps | Enhanced monitoring during drill |
| Annually | Full datacenter failure simulation | All teams (GameDay) | Announce in advance |
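The monthly drill reduces to a two-step check: stop one instance, then confirm the service still answers through the load balancer. The VIP URL, hostname, and stop command below are placeholders:

```shell
#!/bin/bash
# monthly-drill.sh — sketch of the monthly single-instance drill.
# The VIP URL and the instance-stop command are placeholders/assumptions.
VIP_HEALTH="http://lb.example.internal/health"

stop_one_instance() {
  # placeholder for e.g.: ssh app-1 'sudo systemctl stop tomcat'
  echo "drill: stopping tomcat on app-1 (placeholder)"
}

check_via_lb() {
  curl -s -o /dev/null --max-time 3 -w "%{http_code}" "$VIP_HEALTH" 2>/dev/null
}

stop_one_instance
code=$(check_via_lb)
if [ "$code" = "200" ]; then
  echo "PASS: traffic auto-redirected to the remaining instances"
else
  echo "CHECK: LB returned HTTP ${code:-000} during the drill"
fi
```

Recording the result each month (pass/fail plus recovery time) turns the drill into a trend line rather than a box-ticking exercise.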

Final High-Availability Checklist

A final checklist to verify that the HA configuration is correct after a deployment.

# Final HA Checklist

## Infrastructure Configuration
[ ] Two or more Tomcat instances running
[ ] All instances registered in Nginx upstream
[ ] Health check endpoint returning 200 on all instances
[ ] Load balancer failover test completed

## Deployment Pipeline
[ ] Blue-Green or Rolling deployment confirmed
[ ] Automatic rollback script tested
[ ] Health check timeout values set appropriately (not too short)
[ ] CI/CD pipeline staging -> production branching confirmed

## Monitoring
[ ] Prometheus/Grafana dashboards operating normally
[ ] Alert rules categorized by P1/P2/P3 priority
[ ] Slack/PagerDuty notification test completed
[ ] Log aggregation (ELK/Loki) operating normally

## Incident Response
[ ] Runbooks updated to latest version
[ ] On-call rotation schedule confirmed
[ ] Escalation contact list is up to date
[ ] Failover drill conducted within the last 3 months

Chaos engineering and runbooks are not merely tools — they are the practices that build a team's resilience culture. The true essence of high-availability operations is finding and fixing system weaknesses before incidents happen, not after.