Health Check Configuration: Automatic Failure Detection and Node Removal
In a load-balanced environment, health checks are an indispensable core feature. When a backend server fails, the load balancer must detect it automatically, stop routing traffic to it, and bring it back automatically once it recovers. Without health checks, traffic keeps flowing to the failed server and users receive error responses directly.
Passive vs Active Health Checks
There are two types of health checks.
Passive Health Check
Monitors responses to actual user requests. Automatically excludes a server when errors exceed a threshold.
User request → App1 (healthy) ✅
User request → App2 (error) ❌ count 1
User request → App2 (error) ❌ count 2
User request → App2 (error) ❌ count 3 → threshold reached → App2 excluded
Subsequent requests → App1, App3 only
Advantage: Detects failures from real traffic without generating extra requests. Disadvantage: Real users receive errors until the threshold is reached, so detection is delayed.
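The counting logic behind the scenario above can be sketched in a few lines. This is an illustrative model, not Nginx's actual implementation (class and method names are invented for this example): a backend is excluded once its consecutive-failure count reaches a threshold, and any success resets the count.

```javascript
// Illustrative model of passive health-check failure counting.
// A backend is excluded once consecutive failures reach `threshold`.
class PassiveHealthTracker {
  constructor(threshold) {
    this.threshold = threshold;
    this.failures = new Map(); // backend -> consecutive failure count
  }

  // Record the outcome of a real user request routed to `backend`.
  record(backend, ok) {
    if (ok) {
      this.failures.set(backend, 0); // any success resets the count
    } else {
      this.failures.set(backend, (this.failures.get(backend) || 0) + 1);
    }
  }

  isHealthy(backend) {
    return (this.failures.get(backend) || 0) < this.threshold;
  }
}

// Replaying the scenario above: three errors from App2 exclude it.
const tracker = new PassiveHealthTracker(3);
tracker.record('App1', true);
tracker.record('App2', false);
tracker.record('App2', false);
tracker.record('App2', false);
console.log(tracker.isHealthy('App1')); // true
console.log(tracker.isHealthy('App2')); // false
```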
Active Health Check
The load balancer periodically sends requests to a dedicated health check endpoint to proactively verify server health.
[Load Balancer]
   │ (every 30 seconds)
   ├──▶ App1 /health → 200 OK ✅ (healthy)
   ├──▶ App2 /health → Connection refused ❌ → auto exclude
   └──▶ App3 /health → 200 OK ✅ (healthy)
Advantage: Proactive detection without user impact and much faster failure detection. Disadvantage: Generates extra traffic, and support varies: HAProxy and AWS ALB include active checks, while open source Nginx reserves them for the commercial Nginx Plus.
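The pass/fail bookkeeping an active checker performs can be sketched as below. This is a hedged model of the general technique, not any specific load balancer's code: `probe` stands in for the periodic HTTP request to `/health` (any async function resolving `true` when healthy), so the fails/passes state machine can be shown without a real network.

```javascript
// Illustrative active health-check state machine: exclude after
// `fails` consecutive failed probes, readmit after `passes`
// consecutive successful ones.
function createActiveChecker({ fails = 3, passes = 2, probe }) {
  let healthy = true;
  let failCount = 0;
  let passCount = 0;

  return {
    async check() {
      let ok;
      try {
        ok = await probe(); // e.g. GET /health returning 200
      } catch {
        ok = false;         // connection refused / timeout
      }
      if (ok) {
        passCount += 1;
        failCount = 0;
        if (!healthy && passCount >= passes) healthy = true; // recovered
      } else {
        failCount += 1;
        passCount = 0;
        if (healthy && failCount >= fails) healthy = false;  // excluded
      }
      return healthy;
    },
    isHealthy: () => healthy,
  };
}

// In production this would run on a timer, e.g.:
// setInterval(() => checker.check(), 10_000);
```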
Nginx Passive Health Check
Open source Nginx does not support active health checks. Instead, configure passive health checks with max_fails and fail_timeout.
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 max_fails=3 fail_timeout=30s;
}
How It Works
max_fails (3) failures within fail_timeout (30s)
→ the server is excluded for fail_timeout (30s)
→ after 30s, Nginx routes one live request to it as a trial
→ success: back in rotation / failure: excluded for another 30s
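The sliding-window and exclusion-timer behavior described above can be modeled explicitly. This is a sketch under stated assumptions (Nginx's internal accounting differs in detail, and the `Upstream` class here is invented for illustration); time is passed in as a plain number of seconds to keep the logic easy to follow.

```javascript
// Model of max_fails / fail_timeout: `maxFails` failures within a
// `failTimeout`-second window exclude the server for `failTimeout`
// seconds, after which it becomes eligible again.
class Upstream {
  constructor({ maxFails = 3, failTimeout = 30 }) {
    this.maxFails = maxFails;
    this.failTimeout = failTimeout; // seconds
    this.failures = [];             // timestamps of recent failures
    this.excludedUntil = -Infinity;
  }

  isAvailable(now) {
    return now >= this.excludedUntil;
  }

  recordFailure(now) {
    // Keep only failures inside the sliding fail_timeout window.
    this.failures = this.failures.filter((t) => now - t < this.failTimeout);
    this.failures.push(now);
    if (this.failures.length >= this.maxFails) {
      this.excludedUntil = now + this.failTimeout; // excluded for 30s
      this.failures = [];
    }
  }

  recordSuccess() {
    this.failures = [];
  }
}

const app2 = new Upstream({ maxFails: 3, failTimeout: 30 });
app2.recordFailure(0);
app2.recordFailure(5);
app2.recordFailure(10);            // 3rd failure within 30s
console.log(app2.isAvailable(20)); // false: excluded until t=40
console.log(app2.isAvailable(45)); // true: eligible for a trial request
```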
What Counts as a Failure
By default, only connection errors and timeouts count as failures. To also count HTTP error codes (502, 503, etc.), configure proxy_next_upstream:
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}
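Conceptually, `proxy_next_upstream` means: on an error, timeout, or listed 5xx from one backend, replay the request against the next one, up to a bounded number of tries. A minimal sketch of that retry semantics (illustrative only; `sendTo` is a hypothetical stand-in for the proxied request, any async function returning an HTTP status code):

```javascript
// Sketch of proxy_next_upstream-style retrying: skip to the next
// backend on connection errors or retryable status codes, up to
// `tries` attempts; if all fail, surface the last error.
async function proxyWithRetry(
  backends,
  sendTo,
  { tries = 3, retryOn = [500, 502, 503, 504] } = {}
) {
  let lastError;
  for (let i = 0; i < Math.min(tries, backends.length); i++) {
    const backend = backends[i];
    try {
      const status = await sendTo(backend);
      if (!retryOn.includes(status)) {
        return { backend, status };  // success: stop retrying
      }
      lastError = new Error(`${backend} returned ${status}`);
    } catch (e) {
      lastError = e;                 // connection error/timeout: try next
    }
  }
  throw lastError;                   // all tries exhausted
}
```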
Nginx Plus Active Health Check
Nginx Plus (commercial) supports active health checks.
upstream backend {
    zone backend 64k;  # Shared memory zone (required for active health checks)
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    location / {
        proxy_pass http://backend;
        health_check interval=10s     # Check every 10 seconds
                     fails=3          # Exclude after 3 consecutive failures
                     passes=2         # Return after 2 consecutive successes
                     uri=/health      # Health check endpoint
                     match=server_ok; # Response validation condition
    }
}

# Define response validation conditions
match server_ok {
    status 200;
    header Content-Type ~ "application/json";
    body ~ '"status":"UP"';  # Response body pattern matching
}
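The `match` block expresses an AND of three conditions: status, header, and body must all pass for the response to count as healthy. A sketch of that validation logic in plain code (illustrative; the response shape here is an assumption for the example):

```javascript
// Equivalent of the server_ok match block: status 200 AND a JSON
// Content-Type AND a body containing "status":"UP".
function matchesServerOk(res) {
  // res: { status, headers: { name: value }, body }
  return (
    res.status === 200 &&
    /application\/json/.test(res.headers['content-type'] || '') &&
    /"status":"UP"/.test(res.body)
  );
}

matchesServerOk({
  status: 200,
  headers: { 'content-type': 'application/json' },
  body: '{"status":"UP"}',
}); // healthy
```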
Apache mod_proxy_hcheck (Active Health Check)
Apache 2.4.21+ supports active health checks via the mod_proxy_hcheck module.
sudo a2enmod proxy_hcheck
sudo systemctl reload apache2
<Proxy "balancer://mycluster">
    # hcmethod:   health check HTTP method
    # hcuri:      health check endpoint URI
    # hcinterval: check interval (seconds)
    # hcpasses:   return after N consecutive successes
    # hcfails:    exclude after N consecutive failures
    # hcexpr:     response validation expression
    # (Apache does not allow trailing comments, and multi-line
    #  directives need backslash continuation)
    BalancerMember "http://10.0.0.1:8080" \
        hcmethod=GET hcuri=/health hcinterval=10 \
        hcpasses=2 hcfails=3 hcexpr=hc200ok
    BalancerMember "http://10.0.0.2:8080" \
        hcmethod=GET hcuri=/health hcinterval=10 \
        hcpasses=2 hcfails=3 hcexpr=hc200ok
    ProxySet lbmethod=bybusyness
</Proxy>
# Define the expressions before the <Proxy> block that references them.
# 2xx or 3xx response status = healthy
ProxyHCExpr hc200ok {%{REQUEST_STATUS} =~ /^[23]/}
# JSON body validation
ProxyHCExpr hcjsonok {hc('body') =~ /"status"\s*:\s*"UP"/}
hcmethod Options

| Value | Description |
|---|---|
| GET | HTTP GET health check |
| HEAD | HTTP HEAD (no body, lightweight) |
| OPTIONS | HTTP OPTIONS |
| TCP | TCP connection only (no HTTP needed) |
| CPING | AJP CPING protocol (Tomcat-specific) |
Application Health Check Endpoint Implementation
For the load balancer's health checks to be meaningful, the backend application must return correct health check responses.
Spring Boot Actuator (Java)
// build.gradle
implementation 'org.springframework.boot:spring-boot-starter-actuator'
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health
  endpoint:
    health:
      show-details: always
Spring Boot Actuator automatically creates the /actuator/health endpoint:
{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "diskSpace": {"status": "UP"},
    "ping": {"status": "UP"}
  }
}
When the DB connection is lost, it automatically returns "status": "DOWN" with HTTP 503.
Custom Health Check Endpoint (Node.js)
const express = require('express');
const app = express();

// Assumes a DB client (e.g. a pg Pool or mysql2 connection) has been
// initialized elsewhere and is available as `db`.
async function checkDatabase() {
  try {
    await db.query('SELECT 1');
    return { status: 'UP' };
  } catch (e) {
    return { status: 'DOWN', error: e.message };
  }
}

app.get('/health', async (req, res) => {
  const dbStatus = await checkDatabase();
  const health = {
    status: dbStatus.status === 'UP' ? 'UP' : 'DOWN',
    timestamp: new Date().toISOString(),
    components: {
      database: dbStatus,
    },
  };
  const statusCode = health.status === 'UP' ? 200 : 503;
  res.status(statusCode).json(health);
});

app.listen(8080);
Health Check Endpoint Design Principles
✅ Good health check design:
- GET /health → 200 OK (service healthy)
- GET /health → 503 Service Unavailable (service unavailable)
- Checks critical dependencies such as the DB connection
- Response time under 5 seconds (responds before the load balancer's timeout)
- No authentication required (easy load balancer access)
❌ Poor health check design:
- Always returns 200 OK (meaningless)
- Fails on every dependency, critical or not (too strict, spurious removals)
- Contains slow logic (timeout errors)
- Requires IP restriction or authentication (blocks load balancer access)
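The "responds before the timeout" principle can be enforced mechanically by racing each dependency check against a deadline, so one hung dependency cannot hang the whole `/health` endpoint. A minimal sketch (the `withTimeout` helper and the budget values are assumptions for this example, usable with a check like the `checkDatabase` function above):

```javascript
// Race a dependency check against a deadline: if the check does not
// settle within `ms`, report DOWN instead of hanging the endpoint.
function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ status: 'DOWN', error: `${label} timed out` }),
      ms
    );
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage: even a check that never resolves yields DOWN within 2s.
// const dbStatus = await withTimeout(checkDatabase(), 2000, 'database');
```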
Production Health Check Strategy
Layered Health Check Configuration
[Load Balancer]
   │
   ├── Shallow Check (fast, every 30 seconds)
   │     GET /health/ping → 200 OK
   │     (only checks that the application process is alive)
   │
   └── Deep Check (slow, every 5 minutes via external monitoring)
         GET /health/deep → 200 OK
         (checks DB, Redis, external API connections)
The load balancer performs only the shallow check. Deep checks are run by external monitoring tools such as Prometheus or Nagios, which raise alerts instead of removing nodes from rotation.
Health Check Log Management
Health check requests filling access logs create noise. Filter them out:
# Nginx: exclude /health requests from logs
map $request_uri $loggable {
    ~^/health  0;
    default    1;
}
access_log /var/log/nginx/access.log combined if=$loggable;
# Apache: exclude /health requests from logs
SetEnvIf Request_URI "^/health$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog
Recommended Health Check Settings
| Environment | interval | fails | passes | fail_timeout |
|---|---|---|---|---|
| High availability service | 5s | 2 | 3 | 10s |
| General web service | 10s | 3 | 2 | 30s |
| Batch/internal service | 30s | 3 | 2 | 60s |
Key principles:
- `fails` too low → excluded by transient delays (false positives)
- `fails` too high → delayed failure detection (false negatives)
- `passes` too low → unstable servers re-enter too early (flapping)
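These trade-offs can be roughly quantified: with active checks, worst-case detection time is about interval × fails, and readmission after recovery takes about interval × passes. A small sketch (the helper name and the simplification, ignoring probe timeouts, are assumptions of this example):

```javascript
// Rough worst-case timings for active health checks:
// detection  ≈ interval × fails
// readmission ≈ interval × passes
function detectionDelays({ intervalSec, fails, passes }) {
  return {
    worstCaseDetectSec: intervalSec * fails,
    worstCaseRecoverSec: intervalSec * passes,
  };
}

// High-availability row from the table: interval 5s, fails=2, passes=3
// → detection within ~10s, readmission within ~15s.
detectionDelays({ intervalSec: 5, fails: 2, passes: 3 });
```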
The next page covers gradual traffic shifting (canary deployment) and failed node recovery strategies.