
Health Check Configuration: Automatic Failure Detection and Node Removal

In a load-balanced environment, health checks are an indispensable feature. When a backend server fails, the load balancer must detect the failure automatically, stop sending traffic to that server, and re-include it once it recovers. Without health checks, users receive error responses from the failed server directly.


Passive vs Active Health Checks

There are two types of health checks: passive and active.

Health Check Flow

Passive Health Check

The load balancer monitors responses to actual user requests and excludes a server once its errors exceed a threshold.

User request β†’ App1 (healthy)  βœ“
User request β†’ App2 (error) βœ— ← count
User request β†’ App2 (error) βœ— ← count
User request β†’ App2 (error) βœ— ← threshold reached β†’ App2 excluded
Subsequent β†’ App1, App3 only

Advantage: Detects server state from real requests without generating extra traffic. Disadvantage: Actual users receive errors before the failure is detected, so detection is delayed.

Active Health Check

The load balancer periodically sends requests to a dedicated health check endpoint to proactively verify server health.

[Load Balancer]
β”‚ (every 30 seconds)
β”œβ”€β”€β–Ά App1 /health β†’ 200 OK βœ“ (healthy)
β”œβ”€β”€β–Ά App2 /health β†’ Connection refused βœ— β†’ auto exclude
└──▢ App3 /health β†’ 200 OK βœ“ (healthy)

Advantage: Proactive detection without user impact, and much faster failure detection than passive checks. Disadvantage: Generates extra traffic, and requires load balancer support: Nginx Plus (commercial), HAProxy, AWS ALB, and Apache's mod_proxy_hcheck provide it; open source Nginx does not.
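The consecutive-failure/consecutive-success bookkeeping behind an active checker can be sketched as follows (illustrative JavaScript, not any real load balancer's code; a real checker would run a timer that probes each backend's /health endpoint and feeds the result into `record()`):

```javascript
// Illustrative bookkeeping for an active health checker (not any real load
// balancer's code): a probe timer would call record() with each result.
class ActiveHealthChecker {
  constructor(backends, { fails = 3, passes = 2 } = {}) {
    this.state = new Map(
      backends.map((b) => [b, { healthy: true, failCount: 0, passCount: 0 }])
    );
    this.fails = fails;
    this.passes = passes;
  }

  // Record one probe result and update the backend's health state.
  record(backend, ok) {
    const s = this.state.get(backend);
    if (ok) {
      s.failCount = 0;
      if (!s.healthy && ++s.passCount >= this.passes) {
        s.healthy = true; // back in rotation after enough consecutive passes
        s.passCount = 0;
      }
    } else {
      s.passCount = 0;
      if (s.healthy && ++s.failCount >= this.fails) {
        s.healthy = false; // excluded after enough consecutive failures
        s.failCount = 0;
      }
    }
  }

  healthyBackends() {
    return [...this.state].filter(([, s]) => s.healthy).map(([b]) => b);
  }
}
```

With `fails=3` and `passes=2`, three failed probes in a row remove a backend and two successful probes restore it, matching the diagrams above.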


Nginx Passive Health Check

Open source Nginx does not support active health checks. Instead, configure passive health checks with max_fails and fail_timeout.

upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 max_fails=3 fail_timeout=30s;
}

How It Works

max_fails(3) failures within fail_timeout(30s)
β†’ Exclude the server for fail_timeout(30s)
β†’ After 30s, automatically try to recover (send 1 test request)
β†’ Success: return to rotation, failure: exclude for another 30s
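The counting and exclusion semantics above can be modeled in a few lines (a simplified sketch, not nginx's actual implementation; timestamps are in seconds):

```javascript
// Simplified model of nginx's max_fails / fail_timeout semantics
// (a sketch, not the real implementation). Timestamps are in seconds.
function makePeer({ maxFails = 3, failTimeout = 30 } = {}) {
  return { fails: 0, windowStart: 0, downUntil: 0, maxFails, failTimeout };
}

// A peer is eligible for traffic unless it is inside an exclusion window.
function isAvailable(peer, now) {
  return now >= peer.downUntil;
}

function reportFailure(peer, now) {
  if (now - peer.windowStart > peer.failTimeout) {
    peer.windowStart = now; // failures are counted within a rolling window
    peer.fails = 0;
  }
  if (++peer.fails >= peer.maxFails) {
    peer.downUntil = now + peer.failTimeout; // excluded for fail_timeout
    peer.fails = 0;
  }
}

function reportSuccess(peer) {
  peer.fails = 0; // any success resets the failure counter
}
```

Three failures within the 30s window push `downUntil` 30 seconds into the future; a success in between resets the counter, which is why sporadic errors do not remove a server.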

What Counts as a Failure

By default, only connection errors and timeouts count as failures. To also count HTTP error codes (502, 503, etc.), configure proxy_next_upstream:

upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}
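To see the retry behavior locally, one backend can be made deliberately flaky (a hypothetical test server; placed behind the upstream above, its 503 responses would be retried transparently on another server):

```javascript
// A deliberately flaky backend (hypothetical test server): every other
// request returns 503, which the proxy_next_upstream directive above tells
// nginx to retry on the next server in the upstream group.
let requestCount = 0;

function flakyHandler(req, res) {
  requestCount += 1;
  if (requestCount % 2 === 0) {
    res.writeHead(503, { 'Content-Type': 'text/plain' });
    res.end('temporarily unavailable');
  } else {
    res.writeHead(200, { 'Content-Type': 'text/plain' });
    res.end('ok');
  }
}

// Wiring (not run here):
// require('http').createServer(flakyHandler).listen(8080);
```

Note that without `http_503` in `proxy_next_upstream`, nginx would pass those 503s straight through to the client instead of retrying.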

Nginx Plus Active Health Check

Nginx Plus (commercial) supports active health checks.

upstream backend {
    zone backend 64k;  # Shared memory zone (required for active health checks)

    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    location / {
        proxy_pass http://backend;

        health_check
            interval=10s      # Check every 10 seconds
            fails=3           # Exclude after 3 consecutive failures
            passes=2          # Return after 2 consecutive successes
            uri=/health       # Health check endpoint
            match=server_ok;  # Response validation condition
    }
}

# Define response validation conditions
match server_ok {
    status 200;
    header Content-Type ~ "application/json";
    body ~ '"status":"UP"';  # Response body pattern matching
}

Apache mod_proxy_hcheck (Active Health Check)

Apache 2.4.21+ supports active health checks via the mod_proxy_hcheck module.

sudo a2enmod proxy_hcheck
sudo systemctl reload apache2
# Response validation expressions (Apache does not allow comments after a
# directive on the same line, so comments go on their own lines)
# 2xx or 3xx response = healthy
ProxyHCExpr hc200ok {%{REQUEST_STATUS} =~ /^[23]/}
# JSON body validation
ProxyHCExpr hcjsonok {hc('body') =~ /"status"\s*:\s*"UP"/}

<Proxy "balancer://mycluster">
    # hcmethod: health check HTTP method, hcuri: endpoint URI,
    # hcinterval: check interval (seconds), hcpasses: consecutive successes
    # to re-include, hcfails: consecutive failures to exclude,
    # hcexpr: response validation expression
    BalancerMember "http://10.0.0.1:8080" hcmethod=GET hcuri=/health hcinterval=10 hcpasses=2 hcfails=3 hcexpr=hc200ok
    BalancerMember "http://10.0.0.2:8080" hcmethod=GET hcuri=/health hcinterval=10 hcpasses=2 hcfails=3 hcexpr=hc200ok

    ProxySet lbmethod=bybusyness
</Proxy>

hcmethod Options

Value     Description
GET       HTTP GET health check
HEAD      HTTP HEAD (no body, lightweight)
OPTIONS   HTTP OPTIONS
TCP       TCP connection only (no HTTP needed)
CPING     AJP CPING protocol (Tomcat-specific)

Application Health Check Endpoint Implementation

For the load balancer's health checks to be meaningful, the backend application must return correct health check responses.

Spring Boot Actuator (Java)

// build.gradle
implementation 'org.springframework.boot:spring-boot-starter-actuator'

# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health
  endpoint:
    health:
      show-details: always

Spring Boot Actuator automatically creates the /actuator/health endpoint:

{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "diskSpace": {"status": "UP"},
    "ping": {"status": "UP"}
  }
}

When the DB connection is lost, it automatically returns "status": "DOWN" with HTTP 503.

Custom Health Check Endpoint (Node.js)

const express = require('express');
const app = express();

// Assumes `db` is an already-configured database client (e.g. a pg Pool)
const db = require('./db');

async function checkDatabase() {
  try {
    await db.query('SELECT 1');
    return { status: 'UP' };
  } catch (e) {
    return { status: 'DOWN', error: e.message };
  }
}

app.get('/health', async (req, res) => {
  const dbStatus = await checkDatabase();

  const health = {
    status: dbStatus.status === 'UP' ? 'UP' : 'DOWN',
    timestamp: new Date().toISOString(),
    components: {
      database: dbStatus,
    },
  };

  const statusCode = health.status === 'UP' ? 200 : 503;
  res.status(statusCode).json(health);
});

app.listen(8080);
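A useful extension of this endpoint (a sketch, not part of the original example) is to report DOWN during shutdown so the load balancer drains the node before the process exits. The /health handler above would call `shutdownStatus()` first and return early when it is non-null; the function names and the 15-second grace period are illustrative choices:

```javascript
// Sketch: report DOWN while shutting down so the load balancer drains this
// node before the process exits. Names and the 15s grace period are
// illustrative, not from the original example.
let shuttingDown = false;

// Called at the top of the /health handler; null means "proceed with the
// normal dependency checks".
function shutdownStatus() {
  return shuttingDown
    ? { code: 503, body: { status: 'DOWN', reason: 'shutting down' } }
    : null;
}

// Call once at startup to hook process termination.
function installDrainHandler() {
  process.on('SIGTERM', () => {
    shuttingDown = true; // /health now returns 503; LB stops routing here
    // Give the LB time to notice (roughly fails Γ— interval) before exiting.
    setTimeout(() => process.exit(0), 15000);
  });
}
```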

Health Check Endpoint Design Principles

βœ… Good health check design:
- GET /health β†’ 200 OK (service healthy)
- GET /health β†’ 503 Service Unavailable (service unavailable)
- Checks critical dependencies such as the DB connection
- Response time < 5 seconds (responds before the check timeout)
- No authentication required (easy load balancer access)

❌ Poor health check:
- Always returns 200 OK (meaningless)
- Fails on every dependency, including non-critical external services (too strict; a transient dependency issue pulls healthy servers out of rotation)
- Contains slow logic (timeout errors)
- Requires IP restriction or authentication (blocks load balancer access)

Production Health Check Strategy

Layered Health Check Configuration

[Load Balancer]
β”‚
β”œβ”€β”€ Shallow Check (fast, every 30 seconds)
β”‚ GET /health/ping β†’ 200 OK
β”‚ (only checks if application process is alive)
β”‚
└── Deep Check (slow, every 5 minutes via external monitoring)
GET /health/deep β†’ 200 OK
(checks DB, Redis, external API connections)

The load balancer performs only the shallow check. Deep checks are handled by external monitoring tools such as Prometheus or Nagios, which raise alerts rather than removing nodes from rotation.
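The two tiers can be kept as small pure functions wired to the /health/ping and /health/deep routes (a sketch; the checker callbacks passed to `deepHealth` are hypothetical placeholders):

```javascript
// Sketch of the two tiers as plain functions; wire them to GET /health/ping
// and GET /health/deep in the framework of your choice.

// Shallow: no I/O at all, so it is always fast.
function pingHealth() {
  return { statusCode: 200, body: { status: 'UP' } };
}

// Deep: run every named dependency checker and aggregate the results.
// `checkers` maps a component name to an async function returning
// { status: 'UP' | 'DOWN', ... } (hypothetical checker shape).
async function deepHealth(checkers) {
  const components = {};
  for (const [name, check] of Object.entries(checkers)) {
    try {
      components[name] = await check();
    } catch (e) {
      components[name] = { status: 'DOWN', error: e.message };
    }
  }
  const up = Object.values(components).every((c) => c.status === 'UP');
  return {
    statusCode: up ? 200 : 503,
    body: { status: up ? 'UP' : 'DOWN', components },
  };
}
```

Keeping the shallow check free of I/O is the point: a slow database must never make the load balancer think the process itself is dead.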

Health Check Log Management

Health check requests filling access logs create noise. Filter them out:

# Nginx: exclude /health requests from logs
map $request_uri $loggable {
    ~^/health 0;
    default 1;
}

access_log /var/log/nginx/access.log combined if=$loggable;

# Apache: exclude /health requests from logs
SetEnvIf Request_URI "^/health$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog

Recommended parameters by environment:

Environment                interval  fails  passes  fail_timeout
High availability service  5s        2      3       10s
General web service        10s       3      2       30s
Batch/internal service     30s       3      2       60s

Key principles:

  • fails too low β†’ excluded by transient delays (false positives)
  • fails too high β†’ delayed failure detection (false negatives)
  • passes too low β†’ unstable servers re-enter too early (flapping)

The next page covers gradual traffic shifting (canary deployment) and failed node recovery strategies.