Health Check Configuration: Automatic Failure Detection and Node Removal
In a load-balanced environment, health checks are an indispensable core feature. When a backend server fails, the load balancer must detect it automatically, stop routing traffic to it, and bring it back automatically once it recovers. Without health checks, traffic keeps flowing to the failed server and users receive error responses directly.
Passive vs Active Health Checks
There are two types of health checks.
Passive Health Check
Monitors responses to actual user requests. Automatically excludes a server when errors exceed a threshold.
User request → App1 (healthy) ✅
User request → App2 (error) ❌ count 1
User request → App2 (error) ❌ count 2
User request → App2 (error) ❌ count 3 → threshold reached → App2 excluded
Subsequent requests → App1, App3 only
Advantage: Detects failures from real traffic without generating extra requests. Disadvantage: Real users receive errors until the threshold is reached, so detection is delayed.
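The counting logic behind the scenario above can be sketched in a few lines. This is an illustrative model, not Nginx's actual implementation (class and method names are invented for this example): a backend is excluded once its consecutive-failure count reaches a threshold, and any success resets the count.

```javascript
// Illustrative model of passive health-check failure counting.
// A backend is excluded once consecutive failures reach `threshold`.
class PassiveHealthTracker {
  constructor(threshold) {
    this.threshold = threshold;
    this.failures = new Map(); // backend -> consecutive failure count
  }

  // Record the outcome of a real user request routed to `backend`.
  record(backend, ok) {
    if (ok) {
      this.failures.set(backend, 0); // any success resets the count
    } else {
      this.failures.set(backend, (this.failures.get(backend) || 0) + 1);
    }
  }

  isHealthy(backend) {
    return (this.failures.get(backend) || 0) < this.threshold;
  }
}

// Replaying the scenario above: three errors from App2 exclude it.
const tracker = new PassiveHealthTracker(3);
tracker.record('App1', true);
tracker.record('App2', false);
tracker.record('App2', false);
tracker.record('App2', false);
console.log(tracker.isHealthy('App1')); // true
console.log(tracker.isHealthy('App2')); // false
```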
Active Health Check
The load balancer periodically sends requests to a dedicated health check endpoint to proactively verify server health.
[Load Balancer]
   │ (every 30 seconds)
   ├──▶ App1 /health → 200 OK ✅ (healthy)
   ├──▶ App2 /health → Connection refused ❌ → auto exclude
   └──▶ App3 /health → 200 OK ✅ (healthy)
Advantage: Proactive detection without user impact and much faster failure detection. Disadvantage: Generates extra traffic, and support varies: HAProxy and AWS ALB include active checks, while open source Nginx reserves them for the commercial Nginx Plus.
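The pass/fail bookkeeping an active checker performs can be sketched as below. This is a hedged model of the general technique, not any specific load balancer's code: `probe` stands in for the periodic HTTP request to `/health` (any async function resolving `true` when healthy), so the fails/passes state machine can be shown without a real network.

```javascript
// Illustrative active health-check state machine: exclude after
// `fails` consecutive failed probes, readmit after `passes`
// consecutive successful ones.
function createActiveChecker({ fails = 3, passes = 2, probe }) {
  let healthy = true;
  let failCount = 0;
  let passCount = 0;

  return {
    async check() {
      let ok;
      try {
        ok = await probe(); // e.g. GET /health returning 200
      } catch {
        ok = false;         // connection refused / timeout
      }
      if (ok) {
        passCount += 1;
        failCount = 0;
        if (!healthy && passCount >= passes) healthy = true; // recovered
      } else {
        failCount += 1;
        passCount = 0;
        if (healthy && failCount >= fails) healthy = false;  // excluded
      }
      return healthy;
    },
    isHealthy: () => healthy,
  };
}

// In production this would run on a timer, e.g.:
// setInterval(() => checker.check(), 10_000);
```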
Nginx Passive Health Check
Open source Nginx does not support active health checks. Instead, configure passive health checks with max_fails and fail_timeout.
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:8080 max_fails=3 fail_timeout=30s;
}
How It Works
max_fails (3) failures within fail_timeout (30s)
→ the server is excluded for fail_timeout (30s)
→ after 30s, Nginx routes one live request to it as a trial
→ success: back in rotation / failure: excluded for another 30s
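The sliding-window and exclusion-timer behavior described above can be modeled explicitly. This is a sketch under stated assumptions (Nginx's internal accounting differs in detail, and the `Upstream` class here is invented for illustration); time is passed in as a plain number of seconds to keep the logic easy to follow.

```javascript
// Model of max_fails / fail_timeout: `maxFails` failures within a
// `failTimeout`-second window exclude the server for `failTimeout`
// seconds, after which it becomes eligible again.
class Upstream {
  constructor({ maxFails = 3, failTimeout = 30 }) {
    this.maxFails = maxFails;
    this.failTimeout = failTimeout; // seconds
    this.failures = [];             // timestamps of recent failures
    this.excludedUntil = -Infinity;
  }

  isAvailable(now) {
    return now >= this.excludedUntil;
  }

  recordFailure(now) {
    // Keep only failures inside the sliding fail_timeout window.
    this.failures = this.failures.filter((t) => now - t < this.failTimeout);
    this.failures.push(now);
    if (this.failures.length >= this.maxFails) {
      this.excludedUntil = now + this.failTimeout; // excluded for 30s
      this.failures = [];
    }
  }

  recordSuccess() {
    this.failures = [];
  }
}

const app2 = new Upstream({ maxFails: 3, failTimeout: 30 });
app2.recordFailure(0);
app2.recordFailure(5);
app2.recordFailure(10);            // 3rd failure within 30s
console.log(app2.isAvailable(20)); // false: excluded until t=40
console.log(app2.isAvailable(45)); // true: eligible for a trial request
```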
What Counts as a Failure
By default, only connection errors and timeouts count as failures. To also count HTTP error codes (502, 503, etc.), configure proxy_next_upstream:
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_next_upstream_timeout 10s;
    }
}
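Conceptually, `proxy_next_upstream` means: on an error, timeout, or listed 5xx from one backend, replay the request against the next one, up to a bounded number of tries. A minimal sketch of that retry semantics (illustrative only; `sendTo` is a hypothetical stand-in for the proxied request, any async function returning an HTTP status code):

```javascript
// Sketch of proxy_next_upstream-style retrying: skip to the next
// backend on connection errors or retryable status codes, up to
// `tries` attempts; if all fail, surface the last error.
async function proxyWithRetry(
  backends,
  sendTo,
  { tries = 3, retryOn = [500, 502, 503, 504] } = {}
) {
  let lastError;
  for (let i = 0; i < Math.min(tries, backends.length); i++) {
    const backend = backends[i];
    try {
      const status = await sendTo(backend);
      if (!retryOn.includes(status)) {
        return { backend, status };  // success: stop retrying
      }
      lastError = new Error(`${backend} returned ${status}`);
    } catch (e) {
      lastError = e;                 // connection error/timeout: try next
    }
  }
  throw lastError;                   // all tries exhausted
}
```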
Nginx Plus Active Health Check
Nginx Plus (commercial) supports active health checks.
upstream backend {
    zone backend 64k;  # Shared memory zone (required for active health checks)
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
}

server {
    location / {
        proxy_pass http://backend;
        health_check interval=10s     # Check every 10 seconds
                     fails=3          # Exclude after 3 consecutive failures
                     passes=2         # Return after 2 consecutive successes
                     uri=/health      # Health check endpoint
                     match=server_ok; # Response validation condition
    }
}

# Define response validation conditions
match server_ok {
    status 200;
    header Content-Type ~ "application/json";
    body ~ '"status":"UP"';  # Response body pattern matching
}
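The `match` block expresses an AND of three conditions: status, header, and body must all pass for the response to count as healthy. A sketch of that validation logic in plain code (illustrative; the response shape here is an assumption for the example):

```javascript
// Equivalent of the server_ok match block: status 200 AND a JSON
// Content-Type AND a body containing "status":"UP".
function matchesServerOk(res) {
  // res: { status, headers: { name: value }, body }
  return (
    res.status === 200 &&
    /application\/json/.test(res.headers['content-type'] || '') &&
    /"status":"UP"/.test(res.body)
  );
}

matchesServerOk({
  status: 200,
  headers: { 'content-type': 'application/json' },
  body: '{"status":"UP"}',
}); // healthy
```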
Apache mod_proxy_hcheck (Active Health Check)
Apache 2.4.21+ supports active health checks via the mod_proxy_hcheck module.
sudo a2enmod proxy_hcheck
sudo systemctl reload apache2
<Proxy "balancer://mycluster">
    # hcmethod:   health check HTTP method
    # hcuri:      health check endpoint URI
    # hcinterval: check interval (seconds)
    # hcpasses:   return after N consecutive successes
    # hcfails:    exclude after N consecutive failures
    # hcexpr:     response validation expression
    # (Apache does not allow trailing comments, and multi-line
    #  directives need backslash continuation)
    BalancerMember "http://10.0.0.1:8080" \
        hcmethod=GET hcuri=/health hcinterval=10 \
        hcpasses=2 hcfails=3 hcexpr=hc200ok
    BalancerMember "http://10.0.0.2:8080" \
        hcmethod=GET hcuri=/health hcinterval=10 \
        hcpasses=2 hcfails=3 hcexpr=hc200ok
    ProxySet lbmethod=bybusyness
</Proxy>
# Define the expressions before the <Proxy> block that references them.
# 2xx or 3xx response status = healthy
ProxyHCExpr hc200ok {%{REQUEST_STATUS} =~ /^[23]/}
# JSON body validation
ProxyHCExpr hcjsonok {hc('body') =~ /"status"\s*:\s*"UP"/}
hcmethod Options

| Value | Description |
|---|---|
| GET | HTTP GET health check |
| HEAD | HTTP HEAD (no body, lightweight) |
| OPTIONS | HTTP OPTIONS |
| TCP | TCP connection only (no HTTP needed) |
| CPING | AJP CPING protocol (Tomcat-specific) |
Application Health Check Endpoint Implementation
For the load balancer's health checks to be meaningful, the backend application must return correct health check responses.
Spring Boot Actuator (Java)
// build.gradle
implementation 'org.springframework.boot:spring-boot-starter-actuator'
# application.yml
management:
  endpoints:
    web:
      exposure:
        include: health
  endpoint:
    health:
      show-details: always
Spring Boot Actuator automatically creates the /actuator/health endpoint:
{
  "status": "UP",
  "components": {
    "db": {"status": "UP"},
    "diskSpace": {"status": "UP"},
    "ping": {"status": "UP"}
  }
}
When the DB connection is lost, it automatically returns "status": "DOWN" with HTTP 503.
Custom Health Check Endpoint (Node.js)
const express = require('express');
const app = express();

// Assumes a DB client (e.g. a pg Pool or mysql2 connection) has been
// initialized elsewhere and is available as `db`.
async function checkDatabase() {
  try {
    await db.query('SELECT 1');
    return { status: 'UP' };
  } catch (e) {
    return { status: 'DOWN', error: e.message };
  }
}

app.get('/health', async (req, res) => {
  const dbStatus = await checkDatabase();
  const health = {
    status: dbStatus.status === 'UP' ? 'UP' : 'DOWN',
    timestamp: new Date().toISOString(),
    components: {
      database: dbStatus,
    },
  };
  const statusCode = health.status === 'UP' ? 200 : 503;
  res.status(statusCode).json(health);
});

app.listen(8080);
Health Check Endpoint Design Principles
✅ Good health check design:
- GET /health → 200 OK (service healthy)
- GET /health → 503 Service Unavailable (service unavailable)
- Checks critical dependencies such as the DB connection
- Response time under 5 seconds (responds before the load balancer's timeout)
- No authentication required (easy load balancer access)
❌ Poor health check design:
- Always returns 200 OK (meaningless)
- Fails on every dependency, critical or not (too strict, spurious removals)
- Contains slow logic (timeout errors)
- Requires IP restriction or authentication (blocks load balancer access)
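The "responds before the timeout" principle can be enforced mechanically by racing each dependency check against a deadline, so one hung dependency cannot hang the whole `/health` endpoint. A minimal sketch (the `withTimeout` helper and the budget values are assumptions for this example, usable with a check like the `checkDatabase` function above):

```javascript
// Race a dependency check against a deadline: if the check does not
// settle within `ms`, report DOWN instead of hanging the endpoint.
function withTimeout(promise, ms, label) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ status: 'DOWN', error: `${label} timed out` }),
      ms
    );
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage: even a check that never resolves yields DOWN within 2s.
// const dbStatus = await withTimeout(checkDatabase(), 2000, 'database');
```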
Production Health Check Strategy
Layered Health Check Configuration
[Load Balancer]
   │
   ├── Shallow Check (fast, every 30 seconds)
   │     GET /health/ping → 200 OK
   │     (only checks that the application process is alive)
   │
   └── Deep Check (slow, every 5 minutes via external monitoring)
         GET /health/deep → 200 OK
         (checks DB, Redis, external API connections)
The load balancer performs only the shallow check. Deep checks are run by external monitoring tools such as Prometheus or Nagios, which raise alerts instead of removing nodes from rotation.
Health Check Log Management
Health check requests filling access logs create noise. Filter them out:
# Nginx: exclude /health requests from logs
map $request_uri $loggable {
    ~^/health  0;
    default    1;
}
access_log /var/log/nginx/access.log combined if=$loggable;
# Apache: exclude /health requests from logs
SetEnvIf Request_URI "^/health$" dontlog
CustomLog /var/log/apache2/access.log combined env=!dontlog
Recommended Health Check Settings
| Environment | interval | fails | passes | fail_timeout |
|---|---|---|---|---|
| High availability service | 5s | 2 | 3 | 10s |
| General web service | 10s | 3 | 2 | 30s |
| Batch/internal service | 30s | 3 | 2 | 60s |
Key principles:
- `fails` too low → excluded by transient delays (false positives)
- `fails` too high → delayed failure detection (false negatives)
- `passes` too low → unstable servers re-enter too early (flapping)
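These trade-offs can be roughly quantified: with active checks, worst-case detection time is about interval × fails, and readmission after recovery takes about interval × passes. A small sketch (the helper name and the simplification, ignoring probe timeouts, are assumptions of this example):

```javascript
// Rough worst-case timings for active health checks:
// detection  ≈ interval × fails
// readmission ≈ interval × passes
function detectionDelays({ intervalSec, fails, passes }) {
  return {
    worstCaseDetectSec: intervalSec * fails,
    worstCaseRecoverSec: intervalSec * passes,
  };
}

// High-availability row from the table: interval 5s, fails=2, passes=3
// → detection within ~10s, readmission within ~15s.
detectionDelays({ intervalSec: 5, fails: 2, passes: 3 });
```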
The next page covers gradual traffic shifting (canary deployment) and failed node recovery strategies.