High Availability (HA) Architecture Design
In modern services, failures are an unavoidable reality. Countless causes — server hardware defects, software bugs, network disconnections, and data center power outages — can bring down a service. High Availability (HA) is a design methodology that ensures services remain continuously available even during failure scenarios. This chapter systematically covers everything from HA concepts and SPOF analysis to redundancy strategies and practical stack configurations.
What Is High Availability?
High Availability means the ability of a system to continue operating without service interruption (or with minimal interruption) even when failures occur. Availability is generally expressed with the following formula:
Availability (%) = (Uptime / Total Time) × 100
For example, targeting 99.9% annual availability allows approximately 8.76 hours of downtime per year. This figure is formalized in SLAs (Service Level Agreements) and managed as a contract between service providers and customers.
Allowable Downtime by Availability Target
| Availability Level | Annual Downtime | Monthly Downtime | Weekly Downtime | Daily Downtime |
|---|---|---|---|---|
| 99% (Two Nines) | 87.6 hours | 7.3 hours | 1.68 hours | 14.4 minutes |
| 99.9% (Three Nines) | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
| 99.99% (Four Nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes | 8.6 seconds |
| 99.999% (Five Nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds | 0.86 seconds |
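The formula and the table above can be verified with a quick calculation. A minimal sketch (class and method names are illustrative):

```java
// Minimal sketch: convert an availability target into allowed downtime.
// Class and method names are illustrative, not from a specific library.
public class DowntimeCalculator {

    // Allowed downtime in minutes for a target availability (e.g. 99.9)
    // over a period given in hours (a year is 8,760 hours).
    static double allowedDowntimeMinutes(double availabilityPercent, double periodHours) {
        double unavailableFraction = 1.0 - availabilityPercent / 100.0;
        return periodHours * 60.0 * unavailableFraction;
    }

    public static void main(String[] args) {
        // 99.9% over a year: 8,760 h * 60 * 0.001 = 525.6 min ≈ 8.76 hours
        System.out.printf("99.9%%  yearly: %.2f hours%n",
                allowedDowntimeMinutes(99.9, 8760) / 60.0);
        // 99.99% over a year: ≈ 52.6 minutes
        System.out.printf("99.99%% yearly: %.1f minutes%n",
                allowedDowntimeMinutes(99.99, 8760));
    }
}
```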
Most commercial services target between 99.9% and 99.99%. Mission-critical domains such as finance and healthcare may require 99.999% (Five Nines), which allows only about 5 minutes of downtime per year and therefore demands extremely sophisticated design.
The Cost of Downtime
Service interruptions go beyond mere inconvenience — they translate directly into business losses.
Direct costs: Revenue loss (an hour-long outage at a major e-commerce platform can mean enormous revenue losses), SLA penalty payments, recovery labor costs.
Indirect costs: Brand trust erosion, customer churn, reputational damage from media coverage, regulatory sanctions.
AWS suffered a major outage in 2021 that impacted thousands of services simultaneously, and Meta's roughly six-hour outage in October 2021 was estimated to cost tens of millions of dollars in lost revenue. These examples justify the investment in HA.
SPOF (Single Point of Failure) Analysis
A SPOF is a single component whose failure causes the entire system to stop. The core of HA design is identifying and eliminating all SPOFs.
Web Server Layer SPOF
When only a single Nginx server is running, an Nginx process crash, server hardware failure, or network card defect blocks all traffic.
[Before - SPOF Present]
Client → Nginx (single) → Tomcat

[After - SPOF Eliminated]
Client → VIP (Keepalived) → Nginx Master
                          → Nginx Backup (standby)
WAS (Application Server) Layer SPOF
A single Tomcat instance becomes a SPOF due to service interruption during deployment, JVM OutOfMemoryError, or process termination from application bugs.
DB Layer SPOF
The DB is the most severe SPOF. Simple redundancy is difficult due to data integrity requirements, and failures risk data loss.
[DB SPOF Resolution]
Primary DB ──(synchronous replication)──▶ Standby DB
    │
    └── On failure, Standby is promoted to Primary
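A minimal sketch of the streaming replication settings behind this diagram, assuming PostgreSQL 12 or later (the replication user and application_name are placeholder assumptions):

```ini
# postgresql.conf on the Primary (192.168.1.30)
wal_level = replica
synchronous_commit = on
synchronous_standby_names = 'standby1'   # commits wait for this standby

# postgresql.auto.conf on the Standby (192.168.1.31)
primary_conninfo = 'host=192.168.1.30 port=5432 user=replicator application_name=standby1'
```

With synchronous replication, a commit is acknowledged only after the standby has received it, which is what makes promotion on failure safe against data loss.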
Load Balancer Layer SPOF
To prevent the ironic situation where the load balancer itself becomes a SPOF, the load balancer must also be redundant. Using Keepalived + VRRP to share a VIP (Virtual IP) is the standard approach.
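A minimal keepalived.conf sketch for the Master node, following the VIP used in this chapter (the interface name, virtual_router_id, and password are placeholder assumptions; the Backup node uses state BACKUP and priority 90):

```nginx
# /etc/keepalived/keepalived.conf (Master)
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100          # higher priority holds the VIP
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100/24
    }
}
```

When the Master stops sending VRRP advertisements, the Backup promotes itself and takes over the VIP automatically.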
Network Layer SPOF
Network equipment such as switches, NICs, and uplinks can also become SPOFs. Resolve with NIC Bonding, dual-switch configurations, and multi-ISP connections.
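As one possible sketch of NIC Bonding on a Linux host using netplan (the file path, interface names, and address are assumptions):

```yaml
# /etc/netplan/01-bond.yaml - active-backup NIC bonding sketch
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: false
    eno2:
      dhcp4: false
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      addresses: [192.168.1.10/24]
      parameters:
        mode: active-backup          # traffic fails over to the second NIC
        mii-monitor-interval: 100    # link checked every 100 ms
```

For full redundancy, each NIC in the bond should connect to a different switch.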
Active-Active vs Active-Standby
HA configurations fall into two main patterns.
Active-Active
All nodes are simultaneously active and processing traffic.
          Client
            │
            ▼
      Load Balancer
         ┌──┴──┐
         ▼     ▼
      Node A  Node B   ← Both handling actual requests
Advantages:
- 100% resource utilization (no standby server waste)
- Easy horizontal scaling (Scale-Out)
- When one node fails, remaining nodes handle all traffic
Disadvantages:
- Session sharing mechanism required (session clustering or stateless design)
- Complex synchronization logic to prevent DB write conflicts
- High configuration and operational complexity
Suitable scenarios: Stateless REST API servers, content serving servers, read-only DB replicas
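At the load-balancer level, Active-Active is simply the default upstream behavior. A minimal Nginx sketch (IPs follow this chapter's examples):

```nginx
# Both nodes actively serve traffic via round-robin (Nginx's default policy)
upstream api_backend {
    server 192.168.1.20:8080;
    server 192.168.1.21:8080;
}
```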
Active-Standby
Only one node is active and processing traffic; the remaining node is on standby.
          Client
            │
            ▼
      Load Balancer
         ┌──┴──┐
         ▼     │
      Node A  Node B   ← On standby, auto-promoted on failure
     (Active) (Standby)
Advantages:
- Simple configuration (no session sharing needed)
- Easy to maintain data consistency
- Well-proven pattern, high operational understanding
Disadvantages:
- Standby server resource waste
- Brief service interruption during failover (seconds to tens of seconds)
- Standby failures can go unnoticed until failover actually needs the Standby
Suitable scenarios: DB Primary-Standby, session-based servers requiring state maintenance, small-scale infrastructure
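Nginx can express Active-Standby directly with the `backup` server parameter. A minimal sketch (IPs follow this chapter's examples):

```nginx
upstream app_backend {
    server 192.168.1.20:8080;           # Active: receives all traffic
    server 192.168.1.21:8080 backup;    # Standby: used only when the active server is down
}
```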
Comparison Summary
| Item | Active-Active | Active-Standby |
|---|---|---|
| Resource efficiency | High (100% utilization) | Low (Standby waste) |
| Failover speed | Immediate | Seconds to tens of seconds |
| Configuration complexity | High | Low |
| Session handling | Shared storage required | Simple |
| Cost | HA without extra servers | Extra server costs |
| Scalability | Excellent | Limited |
HA Architecture Design Principles
Redundancy
Configure all single components with a minimum of two instances. The N+1 principle (required count + 1 spare) is the baseline; N+2 applies to critical systems.
N+1 Principle Examples:
- 2 Nginx servers (serve with 1 when 1 fails)
- 3 Tomcat servers (serve with 2 when 1 fails)
- 2 DB servers (Primary + Standby)
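The N+1 sizing above reduces to a small calculation. A minimal sketch (class and method names are illustrative):

```java
// Minimal sketch of N+1 sizing: enough nodes that peak load still fits
// after one node fails. Names and the example numbers are illustrative.
public class CapacityPlanner {

    // peakRps: peak requests per second; nodeRps: capacity of one node
    static int nodesWithNPlusOne(double peakRps, double nodeRps) {
        int required = (int) Math.ceil(peakRps / nodeRps); // N: nodes needed at peak
        return required + 1;                               // N + 1: one spare
    }

    public static void main(String[] args) {
        // 2,500 rps peak, 1,000 rps per node -> N = 3, so provision 4
        System.out.println(nodesWithNPlusOne(2500, 1000)); // prints 4
    }
}
```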
Failover
The mechanism that automatically redirects traffic to healthy servers after failure detection. Manual failover results in long MTTR (Mean Time To Recovery), making HA targets difficult to achieve.
Automatic failover flow:
- Health check failure detected (e.g., 3 consecutive failures)
- Server removed from load balancer pool
- Backup server activated (Standby promoted to Active)
- Alert sent (PagerDuty, Slack, etc.)
- Manual recovery after root cause analysis
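The detection step above can be sketched as a TCP probe plus a consecutive-failure counter. This is a minimal illustration, not a production monitor; the threshold mirrors the "3 consecutive failures" example and the class name is an assumption:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Minimal sketch of failure detection: a TCP probe and a
// consecutive-failure counter (threshold of 3 is an assumption).
public class HealthMonitor {
    static final int FALL_THRESHOLD = 3;

    private int consecutiveFailures = 0;
    private boolean healthy = true;

    // One L4 probe: true if a TCP connection succeeds within timeoutMs.
    boolean probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Record one probe result and return the node's current health state.
    boolean record(boolean success) {
        if (success) {
            consecutiveFailures = 0;
            healthy = true;            // recovers on the first success
        } else if (++consecutiveFailures >= FALL_THRESHOLD) {
            healthy = false;           // remove from pool, trigger failover
        }
        return healthy;
    }

    public static void main(String[] args) {
        HealthMonitor monitor = new HealthMonitor();
        monitor.record(false);
        monitor.record(false);
        System.out.println(monitor.record(false)); // third failure: prints false
    }
}
```

A real load balancer applies the same logic per backend; a "rise" threshold (as in the Nginx example later in this chapter) can be added symmetrically for recovery.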
Auto Recovery
The mechanism that automatically restarts or replaces failed servers. systemd's Restart=always, Kubernetes Pod auto-restart, and AWS Auto Scaling instance replacement are representative examples.
# systemd auto-restart configuration example
# (the StartLimit* directives belong in [Unit] on modern systemd)
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
ExecStart=/opt/tomcat/bin/catalina.sh run
Restart=always
RestartSec=5
Graceful Shutdown
Complete in-progress requests before shutting down during failures or deployments. Abrupt termination can corrupt in-progress transactions.
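In a Spring Boot (2.3+) application like the Tomcat examples in this chapter, graceful shutdown can be enabled declaratively; the 30-second grace period here is an assumption:

```ini
# application.properties - finish in-flight requests before stopping
server.shutdown=graceful
spring.lifecycle.timeout-per-shutdown-phase=30s
```

On shutdown, the embedded server stops accepting new connections but allows active requests up to the configured timeout to complete.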
Health Check Strategies
Health checks are the core nervous system of an HA system. Misconfigured health checks can result in removing healthy servers or leaving failed servers in place.
L4 Health Check (TCP Level)
Verifies only TCP connection success. Fast and lightweight, but cannot confirm actual application behavior.
# Nginx stream module: TCP load balancing with passive health checks
# (a server is marked down after max_fails failed connections within fail_timeout)
stream {
    upstream tomcat_backend {
        server 192.168.1.10:8080 max_fails=3 fail_timeout=10s;
        server 192.168.1.11:8080 max_fails=3 fail_timeout=10s;
    }
}
L7 Health Check (HTTP Level)
Sends actual HTTP requests to verify response codes and content. Accurately captures the real state of the application.
# Nginx upstream health check (nginx_upstream_check_module)
upstream tomcat_backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;
    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
Health Check Endpoint Design
// Spring Boot health check controller
@RestController
public class HealthController {

    @Autowired
    private DataSource dataSource;

    @GetMapping("/health")
    public ResponseEntity<Map<String, String>> health() {
        Map<String, String> status = new HashMap<>();
        // try-with-resources returns the connection to the pool after the check
        try (Connection conn = dataSource.getConnection()) {
            boolean dbUp = conn.isValid(1); // 1-second validation timeout
            status.put("status", dbUp ? "UP" : "DOWN");
            status.put("db", dbUp ? "connected" : "disconnected");
            return dbUp ? ResponseEntity.ok(status)
                        : ResponseEntity.status(503).body(status);
        } catch (Exception e) {
            status.put("status", "DOWN");
            status.put("db", "disconnected");
            return ResponseEntity.status(503).body(status);
        }
    }
}
Practical HA Stack Example
Keepalived + Nginx + Tomcat + PostgreSQL HA Configuration
┌─────────────────────────────────┐
│        VIP: 192.168.1.100       │
│    (Keepalived VRRP managed)    │
└────────────────┬────────────────┘
                 │
      ┌──────────┴─────────────┐
      ▼                        ▼
┌─────────────────────┐   ┌─────────────────────┐
│    Nginx Master     │   │    Nginx Backup     │
│    192.168.1.10     │   │    192.168.1.11     │
│ (VRRP Priority 100) │   │ (VRRP Priority 90)  │
└──────────┬──────────┘   └─────────────────────┘
           │ (VIP transferred on failure)
           ▼
┌──────────────────────────────────────────────┐
│      Nginx Upstream (Load Balancing)         │
└────────────┬─────────────────┬───────────────┘
             ▼                 ▼
    ┌────────────────┐   ┌────────────────┐
    │ Tomcat Node 1  │   │ Tomcat Node 2  │
    │  192.168.1.20  │   │  192.168.1.21  │
    │     :8080      │   │     :8080      │
    └────────┬───────┘   └───────┬────────┘
             └─────────┬─────────┘
                       ▼
┌──────────────────────────────────────────────────┐
│          PostgreSQL Primary-Standby              │
│ Primary: 192.168.1.30 ──▶ Standby: 192.168.1.31  │
└──────────────────────────────────────────────────┘
This configuration eliminates SPOFs at every layer. When the Nginx Master fails, Keepalived automatically transfers the VIP to the Backup. When a Tomcat node fails, Nginx upstream automatically removes it. When the PostgreSQL Primary fails, the Standby is promoted to Primary.
Important Considerations for HA Design
Split-Brain Problem: In Active-Active configurations, a network partition can cause both nodes to believe they are the Primary. Prevent this with Quorum-based decision-making and Fencing mechanisms.
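The quorum rule reduces to simple arithmetic: a partition may keep acting as Primary only if it can reach a strict majority of the cluster, so two partitions can never both have quorum. A minimal sketch (class name illustrative):

```java
// Majority quorum: only a partition holding a strict majority of the
// cluster may act as Primary, preventing split-brain double promotion.
public class Quorum {

    static boolean hasQuorum(int reachableNodes, int clusterSize) {
        return reachableNodes >= clusterSize / 2 + 1;
    }

    public static void main(String[] args) {
        // 3-node cluster split 2/1: only the 2-node side retains quorum
        System.out.println(hasQuorum(2, 3)); // true
        System.out.println(hasQuorum(1, 3)); // false
    }
}
```

This is also why HA clusters prefer odd node counts: in a 4-node cluster a 2/2 split leaves neither side with quorum.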
Thundering Herd: After failure recovery, all requests flood in at once. Mitigate with gradual traffic transition and cache warm-up.
Operational Complexity: HA configurations increase operational complexity. Choose a level appropriate for your team's capabilities and service scale.
Pro Tips
- Focus on reducing MTTR (Mean Time To Recovery) rather than availability numbers. Failures will always occur.
- Use Chaos Engineering to periodically simulate failures and validate recovery procedures.
- Design health check endpoints to include the connection status of external dependencies (DB, cache).
- Always pair automatic failover with alerts so the operations team is notified.
- Track actual user experience-based availability metrics (synthetic monitoring, RUM) rather than SLA figures.