High Availability (HA) Architecture Design

In modern services, failures are an unavoidable reality. Countless causes — server hardware defects, software bugs, network disconnections, and data center power outages — can bring down a service. High Availability (HA) is a design methodology that ensures services remain continuously available even during failure scenarios. This chapter systematically covers everything from HA concepts and SPOF analysis to redundancy strategies and practical stack configurations.

What Is High Availability?

High Availability means the ability of a system to continue operating without service interruption (or with minimal interruption) even when failures occur. Availability is generally expressed with the following formula:

Availability (%) = (Uptime / Total Time) × 100

For example, targeting 99.9% annual availability allows approximately 8.76 hours of downtime per year. This figure is formalized in SLAs (Service Level Agreements) and managed as a contract between service providers and customers.
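The formula can be inverted into a quick downtime-budget calculation. A minimal sketch (class and method names are mine, not from the chapter):

```java
// Downtime budget: (1 - availability/100) x period
public class DowntimeBudget {
    public static double allowedDowntimeHours(double availabilityPct, double periodHours) {
        return (1 - availabilityPct / 100.0) * periodHours;
    }

    public static void main(String[] args) {
        // 99.9% over a 365-day year (8760 hours) → about 8.76 hours
        System.out.printf("%.2f hours%n", allowedDowntimeHours(99.9, 8760));
    }
}
```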

Allowable Downtime by Availability Target

| Availability Level | Annual Downtime | Monthly Downtime | Weekly Downtime | Daily Downtime |
|---|---|---|---|---|
| 99% (Two Nines) | 87.6 hours | 7.3 hours | 1.68 hours | 14.4 minutes |
| 99.9% (Three Nines) | 8.76 hours | 43.8 minutes | 10.1 minutes | 1.44 minutes |
| 99.99% (Four Nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes | 8.6 seconds |
| 99.999% (Five Nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds | 0.86 seconds |

Most commercial services target between 99.9% and 99.99%. Mission-critical services such as finance and healthcare may require 99.999% (Five Nines). Achieving Five Nines permits only about five minutes of downtime per year, demanding extremely sophisticated design.

The Cost of Downtime

Service interruptions go beyond mere inconvenience — they translate directly into business losses.

Direct costs: Revenue loss (an hour-long outage at a large e-commerce platform can mean hundreds of millions of won in lost sales), SLA penalty payments, recovery labor costs.

Indirect costs: Brand trust erosion, customer churn, reputational damage from media coverage, regulatory sanctions.

AWS's December 2021 us-east-1 outage disrupted thousands of dependent services simultaneously, and Meta's roughly six-hour outage in October 2021 was estimated to have cost the company tens of millions of dollars in revenue. These examples justify the investment in HA.

SPOF (Single Point of Failure) Analysis

A SPOF is a single component whose failure causes the entire system to stop. The core of HA design is identifying and eliminating all SPOFs.

Web Server Layer SPOF

When only a single Nginx server is running, an Nginx process crash, server hardware failure, or network card defect blocks all traffic.

[Before - SPOF Present]
Client → Nginx (single) → Tomcat

[After - SPOF Eliminated]
Client → VIP (Keepalived) ──▶ Nginx Master (active)
                           └─▶ Nginx Backup (standby)

WAS (Application Server) Layer SPOF

A single Tomcat instance becomes a SPOF due to service interruption during deployment, JVM OutOfMemoryError, or process termination from application bugs.

DB Layer SPOF

The DB is the most severe SPOF. Simple redundancy is difficult due to data integrity requirements, and failures risk data loss.

[DB SPOF Resolution]
Primary DB ──(synchronous replication)──▶ Standby DB
                                           └── On failure, Standby is promoted to Primary

Load Balancer Layer SPOF

To prevent the ironic situation where the load balancer itself becomes a SPOF, the load balancer must also be redundant. Using Keepalived + VRRP to share a VIP (Virtual IP) is the standard approach.
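As a sketch, a minimal Keepalived configuration for the Master node might look like the following (the interface name, virtual_router_id, and the tracked script are assumptions; the Backup node uses state BACKUP and a lower priority):

```
# /etc/keepalived/keepalived.conf on the Master (sketch)
vrrp_script chk_nginx {
    script "/usr/bin/pgrep nginx"   # check fails if the nginx process is gone
    interval 2
    fall 3
}

vrrp_instance VI_1 {
    state MASTER              # BACKUP on the standby node
    interface eth0
    virtual_router_id 51
    priority 100              # e.g. 90 on the standby node
    advert_int 1
    virtual_ipaddress {
        192.168.1.100
    }
    track_script {
        chk_nginx
    }
}
```

When the check fails three times, the Master stops advertising and the Backup claims the VIP via VRRP.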

Network Layer SPOF

Network equipment such as switches, NICs, and uplinks can also become SPOFs. Resolve with NIC Bonding, dual-switch configurations, and multi-ISP connections.
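NIC Bonding can be expressed declaratively; a Netplan sketch for active-backup bonding of two NICs (interface names and addresses are assumptions):

```yaml
# /etc/netplan/01-bond.yaml (sketch) — active-backup NIC bonding
network:
  version: 2
  ethernets:
    eth0: {}
    eth1: {}
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      parameters:
        mode: active-backup          # one NIC carries traffic, the other stands by
        mii-monitor-interval: 100    # link checked every 100 ms
      addresses: [192.168.1.10/24]
```

Plugging each NIC into a different switch removes the switch as a SPOF as well.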

Active-Active vs Active-Standby

HA configurations fall into two main patterns.

Active-Active

All nodes are simultaneously active and processing traffic.

       Client
         │
         ▼
   Load Balancer
       ┌─┴─┐
       ▼   ▼
   Node A   Node B   ← Both handling actual requests

Advantages:

  • 100% resource utilization (no standby server waste)
  • Easy horizontal scaling (Scale-Out)
  • When one node fails, remaining nodes handle all traffic

Disadvantages:

  • Session sharing mechanism required (session clustering or stateless design)
  • Complex synchronization logic to prevent DB write conflicts
  • High configuration and operational complexity

Suitable scenarios: Stateless REST API servers, content serving servers, read-only DB replicas

Active-Standby

Only one node is active and processing traffic; the remaining node is on standby.

       Client
         │
         ▼
   Load Balancer
       ┌─┴─┐
       ▼   │
   Node A   Node B    ← On standby, auto-promoted on failure
  (Active)  (Standby)

Advantages:

  • Simple configuration (no session sharing needed)
  • Easy to maintain data consistency
  • Well-proven pattern, high operational understanding

Disadvantages:

  • Standby server resource waste
  • Brief service interruption during failover (seconds to tens of seconds)
  • Easy to miss Standby server failures

Suitable scenarios: DB Primary-Standby, session-based servers requiring state maintenance, small-scale infrastructure

Comparison Summary

| Item | Active-Active | Active-Standby |
|---|---|---|
| Resource efficiency | High (100% utilization) | Low (Standby waste) |
| Failover speed | Immediate | Seconds to tens of seconds |
| Configuration complexity | High | Low |
| Session handling | Shared storage required | Simple |
| Cost | HA without extra servers | Extra server costs |
| Scalability | Excellent | Limited |

HA Architecture Design Principles

Redundancy

Configure all single components with a minimum of two instances. The N+1 principle (required count + 1 spare) is the baseline; N+2 applies to critical systems.

N+1 Principle Examples:
- 2 Nginx servers (serve with 1 when 1 fails)
- 3 Tomcat servers (serve with 2 when 1 fails)
- 2 DB servers (Primary + Standby)

Failover

The mechanism that automatically redirects traffic to healthy servers after failure detection. Manual failover results in long MTTR (Mean Time To Recovery), making HA targets difficult to achieve.

Automatic failover flow:

  1. Health check failure detected (e.g., 3 consecutive failures)
  2. Server removed from load balancer pool
  3. Backup server activated (Standby promoted to Active)
  4. Alert sent (PagerDuty, Slack, etc.)
  5. Manual recovery after root cause analysis
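The detection step above ("3 consecutive failures") mirrors what the load balancer does internally. A minimal sketch in Java (class name, threshold, and endpoint handling are my assumptions, not a specific load balancer's API):

```java
// Sketch: consecutive-failure detection, as a load balancer's health checker does it
import java.net.HttpURLConnection;
import java.net.URL;

public class HealthMonitor {
    private int consecutiveFailures = 0;
    private static final int FALL_THRESHOLD = 3;  // remove node after 3 failures in a row

    /** One active probe: HTTP 200 within the timeout counts as healthy. */
    public boolean probe(String healthUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(healthUrl).openConnection();
            conn.setConnectTimeout(1000);
            conn.setReadTimeout(1000);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }

    /** Returns true when the node should be removed from the pool. */
    public boolean record(boolean healthy) {
        consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
        return consecutiveFailures >= FALL_THRESHOLD;
    }
}
```

Requiring several consecutive failures avoids flapping: a single slow response does not eject a healthy node.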

Auto Recovery

The mechanism that automatically restarts or replaces failed servers. systemd's Restart=always, Kubernetes Pod auto-restart, and AWS Auto Scaling instance replacement are representative examples.

# systemd auto-restart configuration example
# (on systemd 229+, the start-rate limit belongs in [Unit])
[Unit]
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
ExecStart=/opt/tomcat/bin/catalina.sh run
Restart=always
RestartSec=5

Graceful Shutdown

Complete in-progress requests before shutting down during failures or deployments. Abrupt termination can corrupt in-progress transactions.
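In a Spring Boot-based WAS, for example, graceful shutdown is configuration only (available since Spring Boot 2.3):

```
# application.properties — stop accepting new requests, finish in-flight ones
server.shutdown=graceful
spring.lifecycle.timeout-per-shutdown-phase=30s
```

Combined with the load balancer draining the node first, this lets deployments proceed without dropping a single request.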

Health Check Strategies

Health checks are the core nervous system of an HA system. Misconfigured health checks can result in removing healthy servers or leaving failed servers in place.

L4 Health Check (TCP Level)

Verifies only TCP connection success. Fast and lightweight, but cannot confirm actual application behavior.

# Nginx stream module (L4): open-source Nginx supports only passive checks here —
# a server is marked down after repeated failed TCP connects
stream {
    upstream tomcat_backend {
        server 192.168.1.10:8080 max_fails=3 fail_timeout=10s;
        server 192.168.1.11:8080 max_fails=3 fail_timeout=10s;
    }
}

L7 Health Check (HTTP Level)

Sends actual HTTP requests to verify response codes and content. Accurately captures the real state of the application.

# Nginx upstream health check (nginx_upstream_check_module)
upstream tomcat_backend {
    server 192.168.1.10:8080;
    server 192.168.1.11:8080;

    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

Health Check Endpoint Design

// Spring Boot health check controller
import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class HealthController {

    @Autowired
    private DataSource dataSource;

    @GetMapping("/health")
    public ResponseEntity<Map<String, String>> health() {
        Map<String, String> status = new HashMap<>();
        // Verify the DB connection; try-with-resources returns it to the pool
        try (Connection conn = dataSource.getConnection()) {
            if (!conn.isValid(1)) {
                throw new IllegalStateException("DB connection invalid");
            }
            status.put("status", "UP");
            status.put("db", "connected");
            return ResponseEntity.ok(status);
        } catch (Exception e) {
            status.put("status", "DOWN");
            status.put("db", "disconnected");
            return ResponseEntity.status(503).body(status);
        }
    }
}
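As an alternative to a hand-rolled controller, Spring Boot Actuator ships an equivalent /actuator/health endpoint that already includes a DataSource check; enabling it is configuration only:

```
# application.properties — Spring Boot Actuator health endpoint
management.endpoints.web.exposure.include=health
management.endpoint.health.show-details=always
```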

Practical HA Stack Example

Keepalived + Nginx + Tomcat + PostgreSQL HA Configuration

              ┌─────────────────────────────┐
              │      VIP: 192.168.1.100     │
              │   (Keepalived VRRP managed) │
              └──────────────┬──────────────┘
                             │
              ┌──────────────┴──────────────┐
              ▼                             ▼
  ┌─────────────────────┐       ┌─────────────────────┐
  │    Nginx Master     │       │    Nginx Backup     │
  │    192.168.1.10     │       │    192.168.1.11     │
  │ (VRRP Priority 100) │       │ (VRRP Priority 90)  │
  └──────────┬──────────┘       └─────────────────────┘
             │  (VIP transferred on failure)
             ▼
  ┌──────────────────────────────────────────────┐
  │       Nginx Upstream (Load Balancing)        │
  └────────────┬─────────────────┬───────────────┘
               ▼                 ▼
      ┌────────────────┐  ┌────────────────┐
      │ Tomcat Node 1  │  │ Tomcat Node 2  │
      │ 192.168.1.20   │  │ 192.168.1.21   │
      │     :8080      │  │     :8080      │
      └────────┬───────┘  └───────┬────────┘
               └────────┬─────────┘
                        ▼
  ┌──────────────────────────────────────────────────┐
  │           PostgreSQL Primary-Standby             │
  │ Primary: 192.168.1.30 ──▶ Standby: 192.168.1.31  │
  └──────────────────────────────────────────────────┘

This configuration eliminates SPOFs at every layer. When the Nginx Master fails, Keepalived automatically transfers the VIP to the Backup. When a Tomcat node fails, Nginx upstream automatically removes it. When the PostgreSQL Primary fails, the Standby is promoted to Primary.
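The Primary-Standby replication at the bottom of the stack can be sketched with a few PostgreSQL settings (the standby name and connection details are assumptions):

```
# postgresql.conf on the Primary — streaming replication sketch
wal_level = replica
max_wal_senders = 5
synchronous_commit = on
synchronous_standby_names = 'standby1'   # makes replication synchronous

# On the Standby (PostgreSQL 12+): create an empty standby.signal file and set
# primary_conninfo = 'host=192.168.1.30 user=replicator' (e.g. in postgresql.auto.conf)
```

Synchronous replication trades some write latency for the guarantee that a committed transaction survives Primary failure.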

Important Considerations for HA Design

Split-Brain Problem: In Active-Active configurations, a network partition can cause both nodes to believe they are the Primary. Prevent this with Quorum-based decision-making and Fencing mechanisms.
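Quorum-based decision-making reduces to a majority test: a partition may keep acting as Primary only if it can reach a strict majority of cluster members, which is why clusters use odd node counts (3, 5). A minimal sketch (names are mine):

```java
// Quorum check used to prevent split-brain: only the partition holding a
// strict majority of nodes may continue as Primary.
public class Quorum {
    public static boolean hasQuorum(int reachableNodes, int clusterSize) {
        return reachableNodes * 2 > clusterSize;
    }
}
```

With an even cluster size, a clean half/half partition leaves neither side with quorum, so the whole cluster stops; odd sizes guarantee at most one side can ever win.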

Thundering Herd: After failure recovery, all requests flood in at once. Mitigate with gradual traffic transition and cache warm-up.

Operational Complexity: HA configurations increase operational complexity. Choose a level appropriate for your team's capabilities and service scale.

Pro Tips

  • Focus on reducing MTTR (Mean Time To Recovery) rather than availability numbers. Failures will always occur.
  • Use Chaos Engineering to periodically simulate failures and validate recovery procedures.
  • Design health check endpoints to include the connection status of external dependencies (DB, cache).
  • Always pair automatic failover with alerts so the operations team is notified.
  • Track actual user experience-based availability metrics (synthetic monitoring, RUM) rather than SLA figures.