# Performance Optimization Pro Tips
Setting up caching, compression, and HTTP/2 is only the start. To finish the job, you need to validate with load tests how much faster things actually got, and to pinpoint the remaining bottlenecks precisely.
## Load Testing Tool Comparison
| Tool | Highlights | Protocol | Best For |
|---|---|---|---|
| ab (Apache Bench) | Simple, quick to install | HTTP/1.1 | Fast baseline measurement |
| wrk | High-performance, Lua scripting | HTTP/1.1 | Throughput & latency |
| k6 | JS scenarios, visualization | HTTP/1.1, HTTP/2 | Scenario-based load tests |
| h2load | HTTP/2 dedicated | HTTP/2 | HTTP/2 performance validation |
| locust | Python, distributed | HTTP | Complex user-flow scenarios |
### ab (Apache Bench)
```shell
# Basic usage
# -n: total requests, -c: concurrent connections, -k: Keep-Alive
ab -n 10000 -c 100 -k https://example.com/api/products

# Reading the output
# Requests per second: 1234.56 [#/sec]   ← throughput (higher is better)
# Time per request:    81.00 [ms]        ← mean latency
# Transfer rate:       456.78 [Kbytes/sec]

# Percentile distribution (p99 > 1000 ms needs attention)
#  50%     45 ms
#  75%     78 ms
#  90%    120 ms
#  95%    180 ms
#  99%    450 ms
# 100%   1200 ms (slowest request)

# POST request test (-T sets the Content-Type for the POST body)
ab -n 5000 -c 50 -k \
  -T "application/json" \
  -p /tmp/payload.json \
  https://example.com/api/orders
```
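As a sanity check, ab's mean latency and throughput are linked: Time per request ≈ concurrent connections ÷ requests per second. Plugging in the sample figures above (`-c 100` at 1234.56 req/s):

```shell
# Mean latency ≈ concurrency / throughput, in ms
awk 'BEGIN { printf "%.2f ms\n", 100 / 1234.56 * 1000 }'
# → 81.00 ms
```

If these two numbers disagree in a real run, part of the request time is being spent queueing rather than being served.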
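The `/tmp/payload.json` file referenced above has to exist before the run; its contents here are purely illustrative:

```shell
# Create a sample JSON body for the POST test (fields are made up)
cat > /tmp/payload.json << 'EOF'
{"userId": 1, "productId": 100, "qty": 2}
EOF
```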
### wrk – High-Performance Benchmark
```shell
# Install
sudo apt install wrk

# Basic usage
# -t: threads, -c: connections, -d: duration
wrk -t4 -c100 -d30s https://example.com/api/products

# Running 30s test @ https://example.com/api/products
#   4 threads and 100 connections
#
#   Thread Stats   Avg      Stdev     Max   +/- Stdev
#     Latency    45.67ms   12.34ms  234.56ms   87.65%
#     Req/Sec    567.89     45.67   678.90     78.90%
#   68145 requests in 30s, 12.34MB read
# Requests/sec:   2271.50
# Transfer/sec:    421.12KB

# POST request with Lua script
cat > /tmp/post.lua << 'EOF'
wrk.method = "POST"
wrk.headers["Content-Type"] = "application/json"
wrk.body = '{"userId": 1, "productId": 100, "qty": 2}'
EOF
wrk -t4 -c50 -d30s -s /tmp/post.lua https://example.com/api/orders

# Show latency percentile distribution
wrk -t4 -c100 -d30s --latency https://example.com/
```
### k6 – Scenario-Based Load Testing
```shell
# Install (k6 is not in the default Ubuntu repositories; add Grafana's
# k6 apt repository first, per the official install docs)
sudo apt install k6
```
```javascript
// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

const errorRate = new Rate('errors');

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // ramp up to 50 users over 1 min
    { duration: '3m', target: 200 },  // ramp from 50 to 200 users over 3 min
    { duration: '1m', target: 0 },    // ramp down over 1 min
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95th percentile < 500 ms
    errors: ['rate<0.01'],            // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://example.com/api/products', {
    headers: { 'Accept': 'application/json' },
  });
  check(res, {
    'status 200': (r) => r.status === 200,
    'response < 500ms': (r) => r.timings.duration < 500,
  });
  errorRate.add(res.status !== 200);
  sleep(1);
}
```
```shell
# Run
k6 run load-test.js

# Stream results to Grafana (InfluxDB)
k6 run --out influxdb=http://localhost:8086/k6 load-test.js
```
### h2load – HTTP/2 Performance
```shell
# Install
sudo apt install nghttp2-client

# HTTP/2 load test
# -n: total requests, -c: connections, -m: concurrent streams per connection
h2load -n 10000 -c 50 -m 10 https://example.com/api/products

# finished in 5.23s, 1912.04 req/s, 2.34MB/s
# requests: 10000 total, 10000 succeeded
# status codes: 10000 2xx
# time for request: min=5ms max=234ms mean=52ms
```
## Bottleneck Analysis Methodology
### Step 1: Decompose Response Time
```shell
# Each curl timing is cumulative from the start of the request
curl -o /dev/null -s -w "
DNS:         %{time_namelookup}s
TCP Connect: %{time_connect}s
TLS:         %{time_appconnect}s
TTFB:        %{time_starttransfer}s
Total:       %{time_total}s
Size:        %{size_download} bytes
" https://example.com/api/products
```
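Because curl reports each timing as cumulative, the cost of an individual phase is the delta between adjacent values. A small sketch with made-up sample timings (seconds):

```shell
# Hypothetical cumulative timings from one curl run
dns=0.012; connect=0.034; tls=0.078; ttfb=0.245; total=0.260

awk -v d="$dns" -v c="$connect" -v a="$tls" -v s="$ttfb" -v t="$total" 'BEGIN {
  printf "DNS lookup   : %.3fs\n", d       # time_namelookup
  printf "TCP handshake: %.3fs\n", c - d   # time_connect - time_namelookup
  printf "TLS handshake: %.3fs\n", a - c   # time_appconnect - time_connect
  printf "Server (TTFB): %.3fs\n", s - a   # time_starttransfer - time_appconnect
  printf "Transfer     : %.3fs\n", t - s   # time_total - time_starttransfer
}'
```

Here the server phase (0.167 s) dwarfs everything else, which would point the investigation at the backend rather than the network.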
### Step 2: Separate Nginx vs Backend Time
```nginx
log_format perf '$remote_addr [$time_local] '
                '"$request" $status '
                'req=$request_time '
                'upstream=$upstream_response_time '
                'cache=$upstream_cache_status';

# Nginx overhead = $request_time - $upstream_response_time
# A large gap here means Nginx itself is the bottleneck (check cache, compression)
```
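With that format in place, the overhead can be averaged straight from the access log. A sketch against a couple of hypothetical log lines (the field layout matches the `perf` format above; cache hits report `upstream=-` and are skipped):

```shell
# Two made-up log lines in the perf format
cat > /tmp/perf.log << 'EOF'
10.0.0.1 [01/Jan/2025:12:00:00 +0000] "GET /api/products HTTP/1.1" 200 req=0.250 upstream=0.240 cache=MISS
10.0.0.2 [01/Jan/2025:12:00:01 +0000] "GET /api/products HTTP/1.1" 200 req=0.002 upstream=- cache=HIT
EOF

# Average Nginx overhead (request_time - upstream_response_time) on proxied requests
awk 'match($0, /req=[0-9.]+/) {
       r = substr($0, RSTART + 4, RLENGTH - 4)
       if (match($0, /upstream=[0-9.]+/)) {
         u = substr($0, RSTART + 9, RLENGTH - 9)
         sum += r - u; n++
       }
     }
     END { if (n) printf "avg overhead: %.3fs over %d request(s)\n", sum / n, n }' /tmp/perf.log
# → avg overhead: 0.010s over 1 request(s)
```

Run this periodically (or per deploy) and an overhead creeping above a few milliseconds is an early signal to inspect the Nginx tier itself.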
### Step 3: Diagnose the Bottleneck Location
If response is slow, walk this checklist:

1. DNS slow? → check DNS TTL and caching
2. TCP/TLS slow? → enable Keep-Alive, TLS session cache
3. TTFB slow? → backend logic, DB queries
4. Transfer slow? → response size (compression), network bandwidth

If TTFB is slow (backend bottleneck):

- DB slow query log: `EXPLAIN ANALYZE`
- N+1 query problem
- Missing timeout on external API calls
- Thread pool exhaustion (Tomcat maxThreads exceeded)

If transfer is slow (network/size bottleneck):

- Compression not enabled (check gzip/brotli)
- Cache not applied (`X-Cache-Status: MISS`)
- Oversized response payload (check pagination)
## Performance SLO Targets
| Metric | Target | How to Improve |
|---|---|---|
| TTFB | < 200 ms | Caching, DB query optimization |
| p95 response time | < 500 ms | Thread pool tuning, caching |
| p99 response time | < 1000 ms | Timeouts, circuit breakers |
| Error rate | < 0.1% | Health checks, auto-retry |
| Compression ratio | > 60% (text) | Verify gzip/brotli settings |
| Cache hit rate | > 80% | Optimize proxy_cache config |
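The compression-ratio target in the table can be spot-checked by comparing response sizes with and without `Accept-Encoding`. A helper sketch (the curl calls are shown as comments; the sample byte counts are made up):

```shell
# compression_ratio RAW_BYTES COMPRESSED_BYTES → percent saved
compression_ratio() {
  awk -v r="$1" -v g="$2" 'BEGIN { printf "%.1f%%\n", (1 - g / r) * 100 }'
}

# In practice the two sizes come from curl, e.g.:
#   raw=$(curl -so /dev/null -w '%{size_download}' "$URL")
#   gz=$(curl -so /dev/null -w '%{size_download}' -H 'Accept-Encoding: gzip' "$URL")
compression_ratio 120000 34000
# → 71.7%  (above the 60% target for text)
```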
## Automated Performance Gate (CI/CD)
```shell
#!/bin/bash
# performance-gate.sh – validate performance before deployment
ENDPOINT="https://staging.example.com/api/products"
MAX_P99_MS=1000
MIN_RPS=100

echo "=== Performance Gate Check ==="
result=$(wrk -t4 -c100 -d30s --latency "$ENDPOINT" 2>&1)

# wrk --latency prints the 50/75/90/99 percentiles, so gate on p99
p99=$(echo "$result" | grep "99%" | awk '{print $2}' | sed 's/ms//')
rps=$(echo "$result" | grep "Requests/sec" | awk '{print $2}')

echo "p99: ${p99}ms (threshold: ${MAX_P99_MS}ms)"
echo "RPS: ${rps} (threshold: ${MIN_RPS})"

if (( $(echo "$p99 > $MAX_P99_MS" | bc -l) )); then
  echo "❌ Performance gate FAILED: p99 ${p99}ms > ${MAX_P99_MS}ms"
  exit 1
fi
if (( $(echo "$rps < $MIN_RPS" | bc -l) )); then
  echo "❌ Performance gate FAILED: ${rps} req/s < ${MIN_RPS}"
  exit 1
fi
echo "✅ Performance gate PASSED"
```