# Pro Tips — Monitoring Design Strategy
Installing monitoring tools correctly and designing monitoring well are two entirely different things. It is surprisingly common to see systems where thousands of alerts fire yet critical incidents go unnoticed, or where elaborate dashboards exist that provide no useful signal during an outage. This chapter covers monitoring design strategies validated in real production environments.
## Reducing Alert Fatigue
Alert fatigue is the phenomenon where engineers who are bombarded with too many notifications begin to silence or ignore even the important ones. According to a 2019 Google SRE report, excessive alerting volume causes teams to start disabling alerts rather than addressing them.
### Alert Priority Matrix
Classify every alert into one of four priority levels.
| Priority | Definition | Examples | Response |
|---|---|---|---|
| P1 - Critical | Immediate service outage impact | Full server down, DB unreachable | Immediate phone/SMS + page |
| P2 - High | Severe service quality degradation | Error rate > 5%, p99 > 10s | Slack + respond within 15 min |
| P3 - Medium | Potential issue, no immediate impact | Disk > 80%, memory warning | Slack + respond during business hours |
| P4 - Low | Informational / trend tracking | Deploy completed, cert expires in 30 days | Daily digest report |
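The routing column of the matrix maps directly to code. A minimal sketch — the `Alert` shape, the channel names, and the `ROUTES` table are illustrative, not from any particular alerting tool:

```python
from dataclasses import dataclass

# Priority → notification channels, per the matrix above
ROUTES = {
    "P1": ["phone", "sms", "pager"],  # immediate page
    "P2": ["slack"],                  # respond within 15 min
    "P3": ["slack"],                  # business hours
    "P4": ["daily_digest"],           # informational
}

@dataclass
class Alert:
    name: str
    priority: str  # "P1".."P4"

def route(alert: Alert) -> list[str]:
    """Return the notification channels for an alert's priority."""
    return ROUTES.get(alert.priority, ["slack"])  # unknown → safe default

print(route(Alert("DBUnreachable", "P1")))  # ['phone', 'sms', 'pager']
```

Keeping the routing table in one place makes it easy to audit which alerts can actually wake someone up.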
### Threshold-Setting Principles
Poorly chosen thresholds are the leading cause of alert fatigue.
```yaml
# Wrong — absolute threshold (fires constantly at certain hours)
- alert: HighResponseTime
  expr: http_response_time_seconds > 0.5

# Correct — detect anomaly relative to a dynamic baseline
- alert: AnomalousResponseTime
  expr: |
    http_response_time_seconds
      > on (job) group_left()
    (avg_over_time(http_response_time_seconds[1h] offset 1w) * 2)
  for: 5m
  annotations:
    summary: "Response time is 2x higher than same time last week"

# Correct — average latency derived from the request counters
- alert: ResponseTimeSpike
  expr: |
    rate(http_request_duration_seconds_sum[5m])
      / rate(http_request_duration_seconds_count[5m])
      > 2
  for: 3m
  labels:
    severity: warning
```
Golden rules for threshold setting:
- Always use a `for` clause: it prevents false positives from momentary spikes. Require the condition to persist for at least 2–5 minutes.
- Set thresholds after collecting sufficient history: analyze at least two weeks of data and base thresholds on p95/p99 values.
- The goal is fewer alerts: if you have too many alerts, silence half of them; if still too many, silence half again.
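The second rule — derive thresholds from history rather than guesswork — takes only the standard library. A sketch; the synthetic lognormal samples and the 1.5x multiplier are assumptions for illustration:

```python
import random
import statistics

random.seed(42)
# Synthetic stand-in for two weeks of per-minute latency samples (seconds)
samples = [random.lognormvariate(mu=-1.5, sigma=0.5) for _ in range(14 * 24 * 60)]

# quantiles(..., n=100) returns the 1st..99th percentile cut points
cuts = statistics.quantiles(samples, n=100)
p95, p99 = cuts[94], cuts[98]

# Alert only well beyond normal variation, e.g. at 1.5x the observed p99
threshold = p99 * 1.5
print(f"p95={p95:.3f}s  p99={p99:.3f}s  threshold={threshold:.3f}s")
```

Re-run the analysis periodically: a threshold derived from last quarter's traffic drifts out of date.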
## SLO, SLI, and SLA — Defining Monitoring Objectives

### Definitions
SLA (Service Level Agreement) is a contract between the service provider and the customer. Example: "Provide a refund if monthly uptime falls below 99.9%."
SLO (Service Level Objective) is the internal target the team commits to achieving. Set slightly stricter than the SLA to create a safety buffer. Example: "Maintain 99.95% monthly uptime."
SLI (Service Level Indicator) is the actual measurement used to track the SLO. Example: "Successful requests / total requests."
```
SLA: 99.9%  (contract with customer — breach means penalty)
        ↑
SLO: 99.95% (internal target — includes safety buffer)
        ↑
SLI: actual measured value (computed via Prometheus, etc.)
```
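The relationship between the three levels reduces to two comparisons. A tiny sketch — the `assess` helper and its status labels are illustrative:

```python
SLA = 0.999    # contractual floor — breach means penalty
SLO = 0.9995   # internal target, stricter to leave a buffer

def assess(sli: float) -> str:
    """Compare a measured SLI against the internal SLO and contractual SLA."""
    if sli < SLA:
        return "SLA BREACH"   # contract violated
    if sli < SLO:
        return "SLO MISS"     # inside the safety buffer — act before the SLA goes
    return "OK"

print(assess(0.9997))  # OK
print(assess(0.9992))  # SLO MISS
print(assess(0.9985))  # SLA BREACH
```

The middle band is the whole point of the buffer: an SLO miss is an early warning, not a contractual incident.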
### Computing SLIs with Prometheus
```yaml
# prometheus/rules/slo.yml
groups:
  - name: slo-rules
    interval: 30s
    rules:
      # ─── SLI: Availability ───────────────────────────────
      # Success rate excluding 5xx responses
      - record: job:http_requests:success_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m])) by (job)
            /
          sum(rate(http_requests_total[5m])) by (job)

      # ─── SLI: Latency ────────────────────────────────────
      # Fraction of requests completing in under 300ms
      - record: job:http_requests:latency_ok_rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
            /
          sum(rate(http_request_duration_seconds_count[5m])) by (job)

      # ─── SLO breach alert ────────────────────────────────
      - alert: SLOAvailabilityBreach
        expr: job:http_requests:success_rate5m < 0.9995  # below 99.95%
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO breach: availability {{ $value | humanizePercentage }}"
          description: "Service {{ $labels.job }} availability dropped below 99.95%"
```
### Error Budget-Based Alerting Strategy
The error budget is the total amount of failure that can be tolerated while still meeting the SLO.
```
Monthly SLO: 99.9%
Total minutes in a month (30 days): 43,200
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes
```
When the error budget is exhausted:
- Halt new feature deployments
- Focus entirely on stability
```python
#!/usr/bin/env python3
# error_budget_calculator.py
# Error budget burn rate calculator

def calculate_error_budget(slo_target: float, total_requests: int,
                           failed_requests: int) -> dict:
    """
    slo_target: SLO target as a decimal (e.g., 0.999 = 99.9%)
    total_requests: total number of requests in the period
    failed_requests: number of failed requests in the period
    """
    allowed_failures = total_requests * (1 - slo_target)
    actual_success_rate = (total_requests - failed_requests) / total_requests
    error_budget_remaining = (allowed_failures - failed_requests) / allowed_failures * 100
    return {
        "slo_target": f"{slo_target * 100:.2f}%",
        "actual_success_rate": f"{actual_success_rate * 100:.4f}%",
        "allowed_failures": int(allowed_failures),
        "actual_failures": failed_requests,
        "error_budget_remaining": f"{error_budget_remaining:.1f}%",
        "status": "OK" if error_budget_remaining > 0 else "EXHAUSTED",
    }

if __name__ == "__main__":
    # Example: 800 failures out of 1 million requests this month
    result = calculate_error_budget(
        slo_target=0.999,
        total_requests=1_000_000,
        failed_requests=800,
    )
    for key, value in result.items():
        print(f"{key:30s}: {value}")

# Output:
# slo_target                    : 99.90%
# actual_success_rate           : 99.9200%
# allowed_failures              : 1000
# actual_failures               : 800
# error_budget_remaining        : 20.0%
# status                        : OK
```
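Burn rate is the other half of the picture: how many times faster than sustainable the budget is being consumed, and how long it will last at the current rate. A sketch assuming a 30-day window and a 99.9% SLO (the function names are illustrative):

```python
SLO = 0.999
WINDOW_HOURS = 30 * 24  # 720 hours in a 30-day window

def burn_rate(error_rate: float, slo: float = SLO) -> float:
    """How many times faster than sustainable the budget is burning.
    At 1x, the budget lasts exactly the full window."""
    return error_rate / (1 - slo)

def hours_to_exhaustion(error_rate: float, budget_left: float = 1.0) -> float:
    """Hours until the remaining budget fraction is gone at the current rate."""
    return WINDOW_HOURS * budget_left / burn_rate(error_rate)

print(round(burn_rate(0.005), 2))             # 5.0  — a 0.5% error rate
print(round(hours_to_exhaustion(0.0144), 1))  # 50.0 — ~2 days at a 14.4x burn
```

This is why 14.4 shows up as a critical threshold: a sustained 14.4x burn consumes 2% of a 30-day budget per hour and exhausts it in roughly two days.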
Alert strategy based on error budget burn rate:
```yaml
# prometheus/rules/error-budget.yml
groups:
  - name: error-budget
    rules:
      # Budget burning 5x faster than sustainable → warning
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (1 - job:http_requests:success_rate5m)
            / (1 - 0.999) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning 5x faster than expected"

      # Budget burning 14.4x faster → critical
      # (2% of a 30-day budget per hour; exhausted in ~2 days at this rate)
      - alert: ErrorBudgetAlmostExhausted
        expr: |
          (1 - job:http_requests:success_rate5m)
            / (1 - 0.999) > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning 14.4x faster than sustainable — exhausted within days at current rate"
```
## Golden Signals
The four core metrics defined by Google's SRE team for monitoring any service.
### 1. Latency
Time taken to service a request. Always separate successful requests from failed requests. Errors often return very quickly, so including them can make latency appear lower than it actually is.
```yaml
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])
    ) > 1.0
  for: 5m
  annotations:
    summary: "p99 latency for successful requests exceeds 1s"
```
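A quick synthetic demonstration of why the `status!~"5.."` filter matters — fast-failing errors dilute the percentile. The sample latencies are made up, and the error volume is exaggerated to make the effect obvious:

```python
import statistics

# Synthetic data: 2% of successful requests are slow; errors fail fast
successes = [0.200] * 980 + [1.500] * 20
errors = [0.005] * 1500

def p99(samples: list[float]) -> float:
    """99th-percentile cut point of a list of latency samples."""
    return statistics.quantiles(samples, n=100)[98]

print(round(p99(successes), 3))            # 1.5 — the slow tail is visible
print(round(p99(successes + errors), 3))   # 0.2 — fast errors hide the tail
```

The mixed-in errors push the genuine slow tail out of the top percentile entirely, which is exactly how a failing service can look "faster" on a naive dashboard.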
### 2. Traffic
The amount of demand placed on the system. Measured as requests per second (RPS), bytes per second, etc.
```yaml
- alert: UnexpectedTrafficDrop
  expr: |
    rate(http_requests_total[5m])
      < (avg_over_time(rate(http_requests_total[5m])[1h:5m]) * 0.5)
  for: 5m
  annotations:
    summary: "Traffic dropped below 50% of 1-hour average — possible upstream issue"
```
### 3. Errors
The rate of requests that fail. Track both explicit errors (5xx) and implicit errors (incorrect content returned with 200 OK).
```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      /
    sum(rate(http_requests_total[5m])) by (job)
    > 0.01  # error rate exceeds 1%
  for: 3m
```
### 4. Saturation
How "full" the service is. Performance degrades as resources approach their limits.
```yaml
# CPU saturation
- alert: HighCPUSaturation
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
  for: 10m

# JVM heap saturation
- alert: JVMHeapPressure
  expr: |
    jvm_memory_used_bytes{area="heap"}
      / jvm_memory_max_bytes{area="heap"}
      > 0.9
  for: 5m
  annotations:
    summary: "JVM heap usage above 90% — GC pressure likely"
```
## Nginx/Tomcat Core Metrics Checklist

### Essential Nginx Metrics
```nginx
# nginx.conf — enable the stub_status module
server {
    listen 8080;

    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        allow 172.16.0.0/12;  # Docker network
        deny all;
    }
}
```
| Metric | Meaning | Alert Threshold |
|---|---|---|
| `nginx_connections_active` | Currently active connections | 80% of `worker_connections` |
| `nginx_connections_waiting` | Keep-alive idle connections | High value → review timeout settings |
| `nginx_http_requests_total` | Total requests (compute RPS) | Abnormal spikes or drops |
| 5xx error rate | Server error ratio | Alert if > 1% |
| `nginx_ingress_upstream_latency` | Upstream response time | p99 > 500ms |
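The stub_status page is plain text, so scraping it without a full exporter takes only a few lines. A sketch with an illustrative sample payload — `parse_stub_status` is a hypothetical helper, not part of any exporter:

```python
import re

# Illustrative stub_status response body
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text: str) -> dict:
    """Parse nginx stub_status output into a flat metrics dict."""
    lines = text.strip().splitlines()
    active = int(lines[0].split(":")[1])
    accepts, handled, requests = (int(x) for x in lines[2].split())
    reading, writing, waiting = (int(x) for x in re.findall(r"\d+", lines[3]))
    return {"active": active, "accepts": accepts, "handled": handled,
            "requests": requests, "reading": reading, "writing": writing,
            "waiting": waiting}

metrics = parse_stub_status(SAMPLE)
print(metrics["active"], metrics["requests"], metrics["waiting"])  # 291 31070465 106
```

In practice the nginx-prometheus-exporter does this for you; the parser just shows how little is behind the metrics in the table.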
### Essential Tomcat Metrics
```xml
<!-- server.xml — configure connection limits -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="200"
           minSpareThreads="10"
           maxConnections="10000"
           acceptCount="100" />
```
| Metric | Meaning | Alert Threshold |
|---|---|---|
| `tomcat_threads_busy` | Currently processing threads | 80% of `maxThreads` |
| `tomcat_threads_config_max` | Max thread count | Reference only |
| `tomcat_connections_current` | Current active connections | 90% of `maxConnections` |
| `tomcat_global_request_count` | Requests processed | Monitor RPS trend |
| `tomcat_global_error_count` | Error count | Alert on increasing trend |
| JVM GC time ratio | GC overhead | Alert if > 5% of total time |
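The two ratio thresholds from the table fold naturally into a single health check. A sketch — `check_tomcat` is an illustrative helper; the 80%/90% ratios are the ones listed above:

```python
def check_tomcat(busy: int, max_threads: int,
                 conns: int, max_conns: int) -> list[str]:
    """Return alert messages per the checklist thresholds."""
    alerts = []
    if busy >= 0.8 * max_threads:
        alerts.append("tomcat_threads_busy >= 80% of maxThreads")
    if conns >= 0.9 * max_conns:
        alerts.append("tomcat_connections_current >= 90% of maxConnections")
    return alerts

print(check_tomcat(165, 200, 9500, 10000))  # both thresholds crossed
print(check_tomcat(100, 200, 1000, 10000))  # []
```

Ratios beat absolute numbers here: the same check works unchanged when `maxThreads` is tuned per environment.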
## Dashboard Design Principles: Layered Views
A well-designed dashboard answers "what is wrong?" quickly. Structure dashboards across three layers.
**Layer 1: Business View**
- Order conversion rate, revenue, active users
- Audience: business executives
- Refresh: every 5 minutes
**Layer 2: Application View**
- Per-endpoint performance (Golden Signals)
- Error rate, latency, RPS
- Audience: dev team / SRE
- Refresh: every 1 minute
**Layer 3: Infrastructure View**
- CPU, memory, disk, network
- Per-server detail metrics
- Audience: infrastructure team
- Refresh: every 30 seconds
Using Grafana dashboard variables:
```json
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "type": "custom",
        "options": [
          {"text": "Production", "value": "prod"},
          {"text": "Staging", "value": "staging"}
        ]
      },
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up{env=\"$environment\"}, instance)"
      }
    ]
  }
}
```
Dashboard anti-patterns to avoid:
- Too many panels (limit to 20 per dashboard)
- Overuse of colors (use a traffic-light system: green = OK, yellow = warning, red = critical)
- Displaying numbers without context (never show a current value alone — always pair it with a trend line)
## On-call System Design Tips
The most common mistake when first setting up on-call is configuring "everyone receives every alert."
Escalation policy example (PagerDuty/OpsGenie):
```
Primary On-call:
  - Receives P1/P2 alerts immediately
  - No acknowledgement within 5 min → escalate to Secondary

Secondary On-call:
  - Receives the alert if Primary does not respond
  - No acknowledgement within 10 min → escalate to manager

Manager:
  - Alerted after 15+ minutes of no response
  - Assesses business impact and summons additional engineers

Rotation cadence: 1 week (2+ weeks risks burnout)
Compensation: on-call is unsustainable without explicit recognition and compensation
```
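The timing rules above reduce to a small function of minutes-without-acknowledgement. A sketch — the tier names and the 5/15-minute marks mirror the policy, but the function itself is illustrative:

```python
def escalation_targets(minutes_unacked: int) -> list[str]:
    """Who has been paged, given minutes since the alert with no ack."""
    targets = ["primary"]
    if minutes_unacked >= 5:
        targets.append("secondary")   # primary silent for 5 min
    if minutes_unacked >= 15:
        targets.append("manager")     # still unacknowledged at 15 min
    return targets

print(escalation_targets(3))    # ['primary']
print(escalation_targets(7))    # ['primary', 'secondary']
print(escalation_targets(20))   # ['primary', 'secondary', 'manager']
```

Writing the policy down this explicitly — in a tool or in code — is what prevents the "everyone receives every alert" failure mode.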
Minimizing on-call fatigue:
```
# Post-alert retrospective checklist
# 1. Did this alert actually require action? (Yes/No)
# 2. Was the alert threshold appropriate?
# 3. Was there a runbook? Was it helpful?
# 4. Could this have been resolved automatically?
# 5. What must be done to prevent recurrence?
```
## Writing Runbooks
A runbook is a documented response procedure for an incident. It must be simple enough for an engineer to follow at 3 a.m. after being woken from sleep.
Required runbook sections:
# Runbook: High Error Rate on /api/orders
## Alert Condition
- Alert name: OrderServiceHighErrorRate
- Trigger: error_rate > 5% for 3 minutes
## Immediate Diagnostic Commands
```bash
# 1. Check current error rate
curl -s http://prometheus:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total{status=~"5..",job="order-service"}[5m])' \
| jq '.data.result[].value[1]'
# 2. Check recent error logs (Kibana query)
# index: tomcat-error-*, filter: log_level:ERROR AND logger:*OrderController*
# 3. Check DB connection status
docker exec -it order-service-db mysql -u root -p \
-e "SHOW STATUS LIKE 'Threads_connected'; SHOW PROCESSLIST;"
# 4. Check recent deployment history
kubectl rollout history deployment/order-service
```

## Possible Causes and Remediation
| Cause | Symptoms | Remediation |
|---|---|---|
| DB connection pool exhausted | DB-related errors, response time spike | Increase pool size, check slow queries |
| External API failure | Only specific endpoints fail | Check circuit breaker, enable fallback |
| Out of memory | OOM errors, GC overload | Restart instance, add memory |
| Bad code deployment | Error spike immediately after deploy | Roll back to previous version |
## Escalation
- Not resolved within 15 min → page team lead
- DB failure → contact DBA team
- External service failure → check vendor emergency contact
## Log Retention Policy and Cost Optimization
Log storage costs grow faster than most teams expect. A tiered storage strategy is essential.
```
Hot Tier   (last 7 days):  SSD, fast search           → Elasticsearch Hot nodes
Warm Tier  (7–30 days):    HDD, slower search OK      → Elasticsearch Warm nodes
Cold Tier  (30–90 days):   object storage             → S3/GCS (Elasticsearch Frozen Index)
Archive    (90+ days):     long-term, rarely queried  → Glacier/Coldline
```
**Recommended retention periods by log type:**
| Log Type | Retention | Reason |
|---|---|---|
| Access logs | 90 days | Security audit, traffic analysis |
| Error/exception logs | 1 year | Incident reproduction, compliance |
| Payment/transaction logs | 5 years+ | Accounting audit, legal requirements |
| Debug logs | 7 days | Development use only, no long-term value |
| Security events | 2 years+ | Security audit, breach investigation |
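The tiering policy maps cleanly to a function of log age. A sketch — the tier names and day boundaries are the ones listed above:

```python
def storage_tier(age_days: int) -> str:
    """Map a log event's age to the storage tier it belongs in."""
    if age_days < 7:
        return "hot"       # SSD, fast search
    if age_days < 30:
        return "warm"      # HDD, slower search acceptable
    if age_days < 90:
        return "cold"      # object storage
    return "archive"       # Glacier/Coldline, rarely queried

print([storage_tier(d) for d in (3, 10, 45, 200)])
# ['hot', 'warm', 'cold', 'archive']
```

In Elasticsearch this logic lives in an ILM policy rather than application code, but the boundaries should come from one place so the policy and the documentation never disagree.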
**ILM-based cost optimization:**
```bash
# Check disk usage per index
curl -u elastic:changeme123! \
"http://localhost:9200/_cat/indices?v&s=store.size:desc&h=index,store.size,pri.store.size,docs.count"
# Force merge to reduce storage on warm-tier indices
curl -X POST "http://localhost:9200/nginx-access-2026.01.*/_forcemerge?max_num_segments=1" \
-u elastic:changeme123!
# Manually delete old indices (when ILM policy is not configured)
curl -X DELETE "http://localhost:9200/nginx-access-2025.12.*" \
  -u elastic:changeme123!
```
Cost reduction tips:
- Sampling: For high-traffic services, collect only 10–20% of access logs (always collect 100% of error logs).
- Field pruning: Remove unnecessary fields in Logstash before indexing.
- Index compression: Run force merge to a single segment when moving indices to the Warm or Cold tier.
- SLO over alerts: Rather than alerting on everything, focus monitoring on whether SLO targets are being met.
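The sampling tip works best when it is deterministic — hash a request ID so the same request is always kept or dropped no matter which node sees it. A sketch; the record shape and the `keep_log` helper are illustrative:

```python
import hashlib

def keep_log(record: dict, sample_rate: float = 0.10) -> bool:
    """Keep every ERROR log; keep roughly sample_rate of everything else."""
    if record.get("level") == "ERROR":
        return True  # never drop error logs
    digest = hashlib.md5(record["request_id"].encode()).digest()
    # First digest byte is a uniform 0..255 bucket
    return digest[0] < 256 * sample_rate

logs = [{"request_id": f"req-{i}", "level": "INFO"} for i in range(10_000)]
kept = sum(keep_log(r) for r in logs)
print(f"kept {kept} of 10000 INFO logs (~10%)")
```

Hash-based sampling also keeps all log lines of one request together, which random per-line sampling does not.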