# Pro Tips — Monitoring Design Strategy

Installing monitoring tools correctly and designing monitoring well are two entirely different things. It is surprisingly common to see systems where thousands of alerts fire yet critical incidents go unnoticed, or where elaborate dashboards exist that provide no useful signal during an outage. This chapter covers monitoring design strategies validated in real production environments.

## Reducing Alert Fatigue

Alert fatigue is the phenomenon where engineers who are bombarded with too many notifications begin to silence or ignore even the important ones. Google's SRE literature has repeatedly observed that when alert volume grows too high, teams start disabling alerts rather than addressing them.

### Alert Priority Matrix

Classify every alert into one of four priority levels.

| Priority | Definition | Examples | Response |
|---|---|---|---|
| P1 - Critical | Immediate service outage impact | Full server down, DB unreachable | Immediate phone/SMS + page |
| P2 - High | Severe service quality degradation | Error rate > 5%, p99 > 10s | Slack + respond within 15 min |
| P3 - Medium | Potential issue, no immediate impact | Disk > 80%, memory warning | Slack + respond during business hours |
| P4 - Low | Informational / trend tracking | Deploy completed, cert expires in 30 days | Daily digest report |
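A matrix like this is easy to encode as data, so the routing policy can live next to tests. A hypothetical sketch (the channel names are illustrative placeholders; real routing would live in Alertmanager or your paging tool):

```python
# alert_routing.py — encode the priority matrix as data, route alerts from it
# Hypothetical sketch: channel names are illustrative placeholders.

PRIORITY_MATRIX = {
    "P1": {"channel": "phone+page",   "respond_within_min": 0},
    "P2": {"channel": "slack",        "respond_within_min": 15},
    "P3": {"channel": "slack",        "respond_within_min": 8 * 60},  # business hours
    "P4": {"channel": "daily-digest", "respond_within_min": 24 * 60},
}

def route(alert: dict) -> dict:
    """Look up the routing policy; unknown or missing priorities fall back to P4."""
    return PRIORITY_MATRIX.get(alert.get("priority"), PRIORITY_MATRIX["P4"])

if __name__ == "__main__":
    print(route({"name": "DBUnreachable", "priority": "P1"}))
    print(route({"name": "CertExpiring"}))  # no priority label -> daily digest
```

Keeping the policy in one table makes "who gets woken up for what" reviewable in a pull request rather than scattered across tool configs.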

### Threshold-Setting Principles

Poorly chosen thresholds are the leading cause of alert fatigue.

```yaml
# Wrong — absolute threshold (fires constantly at certain hours)
- alert: HighResponseTime
  expr: http_response_time_seconds > 0.5

# Correct — detect anomaly relative to a dynamic baseline
- alert: AnomalousResponseTime
  expr: |
    http_response_time_seconds
      > on (job) group_left()
    (avg_over_time(http_response_time_seconds[1h] offset 1w) * 2)
  for: 5m
  annotations:
    summary: "Response time is 2x higher than same time last week"

# Correct — rate-of-change based
- alert: ResponseTimeSpike
  expr: |
    rate(http_request_duration_seconds_sum[5m])
      / rate(http_request_duration_seconds_count[5m])
    > 2
  for: 3m
  labels:
    severity: warning
```

Golden rules for threshold setting:

  • Always use a `for` clause: prevents false positives from momentary spikes. Require the condition to persist for at least 2–5 minutes.
  • Set thresholds after collecting sufficient history: analyze at least two weeks of data and base thresholds on p95/p99 values.
  • The goal is fewer alerts: if you have too many alerts, silence half of them; if still too many, silence half again.
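The second rule can be made concrete. A hypothetical helper that derives a starting threshold from historical samples (the sample data and the 1.5x headroom factor are illustrative choices, not recommendations):

```python
# threshold_from_history.py — derive a starting alert threshold from history
# Hypothetical helper: sample data and the 1.5x headroom factor are illustrative.
import statistics

def suggest_threshold(samples: list[float], percentile: int = 95,
                      headroom: float = 1.5) -> float:
    """Return p<percentile> of the samples multiplied by a headroom factor."""
    cuts = statistics.quantiles(samples, n=100)  # cut points p1..p99
    return cuts[percentile - 1] * headroom       # p95 -> index 94

if __name__ == "__main__":
    # Two weeks of (simulated) response times in seconds
    history = [0.12, 0.15, 0.11, 0.14, 0.30, 0.13, 0.16, 0.12, 0.45, 0.14] * 200
    print(f"suggested threshold: {suggest_threshold(history):.2f}s")
```

The output is only a starting point; review it against known traffic patterns before wiring it into an alert.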

## SLO, SLI, and SLA — Defining Monitoring Objectives

### Definitions

SLA (Service Level Agreement) is a contract between the service provider and the customer. Example: "Provide a refund if monthly uptime falls below 99.9%."

SLO (Service Level Objective) is the internal target the team commits to achieving. Set slightly stricter than the SLA to create a safety buffer. Example: "Maintain 99.95% monthly uptime."

SLI (Service Level Indicator) is the actual measurement used to track the SLO. Example: "Successful requests / total requests."

```
SLA: 99.9%  (contract with customer — breach means penalty)
SLO: 99.95% (internal target — includes safety buffer)
SLI: actual measured value (computed via Prometheus, etc.)
```

### Computing SLIs with Prometheus

```yaml
# prometheus/rules/slo.yml
groups:
- name: slo-rules
  interval: 30s
  rules:
  # ─── SLI: Availability ───────────────────────────────────────────────
  # Success rate excluding 5xx responses
  - record: job:http_requests:success_rate5m
    expr: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) by (job)
        /
      sum(rate(http_requests_total[5m])) by (job)

  # ─── SLI: Latency ────────────────────────────────────────────────────
  # Fraction of requests served in under 300ms
  - record: job:http_requests:latency_ok_rate5m
    expr: |
      sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
        /
      sum(rate(http_request_duration_seconds_count[5m])) by (job)

  # ─── SLO breach alert ────────────────────────────────────────────────
  - alert: SLOAvailabilityBreach
    expr: job:http_requests:success_rate5m < 0.9995  # below 99.95%
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "SLO breach: availability {{ $value | humanizePercentage }}"
      description: "Service {{ $labels.job }} availability dropped below 99.95%"
```

### Error Budget-Based Alerting Strategy

The error budget is the total amount of failure that can be tolerated while still meeting the SLO.

```
Monthly SLO: 99.9%
Total minutes in a 30-day month: 43,200
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes

When the error budget is exhausted:
- Halt new feature deployments
- Focus entirely on stability
```
```python
#!/usr/bin/env python3
# error_budget_calculator.py
# Error budget burn rate calculator

def calculate_error_budget(slo_target: float, total_requests: int,
                           failed_requests: int) -> dict:
    """
    slo_target: SLO target as a decimal (e.g., 0.999 = 99.9%)
    total_requests: total number of requests in the period
    failed_requests: number of failed requests in the period
    """
    allowed_failures = total_requests * (1 - slo_target)
    actual_success_rate = (total_requests - failed_requests) / total_requests
    error_budget_remaining = (allowed_failures - failed_requests) / allowed_failures * 100

    return {
        "slo_target": f"{slo_target * 100:.2f}%",
        "actual_success_rate": f"{actual_success_rate * 100:.4f}%",
        "allowed_failures": int(allowed_failures),
        "actual_failures": failed_requests,
        "error_budget_remaining": f"{error_budget_remaining:.1f}%",
        "status": "OK" if error_budget_remaining > 0 else "EXHAUSTED",
    }

if __name__ == "__main__":
    # Example: 800 failures out of 1 million requests this month
    result = calculate_error_budget(
        slo_target=0.999,
        total_requests=1_000_000,
        failed_requests=800,
    )
    for key, value in result.items():
        print(f"{key:30s}: {value}")

# Output:
# slo_target                    : 99.90%
# actual_success_rate           : 99.9200%
# allowed_failures              : 1000
# actual_failures               : 800
# error_budget_remaining        : 20.0%
# status                        : OK
```

Alert strategy based on error budget burn rate:

```yaml
# prometheus/rules/error-budget.yml
groups:
- name: error-budget
  rules:
  # Burning 5x faster than sustainable → warning
  - alert: ErrorBudgetBurnRateHigh
    expr: |
      (1 - job:http_requests:success_rate5m)
        / (1 - 0.999) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Error budget burning 5x faster than expected"

  # Burning 14.4x faster → critical (one hour at this rate consumes
  # 2% of a 30-day budget; the whole budget is gone in ~50 hours)
  - alert: ErrorBudgetAlmostExhausted
    expr: |
      (1 - job:http_requests:success_rate5m)
        / (1 - 0.999) > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error budget burning 14.4x faster than sustainable"
```
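The arithmetic behind these multipliers is worth making explicit. A small sketch, assuming a 30-day (720-hour) SLO window:

```python
# burn_rate.py — error budget burn rate arithmetic (assumes a 30-day window)

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' we are burning.

    A burn rate of 1.0 means the budget lasts exactly the SLO window.
    At 14.4, one hour consumes 2% of a 30-day budget (14.4 / 720 = 0.02).
    """
    return error_rate / (1 - slo_target)

def hours_until_exhausted(rate: float, window_hours: float = 720) -> float:
    """Time until the budget is gone if the current burn rate holds."""
    return window_hours / rate

if __name__ == "__main__":
    r = burn_rate(error_rate=0.0144, slo_target=0.999)
    print(f"burn rate: {r:.1f}x")                              # 14.4x
    print(f"budget gone in: {hours_until_exhausted(r):.0f}h")  # 50h
```

This is why the 14.4x rule fires as critical: the budget is not gone "within 1 hour", but an hour of inaction already costs 2% of the month's allowance.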

## Golden Signals

The four core metrics defined by Google's SRE team for monitoring any service.

### 1. Latency

Time taken to service a request. Always separate successful requests from failed requests. Errors often return very quickly, so including them can make latency appear lower than it actually is.
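The skew is easy to demonstrate with synthetic numbers (fabricated data, purely illustrative):

```python
# latency_mix.py — why failed requests should be excluded from latency SLIs
# Fabricated sample data, purely illustrative.
import statistics

success = [0.8, 0.9, 1.0, 1.1, 1.2] * 20  # 100 healthy requests, mean 1.0s
errors = [0.05] * 50                       # 50 5xx responses failing fast

mean_ok = statistics.mean(success)
mean_all = statistics.mean(success + errors)
print(f"mean latency, successes only:  {mean_ok:.2f}s")   # 1.00s
print(f"mean latency, errors included: {mean_all:.2f}s")  # 0.68s — looks healthier than it is
```

A third of the traffic failing makes the mixed average look 30% faster, which is exactly the wrong signal during an incident.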

```yaml
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])
    ) > 1.0
  for: 5m
  annotations:
    summary: "p99 latency for successful requests exceeds 1s"
```

### 2. Traffic

The amount of demand placed on the system. Measured as requests per second (RPS), bytes per second, etc.

```yaml
- alert: UnexpectedTrafficDrop
  expr: |
    rate(http_requests_total[5m])
    < (avg_over_time(rate(http_requests_total[5m])[1h:5m]) * 0.5)
  for: 5m
  annotations:
    summary: "Traffic dropped below 50% of 1-hour average — possible upstream issue"
```

### 3. Errors

The rate of requests that fail. Track both explicit errors (5xx) and implicit errors (incorrect content returned with 200 OK).

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      /
    sum(rate(http_requests_total[5m])) by (job)
    > 0.01  # error rate exceeds 1%
  for: 3m
```

### 4. Saturation

How "full" the service is. Performance degrades as resources approach their limits.

```yaml
# CPU saturation
- alert: HighCPUSaturation
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
  for: 10m

# JVM heap saturation
- alert: JVMHeapPressure
  expr: |
    jvm_memory_used_bytes{area="heap"}
      / jvm_memory_max_bytes{area="heap"}
    > 0.9
  for: 5m
  annotations:
    summary: "JVM heap usage above 90% — GC pressure likely"
```

## Nginx/Tomcat Core Metrics Checklist

### Essential Nginx Metrics

```nginx
# nginx.conf — enable the stub_status module
server {
    listen 8080;
    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        allow 172.16.0.0/12;  # Docker network
        deny all;
    }
}
```

| Metric | Meaning | Alert Threshold |
|---|---|---|
| `nginx_connections_active` | Currently active connections | 80% of `worker_connections` |
| `nginx_connections_waiting` | Keep-alive idle connections | High value → review timeout settings |
| `nginx_http_requests_total` | Total requests (compute RPS) | Abnormal spikes or drops |
| 5xx error rate | Server error ratio | Alert if > 1% |
| `nginx_ingress_upstream_latency` | Upstream response time | p99 > 500ms |
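The stub_status page is plain text; exporters normally parse it for you, but the format is simple enough to read directly when debugging. A sketch against a sample payload (the numbers are illustrative):

```python
# stub_status_parse.py — parse nginx stub_status output into a dict
# The sample payload below uses illustrative numbers.
import re

SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text: str) -> dict:
    lines = text.strip().splitlines()
    active = int(re.search(r"Active connections:\s+(\d+)", lines[0]).group(1))
    accepts, handled, requests = (int(n) for n in lines[2].split())
    rww = dict(zip(("reading", "writing", "waiting"),
                   (int(n) for n in re.findall(r"\d+", lines[3]))))
    return {"active": active, "accepts": accepts, "handled": handled,
            "requests": requests, **rww}

if __name__ == "__main__":
    print(parse_stub_status(SAMPLE))
```

Note that `accepts` < `handled` indicates nginx is dropping connections, usually a `worker_connections` limit problem.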

### Essential Tomcat Metrics

```xml
<!-- server.xml — configure connection limits -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="200"
           minSpareThreads="10"
           maxConnections="10000"
           acceptCount="100" />
```

| Metric | Meaning | Alert Threshold |
|---|---|---|
| `tomcat_threads_busy` | Currently processing threads | 80% of `maxThreads` |
| `tomcat_threads_config_max` | Max thread count | Reference only |
| `tomcat_connections_current` | Current active connections | 90% of `maxConnections` |
| `tomcat_global_request_count` | Requests processed | Monitor RPS trend |
| `tomcat_global_error_count` | Error count | Alert on increasing trend |
| JVM GC time ratio | GC overhead | Alert if > 5% of total time |
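The GC time ratio in the last row can be derived from two scrapes of the JVM's cumulative GC-time counter. A sketch (the counter values are illustrative, and exact metric names vary by exporter):

```python
# gc_overhead.py — GC overhead between two scrapes of a cumulative counter
# Counter values below are illustrative; metric names differ per exporter.

def gc_overhead_pct(gc_seconds_t0: float, gc_seconds_t1: float,
                    interval_seconds: float) -> float:
    """Fraction of wall-clock time spent in GC over the scrape interval."""
    return (gc_seconds_t1 - gc_seconds_t0) / interval_seconds * 100

if __name__ == "__main__":
    # 4.2s of GC accumulated over a 60s scrape interval
    pct = gc_overhead_pct(120.0, 124.2, 60.0)
    print(f"GC overhead: {pct:.1f}%")  # 7.0% -> above the 5% alert line
```

In PromQL the same idea is a `rate()` over the GC-time counter divided by the scrape interval, expressed as a percentage.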

## Dashboard Design Principles: Layered Views

A well-designed dashboard answers "what is wrong?" quickly. Structure dashboards across three layers.

```
Layer 1: Business View
- Order conversion rate, revenue, active users
- Audience: business executives
- Refresh: every 5 minutes

Layer 2: Application View
- Per-endpoint performance (Golden Signals)
- Error rate, latency, RPS
- Audience: dev team / SRE
- Refresh: every 1 minute

Layer 3: Infrastructure View
- CPU, memory, disk, network
- Per-server detail metrics
- Audience: infrastructure team
- Refresh: every 30 seconds
```

Using Grafana dashboard variables:

```json
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "type": "custom",
        "options": [
          {"text": "Production", "value": "prod"},
          {"text": "Staging", "value": "staging"}
        ]
      },
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up{env=\"$environment\"}, instance)"
      }
    ]
  }
}
```

Dashboard anti-patterns to avoid:

  • Too many panels (limit to 20 per dashboard)
  • Overuse of colors (use a traffic-light system: green = OK, yellow = warning, red = critical)
  • Displaying numbers without context (never show a current value alone — always pair it with a trend line)

## On-call System Design Tips

The most common mistake when first setting up on-call is configuring "everyone receives every alert."

Escalation policy example (PagerDuty/OpsGenie):

```
Primary On-call:
- Receives P1/P2 alerts immediately
- No acknowledgement within 5 min → escalate to Secondary

Secondary On-call:
- Receives alert if Primary does not respond
- No acknowledgement within 10 min → escalate to Manager

Manager:
- Alerted after 15+ minutes of no response
- Assesses business impact and summons additional engineers

Rotation cadence: 1 week (2+ weeks risks burnout)
Compensation: on-call is unsustainable without explicit recognition and compensation
```
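An escalation ladder is just a time table, which makes it easy to encode and test before trusting it with real pages. A hypothetical encoding of the Primary → Secondary → Manager policy above:

```python
# escalation.py — who should have been paged, given minutes since the alert fired
# Hypothetical encoding of the Primary -> Secondary -> Manager policy.

STEPS = [(0, "primary"), (5, "secondary"), (15, "manager")]

def current_targets(minutes_unacked: float) -> list[str]:
    """Everyone who should have been paged by now (assuming no one has acked)."""
    return [who for after, who in STEPS if minutes_unacked >= after]

if __name__ == "__main__":
    print(current_targets(3))   # ['primary']
    print(current_targets(7))   # ['primary', 'secondary']
    print(current_targets(20))  # ['primary', 'secondary', 'manager']
```

PagerDuty and OpsGenie implement the same ladder natively; encoding it yourself is mainly useful for documenting and reviewing the policy.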

Minimizing on-call fatigue:

```
# Post-alert retrospective checklist
# 1. Did this alert actually require action? (Yes/No)
# 2. Was the alert threshold appropriate?
# 3. Was there a runbook? Was it helpful?
# 4. Could this have been resolved automatically?
# 5. What must be done to prevent recurrence?
```

## Writing Runbooks

A runbook is a documented response procedure for an incident. It must be simple enough for an engineer to follow at 3 a.m. after being woken from sleep.

Required runbook sections:

# Runbook: High Error Rate on /api/orders

## Alert Condition
- Alert name: OrderServiceHighErrorRate
- Trigger: error_rate > 5% for 3 minutes

## Immediate Diagnostic Commands
```bash
# 1. Check current error rate
curl -s http://prometheus:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total{status=~"5..",job="order-service"}[5m])' \
| jq '.data.result[].value[1]'

# 2. Check recent error logs (Kibana query)
# index: tomcat-error-*, filter: log_level:ERROR AND logger:*OrderController*

# 3. Check DB connection status
docker exec -it order-service-db mysql -u root -p \
-e "SHOW STATUS LIKE 'Threads_connected'; SHOW PROCESSLIST;"

# 4. Check recent deployment history
kubectl rollout history deployment/order-service
```

## Possible Causes and Remediation

| Cause | Symptoms | Remediation |
|---|---|---|
| DB connection pool exhausted | DB-related errors, response time spike | Increase pool size, check slow queries |
| External API failure | Only specific endpoints fail | Check circuit breaker, enable fallback |
| Out of memory | OOM errors, GC overload | Restart instance, add memory |
| Bad code deployment | Error spike immediately after deploy | Roll back to previous version |

## Escalation

  • Not resolved within 15 min → page team lead
  • DB failure → contact DBA team
  • External service failure → check vendor emergency contact

## Log Retention Policy and Cost Optimization

Log storage costs grow faster than most teams expect. A tiered storage strategy is essential.

```
Hot Tier  (last 7 days): SSD, fast search              → Elasticsearch Hot nodes
Warm Tier (7–30 days):   HDD, slower search acceptable → Elasticsearch Warm nodes
Cold Tier (30–90 days):  Object storage                → S3/GCS (Elasticsearch Frozen Index)
Archive   (90+ days):    Glacier/Coldline              → long-term retention, rarely queried
```


**Recommended retention periods by log type:**

| Log Type | Retention | Reason |
|---|---|---|
| Access logs | 90 days | Security audit, traffic analysis |
| Error/exception logs | 1 year | Incident reproduction, compliance |
| Payment/transaction logs | 5 years+ | Accounting audit, legal requirements |
| Debug logs | 7 days | Development use only, no long-term value |
| Security events | 2 years+ | Security audit, breach investigation |
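A back-of-the-envelope cost model helps size the tiers before committing. A sketch in which every per-GB-month price is a placeholder assumption, not a vendor quote:

```python
# log_cost_model.py — rough steady-state monthly cost of a tiered retention policy
# All per-GB-month prices are placeholder assumptions, not vendor quotes.

TIERS = [
    # (name, days spent in tier, $ per GB-month)
    ("hot (SSD)",      7, 0.10),
    ("warm (HDD)",    23, 0.03),
    ("cold (object)", 60, 0.01),
]

def monthly_cost(gb_per_day: float) -> float:
    """Steady state: each tier continuously holds gb_per_day * days_in_tier GB."""
    return sum(gb_per_day * days * price for _, days, price in TIERS)

if __name__ == "__main__":
    # e.g. 50 GB of logs ingested per day
    print(f"~${monthly_cost(50):.2f}/month")  # ~$99.50/month
```

Even with made-up prices, the model shows the lever that matters: most of the bill comes from how long data sits on expensive storage, not from total volume.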

**ILM-based cost optimization:**

```bash
# Check disk usage per index
curl -u elastic:changeme123! \
  "http://localhost:9200/_cat/indices?v&s=store.size:desc&h=index,store.size,pri.store.size,docs.count"

# Force merge to reduce storage on warm-tier indices
curl -X POST "http://localhost:9200/nginx-access-2026.01.*/_forcemerge?max_num_segments=1" \
  -u elastic:changeme123!

# Manually delete old indices (when no ILM policy is configured)
curl -X DELETE "http://localhost:9200/nginx-access-2025.12.*" \
  -u elastic:changeme123!
```

**Cost reduction tips:**

  • Sampling: For high-traffic services, collect only 10–20% of access logs (always collect 100% of error logs).
  • Field pruning: Remove unnecessary fields in Logstash before indexing.
  • Index compression: Run force merge to a single segment when moving indices to the Warm or Cold tier.
  • SLO over alerts: Rather than alerting on everything, focus monitoring on whether SLO targets are being met.
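The sampling tip can be sketched as a simple filter that always keeps errors and probabilistically keeps everything else (the 10% rate and the `level` field name are assumptions for illustration; in practice this logic would live in the log shipper, e.g. a Logstash or Fluent Bit filter):

```python
# log_sampler.py — keep 100% of error logs, sample the rest
# The 10% rate and the `level` field name are assumptions for illustration.
import random

def keep(log: dict, sample_rate: float = 0.10, rng=random.random) -> bool:
    if log.get("level") in ("ERROR", "FATAL"):
        return True                 # never drop errors
    return rng() < sample_rate      # probabilistically keep everything else

if __name__ == "__main__":
    random.seed(42)
    logs = [{"level": "INFO"}] * 1000 + [{"level": "ERROR"}] * 10
    kept = [entry for entry in logs if keep(entry)]
    errors_kept = sum(1 for entry in kept if entry["level"] == "ERROR")
    print(f"kept {len(kept)}/{len(logs)} entries; errors kept: {errors_kept}/10")
```

The key property to verify is asymmetry: volume drops by roughly the sample rate while error coverage stays at 100%, so incident investigation is never starved of evidence.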