# Pro Tips — Monitoring Design Strategy

Installing monitoring tools correctly and designing monitoring well are two entirely different things. It is surprisingly common to see systems where thousands of alerts fire yet critical incidents go unnoticed, or where elaborate dashboards exist that provide no useful signal during an outage. This chapter covers monitoring design strategies validated in real production environments.

## Reducing Alert Fatigue

Alert fatigue is the phenomenon where engineers who are bombarded with too many notifications begin to silence or ignore even the important ones. Google's SRE literature has repeatedly observed that when alert volume grows too high, teams start disabling alerts rather than addressing them.

### Alert Priority Matrix

Classify every alert into one of four priority levels.

| Priority | Definition | Examples | Response |
|---|---|---|---|
| P1 - Critical | Immediate service outage impact | Full server down, DB unreachable | Immediate phone/SMS + page |
| P2 - High | Severe service quality degradation | Error rate > 5%, p99 > 10s | Slack + respond within 15 min |
| P3 - Medium | Potential issue, no immediate impact | Disk > 80%, memory warning | Slack + respond during business hours |
| P4 - Low | Informational / trend tracking | Deploy completed, cert expires in 30 days | Daily digest report |
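A matrix like this is easy to encode as data, so the routing policy can live next to tests. A hypothetical sketch (the channel names are illustrative placeholders; real routing would live in Alertmanager or your paging tool):

```python
# alert_routing.py — encode the priority matrix as data, route alerts from it
# Hypothetical sketch: channel names are illustrative placeholders.

PRIORITY_MATRIX = {
    "P1": {"channel": "phone+page",   "respond_within_min": 0},
    "P2": {"channel": "slack",        "respond_within_min": 15},
    "P3": {"channel": "slack",        "respond_within_min": 8 * 60},  # business hours
    "P4": {"channel": "daily-digest", "respond_within_min": 24 * 60},
}

def route(alert: dict) -> dict:
    """Look up the routing policy; unknown or missing priorities fall back to P4."""
    return PRIORITY_MATRIX.get(alert.get("priority"), PRIORITY_MATRIX["P4"])

if __name__ == "__main__":
    print(route({"name": "DBUnreachable", "priority": "P1"}))
    print(route({"name": "CertExpiring"}))  # no priority label -> daily digest
```

Keeping the policy in one table makes "who gets woken up for what" reviewable in a pull request rather than scattered across tool configs.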

### Threshold-Setting Principles

Poorly chosen thresholds are the leading cause of alert fatigue.

```yaml
# Wrong — absolute threshold (fires constantly at certain hours)
- alert: HighResponseTime
  expr: http_response_time_seconds > 0.5

# Correct — detect anomaly relative to a dynamic baseline
- alert: AnomalousResponseTime
  expr: |
    http_response_time_seconds
      > on (job) group_left()
    (avg_over_time(http_response_time_seconds[1h] offset 1w) * 2)
  for: 5m
  annotations:
    summary: "Response time is 2x higher than same time last week"

# Correct — rate-of-change based
- alert: ResponseTimeSpike
  expr: |
    rate(http_request_duration_seconds_sum[5m])
      / rate(http_request_duration_seconds_count[5m])
    > 2
  for: 3m
  labels:
    severity: warning
```

Golden rules for threshold setting:

  • Always use a `for` clause: prevents false positives from momentary spikes. Require the condition to persist for at least 2–5 minutes.
  • Set thresholds after collecting sufficient history: analyze at least two weeks of data and base thresholds on p95/p99 values.
  • The goal is fewer alerts: if you have too many alerts, silence half of them; if still too many, silence half again.
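The second rule can be made concrete. A hypothetical helper that derives a starting threshold from historical samples (the sample data and the 1.5x headroom factor are illustrative choices, not recommendations):

```python
# threshold_from_history.py — derive a starting alert threshold from history
# Hypothetical helper: sample data and the 1.5x headroom factor are illustrative.
import statistics

def suggest_threshold(samples: list[float], percentile: int = 95,
                      headroom: float = 1.5) -> float:
    """Return p<percentile> of the samples multiplied by a headroom factor."""
    cuts = statistics.quantiles(samples, n=100)  # cut points p1..p99
    return cuts[percentile - 1] * headroom       # p95 -> index 94

if __name__ == "__main__":
    # Two weeks of (simulated) response times in seconds
    history = [0.12, 0.15, 0.11, 0.14, 0.30, 0.13, 0.16, 0.12, 0.45, 0.14] * 200
    print(f"suggested threshold: {suggest_threshold(history):.2f}s")
```

The output is only a starting point; review it against known traffic patterns before wiring it into an alert.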

## SLO, SLI, and SLA — Defining Monitoring Objectives

### Definitions

SLA (Service Level Agreement) is a contract between the service provider and the customer. Example: "Provide a refund if monthly uptime falls below 99.9%."

SLO (Service Level Objective) is the internal target the team commits to achieving. Set slightly stricter than the SLA to create a safety buffer. Example: "Maintain 99.95% monthly uptime."

SLI (Service Level Indicator) is the actual measurement used to track the SLO. Example: "Successful requests / total requests."

```
SLA: 99.9%  (contract with customer — breach means penalty)
SLO: 99.95% (internal target — includes safety buffer)
SLI: actual measured value (computed via Prometheus, etc.)
```

### Computing SLIs with Prometheus

```yaml
# prometheus/rules/slo.yml
groups:
- name: slo-rules
  interval: 30s
  rules:
  # ─── SLI: Availability ───────────────────────────────────────────────
  # Success rate excluding 5xx responses
  - record: job:http_requests:success_rate5m
    expr: |
      sum(rate(http_requests_total{status!~"5.."}[5m])) by (job)
        /
      sum(rate(http_requests_total[5m])) by (job)

  # ─── SLI: Latency ────────────────────────────────────────────────────
  # Fraction of requests served in under 300ms
  - record: job:http_requests:latency_ok_rate5m
    expr: |
      sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
        /
      sum(rate(http_request_duration_seconds_count[5m])) by (job)

  # ─── SLO breach alert ────────────────────────────────────────────────
  - alert: SLOAvailabilityBreach
    expr: job:http_requests:success_rate5m < 0.9995  # below 99.95%
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "SLO breach: availability {{ $value | humanizePercentage }}"
      description: "Service {{ $labels.job }} availability dropped below 99.95%"
```

### Error Budget-Based Alerting Strategy

The error budget is the total amount of failure that can be tolerated while still meeting the SLO.

```
Monthly SLO: 99.9%
Total minutes in a 30-day month: 43,200
Allowed downtime = 43,200 × (1 - 0.999) = 43.2 minutes

When the error budget is exhausted:
- Halt new feature deployments
- Focus entirely on stability
```
```python
#!/usr/bin/env python3
# error_budget_calculator.py
# Error budget burn rate calculator

def calculate_error_budget(slo_target: float, total_requests: int,
                           failed_requests: int) -> dict:
    """
    slo_target: SLO target as a decimal (e.g., 0.999 = 99.9%)
    total_requests: total number of requests in the period
    failed_requests: number of failed requests in the period
    """
    allowed_failures = total_requests * (1 - slo_target)
    actual_success_rate = (total_requests - failed_requests) / total_requests
    error_budget_remaining = (allowed_failures - failed_requests) / allowed_failures * 100

    return {
        "slo_target": f"{slo_target * 100:.2f}%",
        "actual_success_rate": f"{actual_success_rate * 100:.4f}%",
        "allowed_failures": int(allowed_failures),
        "actual_failures": failed_requests,
        "error_budget_remaining": f"{error_budget_remaining:.1f}%",
        "status": "OK" if error_budget_remaining > 0 else "EXHAUSTED",
    }

if __name__ == "__main__":
    # Example: 800 failures out of 1 million requests this month
    result = calculate_error_budget(
        slo_target=0.999,
        total_requests=1_000_000,
        failed_requests=800,
    )
    for key, value in result.items():
        print(f"{key:30s}: {value}")

# Output:
# slo_target                    : 99.90%
# actual_success_rate           : 99.9200%
# allowed_failures              : 1000
# actual_failures               : 800
# error_budget_remaining        : 20.0%
# status                        : OK
```

Alert strategy based on error budget burn rate:

```yaml
# prometheus/rules/error-budget.yml
groups:
- name: error-budget
  rules:
  # Burning 5x faster than sustainable → warning
  - alert: ErrorBudgetBurnRateHigh
    expr: |
      (1 - job:http_requests:success_rate5m)
        / (1 - 0.999) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Error budget burning 5x faster than expected"

  # Burning 14.4x faster → critical (one hour at this rate consumes
  # 2% of a 30-day budget; the whole budget is gone in ~50 hours)
  - alert: ErrorBudgetAlmostExhausted
    expr: |
      (1 - job:http_requests:success_rate5m)
        / (1 - 0.999) > 14.4
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Error budget burning 14.4x faster than sustainable"
```
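The arithmetic behind these multipliers is worth making explicit. A small sketch, assuming a 30-day (720-hour) SLO window:

```python
# burn_rate.py — error budget burn rate arithmetic (assumes a 30-day window)

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' we are burning.

    A burn rate of 1.0 means the budget lasts exactly the SLO window.
    At 14.4, one hour consumes 2% of a 30-day budget (14.4 / 720 = 0.02).
    """
    return error_rate / (1 - slo_target)

def hours_until_exhausted(rate: float, window_hours: float = 720) -> float:
    """Time until the budget is gone if the current burn rate holds."""
    return window_hours / rate

if __name__ == "__main__":
    r = burn_rate(error_rate=0.0144, slo_target=0.999)
    print(f"burn rate: {r:.1f}x")                              # 14.4x
    print(f"budget gone in: {hours_until_exhausted(r):.0f}h")  # 50h
```

This is why the 14.4x rule fires as critical: the budget is not gone "within 1 hour", but an hour of inaction already costs 2% of the month's allowance.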

## Golden Signals

The four core metrics defined by Google's SRE team for monitoring any service.

### 1. Latency

Time taken to service a request. Always separate successful requests from failed requests. Errors often return very quickly, so including them can make latency appear lower than it actually is.
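The skew is easy to demonstrate with synthetic numbers (fabricated data, purely illustrative):

```python
# latency_mix.py — why failed requests should be excluded from latency SLIs
# Fabricated sample data, purely illustrative.
import statistics

success = [0.8, 0.9, 1.0, 1.1, 1.2] * 20  # 100 healthy requests, mean 1.0s
errors = [0.05] * 50                       # 50 5xx responses failing fast

mean_ok = statistics.mean(success)
mean_all = statistics.mean(success + errors)
print(f"mean latency, successes only:  {mean_ok:.2f}s")   # 1.00s
print(f"mean latency, errors included: {mean_all:.2f}s")  # 0.68s — looks healthier than it is
```

A third of the traffic failing makes the mixed average look 30% faster, which is exactly the wrong signal during an incident.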

```yaml
- alert: HighLatency
  expr: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])
    ) > 1.0
  for: 5m
  annotations:
    summary: "p99 latency for successful requests exceeds 1s"
```

### 2. Traffic

The amount of demand placed on the system. Measured as requests per second (RPS), bytes per second, etc.

```yaml
- alert: UnexpectedTrafficDrop
  expr: |
    rate(http_requests_total[5m])
    < (avg_over_time(rate(http_requests_total[5m])[1h:5m]) * 0.5)
  for: 5m
  annotations:
    summary: "Traffic dropped below 50% of 1-hour average — possible upstream issue"
```

### 3. Errors

The rate of requests that fail. Track both explicit errors (5xx) and implicit errors (incorrect content returned with 200 OK).

```yaml
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      /
    sum(rate(http_requests_total[5m])) by (job)
    > 0.01  # error rate exceeds 1%
  for: 3m
```

### 4. Saturation

How "full" the service is. Performance degrades as resources approach their limits.

```yaml
# CPU saturation
- alert: HighCPUSaturation
  expr: |
    100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
  for: 10m

# JVM heap saturation
- alert: JVMHeapPressure
  expr: |
    jvm_memory_used_bytes{area="heap"}
      / jvm_memory_max_bytes{area="heap"}
    > 0.9
  for: 5m
  annotations:
    summary: "JVM heap usage above 90% — GC pressure likely"
```

## Nginx/Tomcat Core Metrics Checklist

### Essential Nginx Metrics

```nginx
# nginx.conf — enable the stub_status module
server {
    listen 8080;
    location /nginx_status {
        stub_status on;
        allow 127.0.0.1;
        allow 172.16.0.0/12;  # Docker network
        deny all;
    }
}
```

| Metric | Meaning | Alert Threshold |
|---|---|---|
| `nginx_connections_active` | Currently active connections | 80% of `worker_connections` |
| `nginx_connections_waiting` | Keep-alive idle connections | High value → review timeout settings |
| `nginx_http_requests_total` | Total requests (compute RPS) | Abnormal spikes or drops |
| 5xx error rate | Server error ratio | Alert if > 1% |
| `nginx_ingress_upstream_latency` | Upstream response time | p99 > 500ms |
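The stub_status page is plain text; exporters normally parse it for you, but the format is simple enough to read directly when debugging. A sketch against a sample payload (the numbers are illustrative):

```python
# stub_status_parse.py — parse nginx stub_status output into a dict
# The sample payload below uses illustrative numbers.
import re

SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

def parse_stub_status(text: str) -> dict:
    lines = text.strip().splitlines()
    active = int(re.search(r"Active connections:\s+(\d+)", lines[0]).group(1))
    accepts, handled, requests = (int(n) for n in lines[2].split())
    rww = dict(zip(("reading", "writing", "waiting"),
                   (int(n) for n in re.findall(r"\d+", lines[3]))))
    return {"active": active, "accepts": accepts, "handled": handled,
            "requests": requests, **rww}

if __name__ == "__main__":
    print(parse_stub_status(SAMPLE))
```

Note that `accepts` < `handled` indicates nginx is dropping connections, usually a `worker_connections` limit problem.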

### Essential Tomcat Metrics

```xml
<!-- server.xml — configure connection limits -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxThreads="200"
           minSpareThreads="10"
           maxConnections="10000"
           acceptCount="100" />
```

| Metric | Meaning | Alert Threshold |
|---|---|---|
| `tomcat_threads_busy` | Currently processing threads | 80% of `maxThreads` |
| `tomcat_threads_config_max` | Max thread count | Reference only |
| `tomcat_connections_current` | Current active connections | 90% of `maxConnections` |
| `tomcat_global_request_count` | Requests processed | Monitor RPS trend |
| `tomcat_global_error_count` | Error count | Alert on increasing trend |
| JVM GC time ratio | GC overhead | Alert if > 5% of total time |
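The GC time ratio in the last row can be derived from two scrapes of the JVM's cumulative GC-time counter. A sketch (the counter values are illustrative, and exact metric names vary by exporter):

```python
# gc_overhead.py — GC overhead between two scrapes of a cumulative counter
# Counter values below are illustrative; metric names differ per exporter.

def gc_overhead_pct(gc_seconds_t0: float, gc_seconds_t1: float,
                    interval_seconds: float) -> float:
    """Fraction of wall-clock time spent in GC over the scrape interval."""
    return (gc_seconds_t1 - gc_seconds_t0) / interval_seconds * 100

if __name__ == "__main__":
    # 4.2s of GC accumulated over a 60s scrape interval
    pct = gc_overhead_pct(120.0, 124.2, 60.0)
    print(f"GC overhead: {pct:.1f}%")  # 7.0% -> above the 5% alert line
```

In PromQL the same idea is a `rate()` over the GC-time counter divided by the scrape interval, expressed as a percentage.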

## Dashboard Design Principles: Layered Views

A well-designed dashboard answers "what is wrong?" quickly. Structure dashboards across three layers.

```
Layer 1: Business View
- Order conversion rate, revenue, active users
- Audience: business executives
- Refresh: every 5 minutes

Layer 2: Application View
- Per-endpoint performance (Golden Signals)
- Error rate, latency, RPS
- Audience: dev team / SRE
- Refresh: every 1 minute

Layer 3: Infrastructure View
- CPU, memory, disk, network
- Per-server detail metrics
- Audience: infrastructure team
- Refresh: every 30 seconds
```

Using Grafana dashboard variables:

```json
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "type": "custom",
        "options": [
          {"text": "Production", "value": "prod"},
          {"text": "Staging", "value": "staging"}
        ]
      },
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up{env=\"$environment\"}, instance)"
      }
    ]
  }
}
```

Dashboard anti-patterns to avoid:

  • Too many panels (limit to 20 per dashboard)
  • Overuse of colors (use a traffic-light system: green = OK, yellow = warning, red = critical)
  • Displaying numbers without context (never show a current value alone — always pair it with a trend line)

## On-call System Design Tips

The most common mistake when first setting up on-call is configuring "everyone receives every alert."

Escalation policy example (PagerDuty/OpsGenie):

```
Primary On-call:
- Receives P1/P2 alerts immediately
- No acknowledgement within 5 min → escalate to Secondary

Secondary On-call:
- Receives alert if Primary does not respond
- No acknowledgement within 10 min → escalate to Manager

Manager:
- Alerted after 15+ minutes of no response
- Assesses business impact and summons additional engineers

Rotation cadence: 1 week (2+ weeks risks burnout)
Compensation: on-call is unsustainable without explicit recognition and compensation
```
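An escalation ladder is just a time table, which makes it easy to encode and test before trusting it with real pages. A hypothetical encoding of the Primary → Secondary → Manager policy above:

```python
# escalation.py — who should have been paged, given minutes since the alert fired
# Hypothetical encoding of the Primary -> Secondary -> Manager policy.

STEPS = [(0, "primary"), (5, "secondary"), (15, "manager")]

def current_targets(minutes_unacked: float) -> list[str]:
    """Everyone who should have been paged by now (assuming no one has acked)."""
    return [who for after, who in STEPS if minutes_unacked >= after]

if __name__ == "__main__":
    print(current_targets(3))   # ['primary']
    print(current_targets(7))   # ['primary', 'secondary']
    print(current_targets(20))  # ['primary', 'secondary', 'manager']
```

PagerDuty and OpsGenie implement the same ladder natively; encoding it yourself is mainly useful for documenting and reviewing the policy.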

Minimizing on-call fatigue:

```
# Post-alert retrospective checklist
# 1. Did this alert actually require action? (Yes/No)
# 2. Was the alert threshold appropriate?
# 3. Was there a runbook? Was it helpful?
# 4. Could this have been resolved automatically?
# 5. What must be done to prevent recurrence?
```

## Writing Runbooks

A runbook is a documented response procedure for an incident. It must be simple enough for an engineer to follow at 3 a.m. after being woken from sleep.

Required runbook sections:

# Runbook: High Error Rate on /api/orders

## Alert Condition
- Alert name: OrderServiceHighErrorRate
- Trigger: error_rate > 5% for 3 minutes

## Immediate Diagnostic Commands
```bash
# 1. Check current error rate
curl -s http://prometheus:9090/api/v1/query \
--data-urlencode 'query=rate(http_requests_total{status=~"5..",job="order-service"}[5m])' \
| jq '.data.result[].value[1]'

# 2. Check recent error logs (Kibana query)
# index: tomcat-error-*, filter: log_level:ERROR AND logger:*OrderController*

# 3. Check DB connection status
docker exec -it order-service-db mysql -u root -p \
-e "SHOW STATUS LIKE 'Threads_connected'; SHOW PROCESSLIST;"

# 4. Check recent deployment history
kubectl rollout history deployment/order-service
```

## Possible Causes and Remediation

| Cause | Symptoms | Remediation |
|---|---|---|
| DB connection pool exhausted | DB-related errors, response time spike | Increase pool size, check slow queries |
| External API failure | Only specific endpoints fail | Check circuit breaker, enable fallback |
| Out of memory | OOM errors, GC overload | Restart instance, add memory |
| Bad code deployment | Error spike immediately after deploy | Roll back to previous version |

## Escalation

  • Not resolved within 15 min → page team lead
  • DB failure → contact DBA team
  • External service failure → check vendor emergency contact

## Log Retention Policy and Cost Optimization

Log storage costs grow faster than most teams expect. A tiered storage strategy is essential.

```
Hot Tier  (last 7 days): SSD, fast search              → Elasticsearch Hot nodes
Warm Tier (7–30 days):   HDD, slower search acceptable → Elasticsearch Warm nodes
Cold Tier (30–90 days):  Object storage                → S3/GCS (Elasticsearch Frozen Index)
Archive   (90+ days):    Glacier/Coldline              → long-term retention, rarely queried
```


**Recommended retention periods by log type:**

| Log Type | Retention | Reason |
|---|---|---|
| Access logs | 90 days | Security audit, traffic analysis |
| Error/exception logs | 1 year | Incident reproduction, compliance |
| Payment/transaction logs | 5 years+ | Accounting audit, legal requirements |
| Debug logs | 7 days | Development use only, no long-term value |
| Security events | 2 years+ | Security audit, breach investigation |
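A back-of-the-envelope cost model helps size the tiers before committing. A sketch in which every per-GB-month price is a placeholder assumption, not a vendor quote:

```python
# log_cost_model.py — rough steady-state monthly cost of a tiered retention policy
# All per-GB-month prices are placeholder assumptions, not vendor quotes.

TIERS = [
    # (name, days spent in tier, $ per GB-month)
    ("hot (SSD)",      7, 0.10),
    ("warm (HDD)",    23, 0.03),
    ("cold (object)", 60, 0.01),
]

def monthly_cost(gb_per_day: float) -> float:
    """Steady state: each tier continuously holds gb_per_day * days_in_tier GB."""
    return sum(gb_per_day * days * price for _, days, price in TIERS)

if __name__ == "__main__":
    # e.g. 50 GB of logs ingested per day
    print(f"~${monthly_cost(50):.2f}/month")  # ~$99.50/month
```

Even with made-up prices, the model shows the lever that matters: most of the bill comes from how long data sits on expensive storage, not from total volume.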

**ILM-based cost optimization:**

```bash
# Check disk usage per index
curl -u elastic:changeme123! \
  "http://localhost:9200/_cat/indices?v&s=store.size:desc&h=index,store.size,pri.store.size,docs.count"

# Force merge to reduce storage on warm-tier indices
curl -X POST "http://localhost:9200/nginx-access-2026.01.*/_forcemerge?max_num_segments=1" \
  -u elastic:changeme123!

# Manually delete old indices (when no ILM policy is configured)
curl -X DELETE "http://localhost:9200/nginx-access-2025.12.*" \
  -u elastic:changeme123!
```

**Cost reduction tips:**

  • Sampling: For high-traffic services, collect only 10–20% of access logs (always collect 100% of error logs).
  • Field pruning: Remove unnecessary fields in Logstash before indexing.
  • Index compression: Run force merge to a single segment when moving indices to the Warm or Cold tier.
  • SLO over alerts: Rather than alerting on everything, focus monitoring on whether SLO targets are being met.
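The sampling tip can be sketched as a simple filter that always keeps errors and probabilistically keeps everything else (the 10% rate and the `level` field name are assumptions for illustration; in practice this logic would live in the log shipper, e.g. a Logstash or Fluent Bit filter):

```python
# log_sampler.py — keep 100% of error logs, sample the rest
# The 10% rate and the `level` field name are assumptions for illustration.
import random

def keep(log: dict, sample_rate: float = 0.10, rng=random.random) -> bool:
    if log.get("level") in ("ERROR", "FATAL"):
        return True                 # never drop errors
    return rng() < sample_rate      # probabilistically keep everything else

if __name__ == "__main__":
    random.seed(42)
    logs = [{"level": "INFO"}] * 1000 + [{"level": "ERROR"}] * 10
    kept = [entry for entry in logs if keep(entry)]
    errors_kept = sum(1 for entry in kept if entry["level"] == "ERROR")
    print(f"kept {len(kept)}/{len(logs)} entries; errors kept: {errors_kept}/10")
```

The key property to verify is asymmetry: volume drops by roughly the sample rate while error coverage stays at 100%, so incident investigation is never starved of evidence.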