
Pro Tips — Docker Environment Operations

Running Docker containers and running them reliably are two different things. This chapter systematically covers the techniques that production experts use: health checks, rolling updates, resource limits, security hardening, and vulnerability scanning — everything you need for stable container operations.


Health Checks: The HEALTHCHECK Directive

Even when a container's process is running, Docker has no idea whether the service inside it is actually healthy. The HEALTHCHECK directive lets Docker probe the container periodically and mark it as healthy or unhealthy.

Defining HEALTHCHECK in a Dockerfile

# Node.js app example
FROM node:20-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .

EXPOSE 8080

# Health check configuration
# (node:20-alpine has no curl — use busybox wget for the probe)
HEALTHCHECK --interval=30s \
            --timeout=10s \
            --start-period=40s \
            --retries=3 \
    CMD wget -q --spider http://localhost:8080/health || exit 1

CMD ["node", "server.js"]

# Nginx example
FROM nginx:1.25-alpine

# Alpine images lack curl — use wget instead
HEALTHCHECK --interval=30s \
            --timeout=5s \
            --start-period=10s \
            --retries=3 \
    CMD wget -q --spider http://localhost/health || exit 1
Option          Default  Description
--interval      30s      Interval between health checks
--timeout       30s      Timeout for each individual check
--start-period  0s       Grace period after start; failures during it don't count
--retries       3        Consecutive failures before the container is marked unhealthy

healthcheck Configuration in docker-compose.yml

Even without a HEALTHCHECK in the Dockerfile, you can define one in docker-compose.yml (and if the Dockerfile has one, the Compose setting overrides it):

services:
  app:
    image: my-app:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  db:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER} -d ${POSTGRES_DB}"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3

Checking health check status:

# Check container status (STATUS column shows healthy/unhealthy)
docker ps

# View detailed health check history
docker inspect --format='{{json .State.Health}}' app | python3 -m json.tool

depends_on + condition: service_healthy Pattern

Using depends_on alone only waits for the dependency container to be started — it doesn't guarantee the service inside it is ready. With condition: service_healthy, the dependent container starts only after the dependency reports healthy.

version: "3.9"

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: mydb
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 3

  app:
    image: my-app:latest
    depends_on:
      db:
        condition: service_healthy    # Start only when DB is healthy
      redis:
        condition: service_healthy    # Start only when Redis is healthy
    environment:
      DATABASE_URL: postgresql://user:password@db:5432/mydb
      REDIS_URL: redis://redis:6379

  nginx:
    image: nginx:1.25-alpine
    depends_on:
      app:
        condition: service_healthy    # Start only when app is healthy
    ports:
      - "80:80"

Rolling Updates: docker compose up --no-deps --build

A pattern for low-downtime service updates using docker compose on a single server. Note that the container is briefly recreated, so true zero-downtime requires multiple replicas behind a proxy.

# Rebuild and restart only the app (without touching other services)
docker compose up -d --no-deps --build app

# Update multiple services sequentially
docker compose up -d --no-deps --build nginx app

# Check status after update
docker compose ps
docker compose logs -f app --tail 50

  • --no-deps: don't start or restart the services it depends on (db, redis, etc. keep running)
  • --build: rebuild the image before recreating the container

Blue-Green Deployment Script

#!/bin/bash
# blue-green-deploy.sh

set -e

IMAGE_NAME="my-app"
IMAGE_TAG="${1:-latest}"
export IMAGE_TAG  # the compose file should reference my-app:${IMAGE_TAG:-latest}

echo "==> Building new image: ${IMAGE_NAME}:${IMAGE_TAG}"
docker build -t "${IMAGE_NAME}:${IMAGE_TAG}" .

echo "==> Replacing app container"
docker compose up -d --no-deps app

echo "==> Waiting for health check (up to 60 seconds)"
CID=$(docker compose ps -q app)  # Compose container names vary; resolve the ID
STATUS="unknown"
for i in $(seq 1 12); do
    STATUS=$(docker inspect --format='{{.State.Health.Status}}' "$CID" 2>/dev/null || echo "unknown")
    if [ "$STATUS" = "healthy" ]; then
        echo "==> App is healthy."
        break
    fi
    echo "    Waiting... (${i}/12) Current status: ${STATUS}"
    sleep 5
done

# Abort instead of routing traffic to an unhealthy container
if [ "$STATUS" != "healthy" ]; then
    echo "==> App did not become healthy in time. Aborting." >&2
    exit 1
fi

echo "==> Reloading Nginx"
docker compose exec nginx nginx -s reload

echo "==> Cleaning up old images"
docker image prune -f

echo "==> Deployment complete"

Docker Swarm Rolling Updates

In Docker Swarm mode, rolling updates are supported natively when updating services.

# docker-compose.swarm.yml
version: "3.9"

services:
  app:
    image: my-app:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1            # Update 1 container at a time
        delay: 10s                # Wait time between each update
        failure_action: rollback  # Auto-rollback on failure
        monitor: 60s              # Monitoring time after update
        max_failure_ratio: 0.1    # Rollback if more than 10% fail
      rollback_config:
        parallelism: 1
        delay: 5s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3

# Swarm service rolling update
docker service update \
  --image my-app:v2.0 \
  --update-parallelism 1 \
  --update-delay 10s \
  my_app

# Check update status
docker service ps my_app

# Rollback
docker service rollback my_app

Resource Limits: deploy.resources and ulimits

Containers that consume unlimited resources affect the entire host server. Resource limits are core to service stability.

version: "3.9"

services:
  app:
    image: my-app:latest
    deploy:
      resources:
        limits:
          cpus: "1.0"      # Max 1 CPU core
          memory: 512M     # Max 512MB memory
        reservations:
          cpus: "0.25"     # Minimum guaranteed CPU
          memory: 128M     # Minimum guaranteed memory
    ulimits:
      nofile:
        soft: 65535        # File descriptor limit (soft)
        hard: 65535        # File descriptor limit (hard)
      nproc:
        soft: 4096
        hard: 4096

  db:
    image: postgres:16-alpine
    shm_size: "256m"       # PostgreSQL shared memory (service-level option)
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2G
        reservations:
          cpus: "0.5"
          memory: 512M

To apply deploy.resources on a single host, use the --compatibility flag:

docker compose --compatibility up -d

Or use mem_limit and cpus fields directly in newer Compose versions:

services:
  app:
    image: my-app:latest
    mem_limit: 512m
    cpus: 1.0
    mem_reservation: 128m

Container Security Hardening

Running as Non-Root User

FROM node:20-alpine

WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production

COPY --chown=node:node . .

# Run as node user, not root
USER node

EXPOSE 8080
CMD ["node", "server.js"]

# docker-compose.yml equivalent: set the user at runtime
services:
  app:
    image: my-app:latest
    user: "1000:1000"   # Specify UID:GID directly

Read-Only Filesystem

services:
  app:
    image: my-app:latest
    read_only: true   # Container filesystem is read-only
    tmpfs:
      - /tmp          # Mount only paths needing temp files as tmpfs
      - /var/run

Limiting Linux Capabilities

services:
  nginx:
    image: nginx:1.25-alpine
    cap_drop:
      - ALL                 # Remove all capabilities
    cap_add:
      - NET_BIND_SERVICE    # Add only what's needed for port 80/443 binding
    security_opt:
      - no-new-privileges:true   # Prevent privilege escalation

Applying seccomp Profiles

services:
  app:
    image: my-app:latest
    security_opt:
      - seccomp:./seccomp/app-profile.json
      - no-new-privileges:true
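The referenced profile is a JSON allowlist of syscalls. An illustrative skeleton only — a real profile needs a far longer names list (typically derived from Docker's default profile) and additional architectures:

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["accept4", "bind", "close", "exit_group", "read", "write"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

Anything not in the allowlist fails with an error (SCMP_ACT_ERRNO), so build the list from what the app actually calls rather than guessing.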

Image Vulnerability Scanning

Docker Scout

# Scan image with Docker Scout
docker scout cves my-app:latest

# Filter by severity
docker scout cves --only-severity critical,high my-app:latest

# Show only fixable vulnerabilities
docker scout cves --only-fixable my-app:latest

# Generate SBOM (Software Bill of Materials)
docker scout sbom my-app:latest

Trivy (Open-Source Vulnerability Scanner)

# Install Trivy (Ubuntu/Debian)
# Trivy is not in the default repos — add Aqua Security's apt
# repository first (see the official Trivy install docs), then:
sudo apt-get update && sudo apt-get install -y trivy

# Scan image
trivy image my-app:latest

# Filter by severity (CRITICAL, HIGH only)
trivy image --severity CRITICAL,HIGH my-app:latest

# CI/CD pipeline mode (return result as exit code)
trivy image --exit-code 1 --severity CRITICAL my-app:latest

# Filesystem scan (includes Dockerfile, dependency files)
trivy fs .

# Generate JSON report
trivy image --format json --output report.json my-app:latest

Trivy auto-scan in CI pipeline (GitHub Actions):

# .github/workflows/security-scan.yml
name: Security Scan

on: [push, pull_request]

jobs:
  trivy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build image
        run: docker build -t my-app:${{ github.sha }} .

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: my-app:${{ github.sha }}
          format: table
          exit-code: 1
          severity: CRITICAL,HIGH

.env File Security and Docker Secrets

.env files are convenient for development, but Docker Secrets are recommended for production.

Add .env to .gitignore

.env
.env.production
.env.local
*.pem
*.key

Docker Secrets (Swarm Mode)

# Create a secret
echo "my-secret-password" | docker secret create db_password -
cat ./ssl/privkey.pem | docker secret create ssl_key -

# List secrets
docker secret ls

# docker-compose.swarm.yml
version: "3.9"

services:
  app:
    image: my-app:latest
    secrets:
      - db_password
      - ssl_key
    environment:
      # Secrets are mounted as files under /run/secrets/
      DB_PASSWORD_FILE: /run/secrets/db_password

secrets:
  db_password:
    external: true
  ssl_key:
    external: true
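The app itself still has to read the mounted file. A common pattern is an entrypoint that expands *_FILE variables into plain environment variables before starting the main process — a minimal POSIX-sh sketch (hypothetical entrypoint.sh; the DB_PASSWORD_FILE name matches the example above):

```shell
#!/bin/sh
# Expand VAR_FILE-style secrets into plain environment variables.
file_env() {
    var="$1"
    # Resolve the value of e.g. $DB_PASSWORD_FILE dynamically
    eval "file_path=\${${var}_FILE:-}"
    if [ -n "$file_path" ] && [ -f "$file_path" ]; then
        # Export the file's contents as $DB_PASSWORD etc.
        # (command substitution strips the trailing newline)
        eval "export ${var}=\"\$(cat \"\$file_path\")\""
    fi
}

file_env DB_PASSWORD

# Hand off to the container's main process (CMD)
exec "$@"
```

Set it as the image's ENTRYPOINT so the CMD still runs unchanged; the secret never appears in docker inspect output the way a plain environment variable would.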

Useful Debugging Commands

# Check status of all running containers
docker ps -a --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Execute commands inside a container
docker exec -it app sh
docker exec -it app bash

# Real-time resource usage monitoring
docker stats
docker stats --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}"

# View detailed container information
docker inspect app

# Check container IP address
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' app

# View container filesystem changes
docker diff app

# Analyze image layers
docker history my-app:latest --no-trunc

# Volume usage
docker volume ls
docker volume inspect my_volume

# Network inspection
docker network ls
docker network inspect my_network

# Clean up unused resources
docker system prune -f # Stopped containers, unused networks, dangling images
docker system prune --volumes -f # Include volumes (CAUTION: deletes data)
docker image prune -a -f # Delete all unused images

# Disk usage analysis
docker system df
docker system df -v

Production Checklist

Items to verify before deploying to production.

# Recommended production docker-compose.yml pattern
version: "3.9"

services:
  app:
    image: my-app:${IMAGE_TAG:-latest}
    restart: unless-stopped        # restart policy configured
    read_only: true                # read-only filesystem
    user: "1000:1000"              # non-root user
    security_opt:
      - no-new-privileges:true     # prevent privilege escalation
    cap_drop:
      - ALL                        # remove unnecessary capabilities
    mem_limit: 512m                # memory limit
    cpus: 1.0                      # CPU limit
    healthcheck:                   # health check configured
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s
    logging:                       # log driver and rotation
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "5"
    environment:
      - NODE_ENV=production
    env_file:
      - .env.production            # separate environment variable file
    tmpfs:
      - /tmp                       # tmpfs for temp directory
    deploy:
      resources:
        limits:
          memory: 512M
          cpus: "1.0"

Checklist summary:

  • restart: unless-stopped or always configured
  • Memory and CPU resource limits set
  • Health check (HEALTHCHECK) configured
  • Log driver max-size, max-file rotation set
  • Process runs as non-root user
  • read_only: true filesystem applied
  • no-new-privileges:true security option set
  • .env file added to .gitignore
  • Image vulnerability scanning (Trivy/Scout) integrated in CI
  • depends_on + condition: service_healthy for startup order guarantee
  • Volume data backup strategy established
  • Network internal: true to minimize external exposure
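The last checklist item can be sketched as a Compose fragment (hypothetical network names; only nginx publishes a port, while backend services sit on an internal-only network):

```yaml
networks:
  frontend: {}          # reachable via published ports
  backend:
    internal: true      # no external connectivity on this network

services:
  nginx:
    image: nginx:1.25-alpine
    ports:
      - "80:80"
    networks: [frontend, backend]

  app:
    image: my-app:latest
    networks: [backend]   # never exposed directly to the outside
```

Services on the internal network can still talk to each other by service name; only the proxy bridges to the outside world.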