Docker Health Checks

Monitor Docker containers and orchestration platforms

Overview

Docker containers need monitoring just like traditional services. This guide shows how to integrate Telemetry.host with Docker health checks, Docker Compose, Kubernetes, and Docker Swarm.

Docker Container Health Checks

Basic Dockerfile Health Check

Add a health check that also reports to Telemetry.host:

FROM ubuntu:22.04

# Install curl for monitoring
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Your application setup
COPY app.sh /app.sh
RUN chmod +x /app.sh

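# Health check that monitors locally and reports externally.
# The trailing "|| true" keeps a failed ping from marking the container unhealthy,
# and --max-time keeps a slow ping inside the health check timeout.
HEALTHCHECK --interval=5m --timeout=10s \
  CMD /app.sh --health-check || exit 1; \
      curl -sf --max-time 5 -X POST https://telemetry.host/ping/{MONITOR_ID} || true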

CMD ["/app.sh"]

Separate Health Check Script

Create a dedicated health check script:

#!/bin/bash
# healthcheck.sh

# Check if application is responding
if curl -sf --max-time 3 http://localhost:8080/health > /dev/null; then
    # Application is healthy, report to monitoring.
    # --max-time keeps a slow ping inside the HEALTHCHECK timeout.
    curl -sf --max-time 5 -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -d '{"status":"success","message":"Container healthy"}'
    exit 0
else
    # Application is unhealthy
    curl -sf --max-time 5 -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -d '{"status":"error","message":"Container unhealthy"}'
    exit 1
fi

Use in Dockerfile:

COPY healthcheck.sh /healthcheck.sh
RUN chmod +x /healthcheck.sh

HEALTHCHECK --interval=5m --timeout=10s \
  CMD /healthcheck.sh

Docker Compose Integration

Method 1: Health Check in Compose

version: '3.8'

services:
  web:
    image: myapp:latest
    healthcheck:
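      # Check the app first (assumes it serves /health on port 8080 and the
      # image ships curl). The ping is best effort and cannot fail the check.
      test:
        - CMD-SHELL
        - >-
          curl -sf http://localhost:8080/health > /dev/null || exit 1;
          curl -sf --max-time 5 -X POST "https://telemetry.host/ping/${MONITOR_ID}"
          -d '{"status":"success"}' || true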
      interval: 5m
      timeout: 10s
      retries: 3
      start_period: 40s

Method 2: Separate Monitor Container

Create a dedicated monitoring sidecar:

version: '3.8'

services:
  web:
    image: myapp:latest
    
  monitor:
    image: curlimages/curl:latest
    depends_on:
      - web
    environment:
      - MONITOR_ID=${MONITOR_ID}
    command: >
      sh -c "
        while true; do
          if wget -q --spider http://web:8080/health; then
            curl -X POST https://telemetry.host/ping/$$MONITOR_ID -d '{\"status\":\"success\"}';
          else
            curl -X POST https://telemetry.host/ping/$$MONITOR_ID -d '{\"status\":\"error\"}';
          fi
          sleep 300;
        done
      "

Method 3: External Monitoring Script

Run monitoring from the Docker host:

#!/bin/bash
# docker-monitor.sh

CONTAINER_NAME="myapp"
MONITOR_ID="your-monitor-id"

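# Check if container is running. docker inspect matches the exact name,
# whereas the "name" filter of docker ps also matches substrings (e.g. "myapp-db").
if [ "$(docker inspect --format='{{.State.Running}}' "$CONTAINER_NAME" 2>/dev/null)" = "true" ]; then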
    # Container is running, check health
    HEALTH=$(docker inspect --format='{{.State.Health.Status}}' "$CONTAINER_NAME" 2>/dev/null)
    
    if [ "$HEALTH" = "healthy" ] || [ -z "$HEALTH" ]; then
        curl -X POST https://telemetry.host/ping/$MONITOR_ID \
            -d '{"status":"success","message":"Container running"}'
    else
        curl -X POST https://telemetry.host/ping/$MONITOR_ID \
            -d "{\"status\":\"error\",\"message\":\"Container unhealthy: $HEALTH\"}"
    fi
else
    curl -X POST https://telemetry.host/ping/$MONITOR_ID \
        -d '{"status":"error","message":"Container not running"}'
fi

Add to crontab:

*/5 * * * * /usr/local/bin/docker-monitor.sh

Kubernetes Integration

Liveness Probe with Monitoring

Create a health check endpoint that also reports to monitoring:

# health.py
from flask import Flask, jsonify
import requests
import os

app = Flask(__name__)

MONITOR_URL = os.getenv('TELEMETRY_MONITOR_URL')

@app.route('/health')
def health():
    # Check application health
    healthy = check_app_health()
    
    # Report to monitoring (async in production)
    try:
        status = "success" if healthy else "error"
        requests.post(MONITOR_URL, json={"status": status}, timeout=2)
    except requests.RequestException:
        pass  # Don't fail health check if monitoring fails
    
    if healthy:
        return jsonify({"status": "healthy"}), 200
    else:
        return jsonify({"status": "unhealthy"}), 503

def check_app_health():
    # Your health check logic
    return True
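
The comment in the handler suggests reporting asynchronously in production. One way to do that, sketched here with only the standard library, is to hand the report to a daemon thread so the `/health` response never waits on the network. `report_async` and the `send` callable are hypothetical names, not part of Flask or of any Telemetry.host client; `send` stands in for whatever function posts to `MONITOR_URL`.

```python
import threading


def report_async(send, payload):
    """Call send(payload) on a daemon thread and return the thread.

    Any exception raised by send is swallowed so that a monitoring
    outage can never turn into a failed health check.
    """
    def _run():
        try:
            send(payload)
        except Exception:
            pass  # Reporting is best effort by design

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread
```

In the handler, a call such as `report_async(lambda p: requests.post(MONITOR_URL, json=p, timeout=2), {"status": "success"})` could replace the inline `requests.post`. A thread per request is the simplest option; a queue with a single worker would bound resource use under heavy probing.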

Kubernetes deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
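spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: myapp:latest
        env:
        - name: TELEMETRY_MONITOR_URL
          value: "https://telemetry.host/ping/YOUR_MONITOR_ID"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 300  # Every 5 minutes
          timeoutSeconds: 5   # Default is 1s, shorter than the 2s reporting timeout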

CronJob Monitoring

Monitor Kubernetes CronJobs:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: postgres:15
            env:
            - name: MONITOR_URL
              value: "https://telemetry.host/ping/PROJECT_KEY/timeout/26h/k8s-backup?create=1"
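            # Assumes curl is installed in the image and /backup is a mounted volume
            command:
            - /bin/bash
            - -c
            - |
              # pipefail aborts the job when pg_dump fails; with plain "set -e"
              # only gzip's exit status is checked and the ping is still sent
              set -eo pipefail
              pg_dump -h postgres mydb | gzip > /backup/mydb.sql.gz
              echo "Backup completed" | curl -X POST "$MONITOR_URL" \
                -H "Content-Type: text/plain" --data-binary @-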
          restartPolicy: OnFailure

Job Success/Failure Monitoring

Monitor job completion:

apiVersion: batch/v1
kind: Job
metadata:
  name: data-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:latest
        env:
        - name: MONITOR_ID
          value: "YOUR_MONITOR_ID"
        command:
        - /bin/sh
        - -c
        - |
          if /app/migrate.sh; then
            # "|| true": a failed ping must not fail (and re-run) the migration
            curl -X POST https://telemetry.host/ping/$MONITOR_ID \
              -d '{"status":"success","message":"Migration completed"}' || true
          else
            curl -X POST https://telemetry.host/ping/$MONITOR_ID \
              -d '{"status":"error","message":"Migration failed"}'
            exit 1
          fi
      restartPolicy: Never
  backoffLimit: 3
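
The same pattern (run the job, report the outcome, keep the job's own exit code) can be expressed as a small reusable wrapper. The sketch below is a minimal illustration using only the standard library; `run_and_report` and the `report` callback are hypothetical names, and `report` stands in for whatever function posts to the ping URL.

```python
import subprocess
import sys


def run_and_report(command, report):
    """Run command, report the outcome, and return the command's exit code.

    report(status, message) is called with "success" or "error". A failure
    inside report is ignored, so the exit code always reflects the job itself.
    """
    result = subprocess.run(command)
    if result.returncode == 0:
        status, message = "success", "Job completed"
    else:
        status, message = "error", f"Job failed with exit code {result.returncode}"
    try:
        report(status, message)
    except Exception:
        pass  # Never let a reporting problem change the job's result
    return result.returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python run_and_report.py /app/migrate.sh
    sys.exit(run_and_report(sys.argv[1:], lambda s, m: print(f"{s}: {m}")))
```

Returning the original exit code matters for Kubernetes: `backoffLimit` retries should be driven by the job's result, not by whether the ping got through.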

Docker Swarm

Monitor services in Docker Swarm:

#!/bin/bash
# swarm-monitor.sh

SERVICE_NAME="myapp"
MONITOR_ID="your-monitor-id"

# Get service status
REPLICAS=$(docker service ls --filter "name=$SERVICE_NAME" --format '{{.Replicas}}')

if echo "$REPLICAS" | grep -q '/'; then
    RUNNING=$(echo "$REPLICAS" | cut -d'/' -f1)
    DESIRED=$(echo "$REPLICAS" | cut -d'/' -f2)
    
    if [ "$RUNNING" = "$DESIRED" ] && [ "$RUNNING" -gt 0 ]; then
        curl -X POST https://telemetry.host/ping/$MONITOR_ID \
            -d "{\"status\":\"success\",\"message\":\"$RUNNING/$DESIRED replicas running\"}"
    else
        curl -X POST https://telemetry.host/ping/$MONITOR_ID \
            -d "{\"status\":\"error\",\"message\":\"Only $RUNNING/$DESIRED replicas running\"}"
    fi
else
    curl -X POST https://telemetry.host/ping/$MONITOR_ID \
        -d '{"status":"error","message":"Service not found"}'
fi
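
The script above compares the two halves of the `Replicas` column as strings, which only works while the column is exactly `running/desired`. As an assumption about output that may vary by Docker version or service settings, the column can carry extra text after the counts. The following sketch parses only the leading numbers; `parse_replicas` and `service_status` are hypothetical helper names.

```python
import re


def parse_replicas(text):
    """Return (running, desired) from a Replicas string, or None if absent."""
    match = re.match(r"\s*(\d+)/(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def service_status(text):
    """Map a Replicas string to the status/message pair used by the script."""
    parsed = parse_replicas(text)
    if parsed is None:
        return "error", "Service not found"
    running, desired = parsed
    if running == desired and running > 0:
        return "success", f"{running}/{desired} replicas running"
    return "error", f"Only {running}/{desired} replicas running"
```

Comparing integers rather than strings also avoids surprises with leading whitespace or empty output.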

Docker Events Monitoring

Monitor Docker events for container crashes:

#!/usr/bin/env python3
# docker-event-monitor.py

import docker
import requests
import os

client = docker.from_env()
MONITOR_ID = os.getenv('MONITOR_ID')

def send_check_in(status, message):
    try:
        requests.post(
            f'https://telemetry.host/ping/{MONITOR_ID}',
            json={'status': status, 'message': message},
            timeout=5
        )
    except Exception as e:
        print(f"Failed to send check-in: {e}")

# Monitor container events
for event in client.events(decode=True):
    # .get() avoids a KeyError on events that lack these fields
    if event.get('Type') == 'container':
        status = event.get('status', '')
        container_name = event['Actor']['Attributes'].get('name', 'unknown')
        
        if status == 'die':
            exit_code = event['Actor']['Attributes'].get('exitCode', 'unknown')
            if exit_code != '0':
                send_check_in('error', 
                    f"Container {container_name} died with exit code {exit_code}")
        
        elif status == 'health_status: unhealthy':
            send_check_in('error', 
                f"Container {container_name} became unhealthy")

Run as a service:

# /etc/systemd/system/docker-monitor.service
[Unit]
Description=Docker Event Monitor
After=docker.service
Requires=docker.service

[Service]
Environment="MONITOR_ID=your-monitor-id"
ExecStart=/usr/local/bin/docker-event-monitor.py
Restart=always

[Install]
WantedBy=multi-user.target

Best Practices

1. Separate Health Checks from Monitoring

Don’t let monitoring failures affect container health:

# ✅ Good: Health check succeeds even if monitoring fails
curl -f http://localhost:8080/health && \
  (curl -X POST https://telemetry.host/ping/{ID} || true)

2. Use Auto Mode for Scaled Services

For services with auto-scaling:

environment:
  - MONITOR_URL=https://telemetry.host/ping/PROJECT_KEY/auto/scaled-service?create=1

Auto mode adapts to changing check-in frequency as replicas scale.

3. Monitor at Multiple Levels

  • Container level: Individual container health
  • Service level: Overall service availability
  • Job level: Batch job completion

4. Set Appropriate Intervals

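Set the monitor timeout slightly longer than the health check interval:

healthcheck:
  interval: 5m  # Check every 5 minutes
  
# Set the monitor timeout to 6-7 minutes so a slightly late check-in does not alert.
# Tolerating one fully missed check would need a timeout above 10 minutes.
# https://telemetry.host/ping/KEY/timeout/7m/container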
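
To reason about how much slack a given timeout provides, the arithmetic can be written down explicitly. This is a rough sketch that assumes the monitor alerts once the timeout elapses without a check-in; the function names and the 2-minute default grace are illustrative, not Telemetry.host recommendations.

```python
def timeout_with_grace(interval_minutes, grace_minutes=2):
    """Monitor timeout: the check interval plus some slack for late pings."""
    return interval_minutes + grace_minutes


def tolerated_misses(interval_minutes, timeout_minutes):
    """Count consecutive skipped check-ins that fit inside the timeout.

    Assumes the monitor alerts once timeout_minutes pass without a ping.
    Returns -1 when the timeout is not longer than the interval.
    """
    if timeout_minutes <= interval_minutes:
        return -1
    return (timeout_minutes - 1) // interval_minutes - 1


def timeout_url(key, name, timeout_minutes):
    """Build a ping URL in the /timeout/<N>m/<name> form used in this guide."""
    return f"https://telemetry.host/ping/{key}/timeout/{timeout_minutes}m/{name}"
```

With a 5-minute interval, `tolerated_misses(5, 7)` is 0 and `tolerated_misses(5, 11)` is 1, so a 7-minute timeout absorbs lateness but not a skipped check.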

5. Include Context in Messages

CONTAINER_ID=$(hostname)
curl -X POST https://telemetry.host/ping/{ID} \
  -d "{\"status\":\"success\",\"message\":\"Container $CONTAINER_ID healthy\"}"
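
Building JSON by shell string interpolation breaks as soon as a value contains a quote or backslash. Where Python is available in the container, a serializer avoids that. The sketch below only builds the request body; `build_payload` is a hypothetical helper, and the `status` and `message` fields are the ones used throughout this guide.

```python
import json
import socket


def build_payload(status, detail, container_id=None):
    """Return a JSON body with the container's hostname in the message."""
    if container_id is None:
        # Inside Docker the hostname defaults to the short container ID
        container_id = socket.gethostname()
    message = f"Container {container_id} {detail}"
    return json.dumps({"status": status, "message": message})
```

The returned string can be passed as the body of the POST request in place of the hand-escaped JSON.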

Troubleshooting

Health Checks Pass But No Monitoring

Check:

  • Container has internet access
  • DNS resolution works inside container
  • Firewall allows outbound HTTPS
  • Test manually: docker exec <container> curl https://telemetry.host
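
When the manual `curl` test fails, it helps to know whether DNS or the outbound connection is the problem. The sketch below separates the two steps with the standard `socket` module. `diagnose` is a hypothetical helper, and it assumes Python is available inside the container; the `resolve` and `connect` parameters exist only so the logic can be exercised without a network.

```python
import socket


def diagnose(host, port=443, resolve=socket.getaddrinfo,
             connect=socket.create_connection, timeout=5):
    """Return (step, ok, detail) tuples for DNS and TCP reachability."""
    steps = []
    try:
        resolve(host, port)
        steps.append(("dns", True, "resolved"))
    except OSError as exc:
        steps.append(("dns", False, str(exc)))
        return steps  # No point testing the connection without an address
    try:
        conn = connect((host, port), timeout=timeout)
        conn.close()
        steps.append(("connect", True, f"port {port} reachable"))
    except OSError as exc:
        steps.append(("connect", False, str(exc)))
    return steps
```

Calling `diagnose("telemetry.host")` from inside the container returns one entry per step. A failed `dns` entry points at resolver configuration, while a failed `connect` entry points at routing or a firewall.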

Monitoring Works But Health Checks Fail

Check:

  • Health check timeout is sufficient
  • Application starts before first health check
  • start_period is long enough for initialization

Too Many False Positives

Causes:

  • Health check interval too aggressive
  • Network hiccups causing temporary failures
  • Cold starts taking longer than expected

Solutions:

  • Increase interval and timeout
  • Use retries to allow transient failures
  • Increase start_period for slow-starting apps
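
To see why `retries` suppresses transient failures, it helps to model the rule Docker documents for it: a container becomes unhealthy only after that many consecutive failed checks, and a passing check resets the count. The following is a simplified simulation that ignores `start_period`; `health_states` is a hypothetical name.

```python
def health_states(results, retries=3):
    """Yield the health status after each probe result.

    results is an iterable of booleans (True = probe passed). The status
    turns "unhealthy" only after `retries` consecutive failures, and any
    success resets the counter.
    """
    failures = 0
    status = "starting"
    for passed in results:
        if passed:
            failures = 0
            status = "healthy"
        else:
            failures += 1
            if failures >= retries:
                status = "unhealthy"
        yield status
```

Feeding it `[True, False, False, True]` with the default of 3 retries keeps the status at "healthy" throughout, because the two failures are interrupted by a success before the threshold is reached.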

Next Steps