Cron Job Monitoring

Best practices for monitoring cron jobs with comprehensive error capture

Overview

Cron jobs are the backbone of automated system maintenance, but they often fail silently. This guide shows you how to integrate Telemetry.host monitoring into your cron jobs to catch failures immediately.

Basic Integration

Simple Success/Failure Check

The simplest integration uses the exit code of your script:

# Add to crontab
0 2 * * * /path/to/backup.sh && curl -X POST https://telemetry.host/ping/{MONITOR_ID} || curl -X POST https://telemetry.host/ping/{MONITOR_ID} -d '{"status":"error"}'

This sends a check-in on success (&&) or error (||).

Capture Full Output

For better debugging, capture the full output of your script:

0 2 * * * /path/to/backup.sh 2>&1 | curl -X POST https://telemetry.host/ping/{MONITOR_ID} -H "Content-Type: text/plain" --data-binary @-

This pipes both stdout and stderr (2>&1) directly to your monitor.

Exit Code Handling

Always check exit codes in your scripts:

#!/bin/bash
# backup.sh

set -e  # Exit on any error

# Your backup commands
pg_dump mydb > /backups/mydb.sql
gzip /backups/mydb.sql

# If we get here, everything succeeded
exit 0

Custom Exit Codes

Use different exit codes for different failure types:

#!/bin/bash
# backup.sh

# Validate prerequisites
if [ ! -d "/backups" ]; then
    echo "Backup directory missing"
    exit 1  # Configuration error
fi

if ! pg_isready -q; then
    echo "Database not available"
    exit 2  # Service unavailable
fi

# Perform backup
if ! pg_dump mydb > /backups/mydb.sql; then
    echo "Backup failed"
    exit 3  # Backup error
fi

echo "Backup successful"
exit 0

Wrapper Script Pattern

Create a reusable wrapper for all your cron jobs:

#!/bin/bash
# cron-wrapper.sh
# Usage: cron-wrapper.sh MONITOR_ID command args...

MONITOR_ID="$1"
shift
COMMAND="$@"

# Create temp file for output
TMPFILE=$(mktemp)

# Run command and capture output and exit code
set +e
$COMMAND > "$TMPFILE" 2>&1
EXIT_CODE=$?
set -e

# Read output
OUTPUT=$(cat "$TMPFILE")
rm "$TMPFILE"

# Determine status
if [ $EXIT_CODE -eq 0 ]; then
    STATUS="success"
else
    STATUS="error"
fi

# Send check-in with full output
curl -s -X POST "https://telemetry.host/ping/$MONITOR_ID" \
    -H "Content-Type: application/json" \
    -d "{
        \"status\": \"$STATUS\",
        \"message\": $(echo "$OUTPUT" | jq -Rs .),
        \"exit_code\": $EXIT_CODE
    }"

# Preserve original exit code
exit $EXIT_CODE

Use it in your crontab:

0 2 * * * /usr/local/bin/cron-wrapper.sh {MONITOR_ID} /path/to/backup.sh

Timeout vs Frequency Mode

Use Timeout Mode When:

You know how often the job should run
The schedule is predictable
You want simple configuration

# Daily backup - alert if no check-in within 25 hours
0 2 * * * /path/to/backup.sh 2>&1 | curl -X POST \
  https://telemetry.host/ping/{PROJECT_KEY}/timeout/25h/daily-backup?create=1 \
  -H "Content-Type: text/plain" --data-binary @-

Use Auto Mode When:

The job runs at variable intervals
You’re not sure of the exact schedule
The job adapts based on workload

# Adaptive task
*/5 * * * * /path/to/adaptive-task.sh 2>&1 | curl -X POST \
  https://telemetry.host/ping/{PROJECT_KEY}/auto/adaptive-task?create=1 \
  -H "Content-Type: text/plain" --data-binary @-

Common Patterns

Database Backup

#!/bin/bash
# db-backup.sh

set -e

BACKUP_DIR="/backups"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/mydb_$TIMESTAMP.sql.gz"

# Perform backup
pg_dump mydb | gzip > "$BACKUP_FILE"

# Get backup size
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)

# Report success with metadata
curl -s -X POST https://telemetry.host/ping/{MONITOR_ID} \
    -H "Content-Type: application/json" \
    -d "{
        \"status\": \"success\",
        \"message\": \"Backup completed: $SIZE\",
        \"metadata\": {
            \"file\": \"$BACKUP_FILE\",
            \"size\": \"$SIZE\"
        }
    }"

Log Rotation

#!/bin/bash
# rotate-logs.sh

LOGDIR="/var/log/myapp"
COUNT=$(find "$LOGDIR" -name "*.log" -type f | wc -l)

# Rotate logs
find "$LOGDIR" -name "*.log" -mtime +7 -delete

DELETED=$(($COUNT - $(find "$LOGDIR" -name "*.log" -type f | wc -l)))

curl -s -X POST https://telemetry.host/ping/{MONITOR_ID} \
    -d "Rotated $DELETED log files"

Certificate Renewal

#!/bin/bash
# certbot-renew.sh

OUTPUT=$(certbot renew 2>&1)
EXIT_CODE=$?

if [ $EXIT_CODE -eq 0 ]; then
    # Check if any certs were actually renewed
    if echo "$OUTPUT" | grep -q "Certificate not yet due for renewal"; then
        MESSAGE="All certificates up to date"
        STATUS="success"
    else
        MESSAGE="Certificates renewed successfully"
        STATUS="success"
        # Reload web server
        systemctl reload nginx
    fi
else
    MESSAGE="Certificate renewal failed"
    STATUS="error"
fi

echo "$MESSAGE" | curl -X POST https://telemetry.host/ping/{MONITOR_ID} \
    -H "Content-Type: text/plain" \
    -d "status: $STATUS" \
    --data-binary @-

System Updates

#!/bin/bash
# system-update.sh

set -e

# Update package lists
apt-get update -qq

# List upgradeable packages
UPGRADEABLE=$(apt-get -s upgrade | grep -c "^Inst")

if [ "$UPGRADEABLE" -gt 0 ]; then
    # Perform upgrade
    DEBIAN_FRONTEND=noninteractive apt-get upgrade -y -qq

    curl -s -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -d "Updated $UPGRADEABLE packages"
else
    curl -s -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -d "System up to date, no updates needed"
fi

Capturing stderr Separately

Sometimes you want to treat stderr differently:

#!/bin/bash
# capture-separate.sh

STDOUT_FILE=$(mktemp)
STDERR_FILE=$(mktemp)

# Run command with separate streams
/path/to/command > "$STDOUT_FILE" 2> "$STDERR_FILE"
EXIT_CODE=$?

STDOUT=$(cat "$STDOUT_FILE")
STDERR=$(cat "$STDERR_FILE")

rm "$STDOUT_FILE" "$STDERR_FILE"

if [ $EXIT_CODE -eq 0 ]; then
    # Success - send stdout
    echo "$STDOUT" | curl -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -H "Content-Type: text/plain" --data-binary @-
else
    # Failure - send both
    {
        echo "STDOUT:"
        echo "$STDOUT"
        echo ""
        echo "STDERR:"
        echo "$STDERR"
    } | curl -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -H "Content-Type: text/plain" --data-binary @-
fi

exit $EXIT_CODE

Parallel Jobs

For jobs that run multiple tasks in parallel:

#!/bin/bash
# parallel-jobs.sh

FAILED=0

# Start jobs in background
backup_db &
JOB1=$!

backup_files &
JOB2=$!

sync_remote &
JOB3=$!

# Wait for all jobs
wait $JOB1 || FAILED=$((FAILED+1))
wait $JOB2 || FAILED=$((FAILED+1))
wait $JOB3 || FAILED=$((FAILED+1))

if [ $FAILED -eq 0 ]; then
    curl -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -d '{"status":"success","message":"All tasks completed"}'
else
    curl -X POST https://telemetry.host/ping/{MONITOR_ID} \
        -d "{\"status\":\"error\",\"message\":\"$FAILED tasks failed\"}"
fi

Timeout Handling

Prevent runaway jobs with timeout:

# Use timeout command (coreutils)
0 2 * * * timeout 1h /path/to/backup.sh 2>&1 | curl -X POST https://telemetry.host/ping/{MONITOR_ID} -H "Content-Type: text/plain" --data-binary @-

Or in your script:

#!/bin/bash
# with-timeout.sh

TIMEOUT=3600  # 1 hour in seconds
MONITOR_ID="your_monitor_id_here"

timeout "$TIMEOUT" /path/to/long-running-task
EXIT_CODE=$?

if [ $EXIT_CODE -eq 124 ]; then
    curl -s -o /dev/null -X POST "https://telemetry.host/ping/$MONITOR_ID" \
         -H "Content-Type: application/json" \
         -d '{"status":"error","message":"Task timed out after 1 hour"}'
    exit 1
elif [ $EXIT_CODE -ne 0 ]; then
    curl -s -o /dev/null -X POST "https://telemetry.host/ping/$MONITOR_ID" \
         -H "Content-Type: application/json" \
         -d "{\"status\":\"error\",\"message\":\"Task failed with code $EXIT_CODE\"}"
    exit $EXIT_CODE
fi

# Success
curl -s -o /dev/null -X POST "https://telemetry.host/ping/$MONITOR_ID" \
     -H "Content-Type: application/json" \
     -d '{"status":"ok","message":"Task completed successfully"}'
exit 0

Best Practices

1. Always Capture Output

Don’t just check if the job runs - capture what it does:

# ❌ Bad: No output captured
0 2 * * * /path/to/backup.sh > /dev/null 2>&1 && curl -X POST {MONITOR_URL}

# ✅ Good: Output captured for debugging
0 2 * * * /path/to/backup.sh 2>&1 | curl -X POST {MONITOR_URL} -H "Content-Type: text/plain" --data-binary @-

2. Use Meaningful Messages

Include context in your check-ins:

# ❌ Bad: No context
echo "Done" | curl -X POST {MONITOR_URL} -H "Content-Type: text/plain" --data-binary @-

# ✅ Good: Actionable information
echo "Backup completed: 2.5GB in 120 seconds, 15 tables" | curl -X POST {MONITOR_URL} -H "Content-Type: text/plain" --data-binary @-

3. Set Appropriate Timeouts

Give yourself buffer time:

# If job runs daily at 2 AM, set timeout to 25-26 hours
# This accounts for occasional delays
https://telemetry.host/ping/{PROJECT_KEY}/timeout/26h/daily-backup

4. Test Your Monitoring

Before deploying:

# Test success case
/path/to/script.sh && echo "Test passed"

# Test failure case
/path/to/script.sh --force-error && echo "Failure detected"

# Verify monitoring receives both cases

5. Monitor the Monitor

Set up a separate monitor for critical monitoring scripts:

# Meta-monitoring: ensure monitoring itself works
*/15 * * * * curl -s https://telemetry.host/ping/{META_MONITOR} -d '{"status":"success"}' -o /dev/null

Troubleshooting

Cron Job Runs But No Check-In

Check:

Is curl installed? (which curl)
Does the script have internet access?
Are there firewall rules blocking outbound HTTPS?
Check cron logs: grep CRON /var/log/syslog

Monitor Shows Old Check-Ins

Check:

Is the cron job actually running? (crontab -l)
Check cron daemon status: systemctl status cron
Verify monitor ID is correct
Test manually: curl -X POST {MONITOR_URL}

Getting Error Notifications But Job Succeeds

Check:

Exit code: add echo $? after your command
Stderr output: might contain warnings treated as errors
Use wrapper script to inspect actual exit codes

Cron Job Monitoring

Overview

Basic Integration

Simple Success/Failure Check

Capture Full Output

Exit Code Handling

Custom Exit Codes

Wrapper Script Pattern

Timeout vs Frequency Mode

Use Timeout Mode When:

Use Auto Mode When:

Common Patterns

Database Backup

Log Rotation

Certificate Renewal

System Updates

Capturing stderr Separately

Parallel Jobs

Timeout Handling

Best Practices

1. Always Capture Output

2. Use Meaningful Messages

3. Set Appropriate Timeouts

4. Test Your Monitoring

5. Monitor the Monitor

Troubleshooting

Cron Job Runs But No Check-In

Monitor Shows Old Check-Ins

Getting Error Notifications But Job Succeeds

Next Steps

Send Feedback