Tutorial: Monitoring Database Backups with Smart Error Detection

Database backups are critical, until they’re not running. In this tutorial, we’ll build a production-ready backup monitoring system that:

✅ Captures full backup logs for debugging
✅ Detects failures automatically
✅ Alerts only on real problems (no false positives)
✅ Tracks backup size and duration trends
✅ Verifies backup integrity
✅ Manages backup retention

By the end, you’ll have a robust backup system that you can trust.

Prerequisites

PostgreSQL database
Telemetry.host account (free signup)
Linux server with cron access
Basic bash scripting knowledge

Step 1: Create Your Monitor

First, let’s create a monitor for the backup job.

Option A: Via Dashboard

Log in to telemetry.host
Click “Add Monitor”
Fill in:
- Name: “Production DB Backup”
- Mode: Timeout
- Timeout: 26h (daily backup with 2h buffer)
Copy the monitor URL

Option B: Auto-Provisioning (Recommended)

We’ll use auto-provisioning so the monitor is created automatically on first run:

PROJECT_KEY="your_project_key"  # Get from dashboard
MONITOR_URL="https://telemetry.host/ping/${PROJECT_KEY}/timeout/26h/prod-db-backup?create=1"

The monitor will be created when the first check-in arrives.

Step 2: Basic Backup Script

Let’s start with a simple backup script:

#!/bin/bash
# /usr/local/bin/db-backup.sh

set -euo pipefail

# Configuration
DB_NAME="production"
DB_USER="postgres"
BACKUP_DIR="/backups/postgres"
MONITOR_URL="https://telemetry.host/ping/PROJECT_KEY/timeout/26h/prod-backup?create=1"

# Ensure backup directory exists
mkdir -p "$BACKUP_DIR"

# Create backup with timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.sql.gz"

# Perform backup
echo "Starting backup of $DB_NAME..."
pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"

# Report success
curl -X POST "$MONITOR_URL" \
    -H "Content-Type: application/json" \
    -d "{\"status\":\"success\",\"message\":\"Backup completed\"}"

echo "Backup completed: $BACKUP_FILE"

Make it executable:

chmod +x /usr/local/bin/db-backup.sh

Step 3: Add Error Handling

Now let’s make it production-ready with proper error handling:

#!/bin/bash
# /usr/local/bin/db-backup.sh

set -euo pipefail

# Configuration
DB_NAME="production"
DB_USER="postgres"
BACKUP_DIR="/backups/postgres"
MONITOR_URL="https://telemetry.host/ping/PROJECT_KEY/timeout/26h/prod-backup?create=1"

# Create log file for this run
LOG_FILE="/tmp/backup_$(date +%Y%m%d_%H%M%S).log"

# Function to report and cleanup
cleanup() {
    local exit_code=$?
    
    if [ $exit_code -eq 0 ]; then
        # Success - send log
        cat "$LOG_FILE" | curl -X POST "$MONITOR_URL" \
            -H "Content-Type: text/plain" \
            --data-binary @-
    else
        # Failure - send error log
        {
            echo "❌ BACKUP FAILED"
            echo "Exit code: $exit_code"
            echo ""
            cat "$LOG_FILE"
        } | curl -X POST "$MONITOR_URL" \
            -H "Content-Type: text/plain" \
            --data-binary @-
    fi
    
    rm -f "$LOG_FILE"
}

trap cleanup EXIT

# Redirect all output to log
exec > >(tee "$LOG_FILE") 2>&1

echo "=== Database Backup Started at $(date) ==="

# Check prerequisites
echo "Checking prerequisites..."
if [ ! -d "$BACKUP_DIR" ]; then
    echo "❌ ERROR: Backup directory $BACKUP_DIR does not exist"
    exit 1
fi

if ! pg_isready -U "$DB_USER" -q; then
    echo "❌ ERROR: PostgreSQL is not ready"
    exit 1
fi

# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.sql.gz"
START_TIME=$(date +%s)

echo "Creating backup: $BACKUP_FILE"
pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))

echo "✅ Backup created in ${DURATION}s"
echo "=== Backup Completed at $(date) ==="

Step 4: Add Backup Verification

Verify the backup is valid before reporting success:

# Add after backup creation

echo "Verifying backup integrity..."
if gunzip -t "$BACKUP_FILE" 2>/dev/null; then
    echo "✅ Backup verification passed"
else
    echo "❌ ERROR: Backup verification failed"
    exit 1
fi

# Get backup info
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
echo "Backup size: $SIZE"

Step 5: Add Metadata and Metrics

Send structured data for better tracking:

# Replace the success report with:

SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
SIZE_BYTES=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat -c%s "$BACKUP_FILE")

# Count tables backed up
TABLES=$(pg_dump -U "$DB_USER" "$DB_NAME" --schema-only | grep "CREATE TABLE" | wc -l)

# Report with metadata
curl -X POST "$MONITOR_URL" \
    -H "Content-Type: application/json" \
    -d "{
        \"status\": \"success\",
        \"message\": \"Backup completed: $SIZE in ${DURATION}s\",
        \"duration\": $DURATION,
        \"metadata\": {
            \"file\": \"$BACKUP_FILE\",
            \"size\": \"$SIZE\",
            \"size_bytes\": $SIZE_BYTES,
            \"tables\": $TABLES,
            \"database\": \"$DB_NAME\"
        }
    }"

Step 6: Implement Retention Policy

Clean up old backups automatically:

# Add before final report

RETENTION_DAYS=30
echo "Cleaning up backups older than $RETENTION_DAYS days..."

BEFORE_COUNT=$(find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" | wc -l)
find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" -mtime +$RETENTION_DAYS -delete
AFTER_COUNT=$(find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" | wc -l)
DELETED=$((BEFORE_COUNT - AFTER_COUNT))

echo "Deleted $DELETED old backups, $AFTER_COUNT retained"

# Add to metadata:
# "backups_retained": $AFTER_COUNT,
# "backups_deleted": $DELETED

Step 7: Add Retry Logic

Handle transient failures gracefully:

# Add retry function
backup_with_retry() {
    local max_attempts=3
    local attempt=1
    local delay=60
    
    while [ $attempt -le $max_attempts ]; do
        echo "Backup attempt $attempt of $max_attempts..."
        
        if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"; then
            return 0
        fi
        
        echo "⚠️  Attempt $attempt failed"
        attempt=$((attempt + 1))
        
        if [ $attempt -le $max_attempts ]; then
            echo "Retrying in ${delay}s..."
            sleep $delay
        fi
    done
    
    return 1
}

# Use it:
if ! backup_with_retry; then
    echo "❌ ERROR: Backup failed after $max_attempts attempts"
    exit 1
fi

Step 8: Schedule with Cron

Add to crontab:

crontab -e

# Add this line:
0 2 * * * /usr/local/bin/db-backup.sh

This runs the backup daily at 2 AM.

Step 9: Test Your Setup

Test Success Case

sudo -u postgres /usr/local/bin/db-backup.sh

Check:

✅ Backup file created
✅ Check-in appears in dashboard
✅ Status shows “UP”

Test Failure Case

Simulate a failure:

# Temporarily make backup dir unwritable
sudo chmod 000 /backups/postgres
sudo -u postgres /usr/local/bin/db-backup.sh
sudo chmod 755 /backups/postgres

Check:

✅ Error logged in dashboard
✅ Alert notification received
✅ Monitor status shows “DOWN”

Step 10: Set Up Notifications

Configure alerts for failures:

Go to monitor settings
Add Discord webhook or email
Set to “Failures only”
Test notification
Configure team access if needed

Complete Production Script

Here’s the final version with all features:

#!/bin/bash
# /usr/local/bin/db-backup.sh
# Production-ready PostgreSQL backup with monitoring

set -euo pipefail

# === Configuration ===
DB_NAME="${DB_NAME:-production}"
DB_USER="${DB_USER:-postgres}"
BACKUP_DIR="${BACKUP_DIR:-/backups/postgres}"
MONITOR_URL="${MONITOR_URL:-https://telemetry.host/ping/KEY/timeout/26h/backup?create=1}"
RETENTION_DAYS=30
MAX_RETRIES=3
RETRY_DELAY=60

# === Setup ===
LOG_FILE="/tmp/backup_$(date +%Y%m%d_%H%M%S).log"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.sql.gz"

# === Functions ===
cleanup() {
    local exit_code=$?
    local status="success"
    local message="Backup completed"
    
    if [ $exit_code -ne 0 ]; then
        status="error"
        message="Backup failed (exit code: $exit_code)"
    fi
    
    # Send report with full log
    {
        echo "Status: $status"
        echo "Message: $message"
        echo ""
        cat "$LOG_FILE"
    } | curl -s -X POST "$MONITOR_URL" \
        -H "Content-Type: text/plain" \
        --data-binary @-
    
    rm -f "$LOG_FILE"
}

trap cleanup EXIT
exec > >(tee "$LOG_FILE") 2>&1

backup_with_retry() {
    for attempt in $(seq 1 $MAX_RETRIES); do
        echo "Attempt $attempt of $MAX_RETRIES..."
        if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"; then
            return 0
        fi
        echo "Failed, retrying in ${RETRY_DELAY}s..."
        [ $attempt -lt $MAX_RETRIES ] && sleep $RETRY_DELAY
    done
    return 1
}

# === Main Script ===
echo "=== Backup Started: $(date) ==="

# Validate
[ -d "$BACKUP_DIR" ] || { echo "ERROR: Backup dir missing"; exit 1; }
pg_isready -U "$DB_USER" -q || { echo "ERROR: DB not ready"; exit 1; }

# Backup
START=$(date +%s)
backup_with_retry || { echo "ERROR: Backup failed"; exit 1; }
DURATION=$(($(date +%s) - START))

# Verify
gunzip -t "$BACKUP_FILE" || { echo "ERROR: Verification failed"; exit 1; }

# Info
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
echo "✅ Backup: $SIZE in ${DURATION}s"

# Cleanup
find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" -mtime +$RETENTION_DAYS -delete
RETAINED=$(find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" | wc -l)
echo "Retained $RETAINED backups"

echo "=== Completed: $(date) ==="

Monitoring Dashboard

Your dashboard now shows:

Last backup time
Backup size trends
Duration trends
Success rate
Alert history

Troubleshooting

Backup Runs But Not Monitored

Check:

curl is installed: which curl
Server has internet access
Monitor URL is correct
No firewall blocking HTTPS

False Positives

Check:

Timeout is appropriate (26h for daily)
Cron schedule matches timeout
Server time is correct: date

Missing Notifications

Check:

Notification channel configured
Test notification works
Check spam folder (email)
Webhook URL is valid (Discord/Slack)

Next Steps

Conclusion

You now have a production-ready backup system with comprehensive monitoring. Your backups will never fail silently again, and when they do fail, you’ll have all the context you need to fix them quickly.

Questions? Use the feedback button in the dashboard or check our documentation.

Happy monitoring! 🚀