Tutorial: Monitoring Database Backups with Smart Error Detection
Step-by-step guide to building a production-ready backup monitoring system with intelligent failure detection
Database backups are critical, until they’re not running. In this tutorial, we’ll build a production-ready backup monitoring system that:
- ✅ Captures full backup logs for debugging
- ✅ Detects failures automatically
- ✅ Alerts only on real problems (no false positives)
- ✅ Tracks backup size and duration trends
- ✅ Verifies backup integrity
- ✅ Manages backup retention
By the end, you’ll have a robust backup system that you can trust.
Prerequisites
- PostgreSQL database
- Telemetry.host account (free signup)
- Linux server with cron access
- Basic bash scripting knowledge
Step 1: Create Your Monitor
First, let’s create a monitor for the backup job.
Option A: Via Dashboard
- Log in to telemetry.host
- Click “Add Monitor”
- Fill in:
- Name: “Production DB Backup”
- Mode: Timeout
- Timeout: 26h (daily backup with 2h buffer)
- Copy the monitor URL
Option B: Auto-Provisioning (Recommended)
We’ll use auto-provisioning so the monitor is created automatically on first run:
PROJECT_KEY="your_project_key" # Get from dashboard
MONITOR_URL="https://telemetry.host/ping/${PROJECT_KEY}/timeout/26h/prod-db-backup?create=1"
The monitor will be created when the first check-in arrives.
Step 2: Basic Backup Script
Let’s start with a simple backup script:
#!/bin/bash
# /usr/local/bin/db-backup.sh
set -euo pipefail
# Configuration
DB_NAME="production"
DB_USER="postgres"
BACKUP_DIR="/backups/postgres"
MONITOR_URL="https://telemetry.host/ping/PROJECT_KEY/timeout/26h/prod-backup?create=1"
# Ensure backup directory exists
mkdir -p "$BACKUP_DIR"
# Create backup with timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.sql.gz"
# Perform backup
echo "Starting backup of $DB_NAME..."
pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"
# Report success
curl -X POST "$MONITOR_URL" \
-H "Content-Type: application/json" \
-d "{\"status\":\"success\",\"message\":\"Backup completed\"}"
echo "Backup completed: $BACKUP_FILE"
Make it executable:
chmod +x /usr/local/bin/db-backup.sh
Step 3: Add Error Handling
Now let’s make it production-ready with proper error handling:
#!/bin/bash
# /usr/local/bin/db-backup.sh
set -euo pipefail
# Configuration
DB_NAME="production"
DB_USER="postgres"
BACKUP_DIR="/backups/postgres"
MONITOR_URL="https://telemetry.host/ping/PROJECT_KEY/timeout/26h/prod-backup?create=1"
# Create log file for this run
LOG_FILE="/tmp/backup_$(date +%Y%m%d_%H%M%S).log"
# Function to report and cleanup
cleanup() {
local exit_code=$?
if [ $exit_code -eq 0 ]; then
# Success - send log
cat "$LOG_FILE" | curl -X POST "$MONITOR_URL" \
-H "Content-Type: text/plain" \
--data-binary @-
else
# Failure - send error log
{
echo "❌ BACKUP FAILED"
echo "Exit code: $exit_code"
echo ""
cat "$LOG_FILE"
} | curl -X POST "$MONITOR_URL" \
-H "Content-Type: text/plain" \
--data-binary @-
fi
rm -f "$LOG_FILE"
}
trap cleanup EXIT
# Redirect all output to log
exec > >(tee "$LOG_FILE") 2>&1
echo "=== Database Backup Started at $(date) ==="
# Check prerequisites
echo "Checking prerequisites..."
if [ ! -d "$BACKUP_DIR" ]; then
echo "❌ ERROR: Backup directory $BACKUP_DIR does not exist"
exit 1
fi
if ! pg_isready -U "$DB_USER" -q; then
echo "❌ ERROR: PostgreSQL is not ready"
exit 1
fi
# Create backup
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.sql.gz"
START_TIME=$(date +%s)
echo "Creating backup: $BACKUP_FILE"
pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
echo "✅ Backup created in ${DURATION}s"
echo "=== Backup Completed at $(date) ==="
Step 4: Add Backup Verification
Verify the backup is valid before reporting success:
# Add after backup creation
echo "Verifying backup integrity..."
if gunzip -t "$BACKUP_FILE" 2>/dev/null; then
echo "✅ Backup verification passed"
else
echo "❌ ERROR: Backup verification failed"
exit 1
fi
# Get backup info
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
echo "Backup size: $SIZE"
Step 5: Add Metadata and Metrics
Send structured data for better tracking:
# Replace the success report with:
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
SIZE_BYTES=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat -c%s "$BACKUP_FILE")
# Count tables backed up
TABLES=$(pg_dump -U "$DB_USER" "$DB_NAME" --schema-only | grep "CREATE TABLE" | wc -l)
# Report with metadata
curl -X POST "$MONITOR_URL" \
-H "Content-Type: application/json" \
-d "{
\"status\": \"success\",
\"message\": \"Backup completed: $SIZE in ${DURATION}s\",
\"duration\": $DURATION,
\"metadata\": {
\"file\": \"$BACKUP_FILE\",
\"size\": \"$SIZE\",
\"size_bytes\": $SIZE_BYTES,
\"tables\": $TABLES,
\"database\": \"$DB_NAME\"
}
}"
Step 6: Implement Retention Policy
Clean up old backups automatically:
# Add before final report
RETENTION_DAYS=30
echo "Cleaning up backups older than $RETENTION_DAYS days..."
BEFORE_COUNT=$(find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" | wc -l)
find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" -mtime +$RETENTION_DAYS -delete
AFTER_COUNT=$(find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" | wc -l)
DELETED=$((BEFORE_COUNT - AFTER_COUNT))
echo "Deleted $DELETED old backups, $AFTER_COUNT retained"
# Add to metadata:
# "backups_retained": $AFTER_COUNT,
# "backups_deleted": $DELETED
Step 7: Add Retry Logic
Handle transient failures gracefully:
# Add retry function
backup_with_retry() {
local max_attempts=3
local attempt=1
local delay=60
while [ $attempt -le $max_attempts ]; do
echo "Backup attempt $attempt of $max_attempts..."
if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"; then
return 0
fi
echo "⚠️ Attempt $attempt failed"
attempt=$((attempt + 1))
if [ $attempt -le $max_attempts ]; then
echo "Retrying in ${delay}s..."
sleep $delay
fi
done
return 1
}
# Use it:
if ! backup_with_retry; then
echo "❌ ERROR: Backup failed after $max_attempts attempts"
exit 1
fi
Step 8: Schedule with Cron
Add to crontab:
crontab -e
# Add this line:
0 2 * * * /usr/local/bin/db-backup.sh
This runs the backup daily at 2 AM.
Step 9: Test Your Setup
Test Success Case
sudo -u postgres /usr/local/bin/db-backup.sh
Check:
- ✅ Backup file created
- ✅ Check-in appears in dashboard
- ✅ Status shows “UP”
Test Failure Case
Simulate a failure:
# Temporarily make backup dir unwritable
sudo chmod 000 /backups/postgres
sudo -u postgres /usr/local/bin/db-backup.sh
sudo chmod 755 /backups/postgres
Check:
- ✅ Error logged in dashboard
- ✅ Alert notification received
- ✅ Monitor status shows “DOWN”
Step 10: Set Up Notifications
Configure alerts for failures:
- Go to monitor settings
- Add Discord webhook or email
- Set to “Failures only”
- Test notification
- Configure team access if needed
Complete Production Script
Here’s the final version with all features:
#!/bin/bash
# /usr/local/bin/db-backup.sh
# Production-ready PostgreSQL backup with monitoring
set -euo pipefail
# === Configuration ===
DB_NAME="${DB_NAME:-production}"
DB_USER="${DB_USER:-postgres}"
BACKUP_DIR="${BACKUP_DIR:-/backups/postgres}"
MONITOR_URL="${MONITOR_URL:-https://telemetry.host/ping/KEY/timeout/26h/backup?create=1}"
RETENTION_DAYS=30
MAX_RETRIES=3
RETRY_DELAY=60
# === Setup ===
LOG_FILE="/tmp/backup_$(date +%Y%m%d_%H%M%S).log"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${TIMESTAMP}.sql.gz"
# === Functions ===
cleanup() {
local exit_code=$?
local status="success"
local message="Backup completed"
if [ $exit_code -ne 0 ]; then
status="error"
message="Backup failed (exit code: $exit_code)"
fi
# Send report with full log
{
echo "Status: $status"
echo "Message: $message"
echo ""
cat "$LOG_FILE"
} | curl -s -X POST "$MONITOR_URL" \
-H "Content-Type: text/plain" \
--data-binary @-
rm -f "$LOG_FILE"
}
trap cleanup EXIT
exec > >(tee "$LOG_FILE") 2>&1
backup_with_retry() {
for attempt in $(seq 1 $MAX_RETRIES); do
echo "Attempt $attempt of $MAX_RETRIES..."
if pg_dump -U "$DB_USER" "$DB_NAME" | gzip > "$BACKUP_FILE"; then
return 0
fi
echo "Failed, retrying in ${RETRY_DELAY}s..."
[ $attempt -lt $MAX_RETRIES ] && sleep $RETRY_DELAY
done
return 1
}
# === Main Script ===
echo "=== Backup Started: $(date) ==="
# Validate
[ -d "$BACKUP_DIR" ] || { echo "ERROR: Backup dir missing"; exit 1; }
pg_isready -U "$DB_USER" -q || { echo "ERROR: DB not ready"; exit 1; }
# Backup
START=$(date +%s)
backup_with_retry || { echo "ERROR: Backup failed"; exit 1; }
DURATION=$(($(date +%s) - START))
# Verify
gunzip -t "$BACKUP_FILE" || { echo "ERROR: Verification failed"; exit 1; }
# Info
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
echo "✅ Backup: $SIZE in ${DURATION}s"
# Cleanup
find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" -mtime +$RETENTION_DAYS -delete
RETAINED=$(find "$BACKUP_DIR" -name "${DB_NAME}_*.sql.gz" | wc -l)
echo "Retained $RETAINED backups"
echo "=== Completed: $(date) ==="
Monitoring Dashboard
Your dashboard now shows:
- Last backup time
- Backup size trends
- Duration trends
- Success rate
- Alert history
Troubleshooting
Backup Runs But Not Monitored
Check:
- curl is installed:
which curl - Server has internet access
- Monitor URL is correct
- No firewall blocking HTTPS
False Positives
Check:
- Timeout is appropriate (26h for daily)
- Cron schedule matches timeout
- Server time is correct:
date
Missing Notifications
Check:
- Notification channel configured
- Test notification works
- Check spam folder (email)
- Webhook URL is valid (Discord/Slack)
Next Steps
- Add offsite backup copying
- Monitor backup restoration tests
- Set up team collaboration
- Explore multiple database backups
Conclusion
You now have a production-ready backup system with comprehensive monitoring. Your backups will never fail silently again, and when they do fail, you’ll have all the context you need to fix them quickly.
Questions? Use the feedback button in the dashboard or check our documentation.
Happy monitoring! 🚀