Backup and Recovery Performance

Proactive Health

Need to validate the backup strategy and confirm that recovery time objectives are being met.

Prompt: "I need to verify our PostgreSQL backup strategy is working well. How do I assess backup performance, storage costs, and whether we can actually meet our recovery time objectives?"

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens
10,915 vs 3,119
Latency
2.8m vs 1.2m
Turns
13 vs 23
Tool Calls
7 vs 10
Schema Calls
0 vs 7
Response
1,953 chars vs 5,529 chars
Evaluation
Without Schema: B- vs With Schema: A-
The response without Schema creates automated scripts but doesn't show their content inline, offering a toolkit approach that appears generic rather than PostgreSQL-specific. The response with Schema provides significantly deeper technical guidance with actual SQL queries visible (`postgresql.database.size`, pg_stat_user_tables, WAL replay monitoring), specific tool recommendations (pgBackRest, Barman with performance characteristics), and a practical week-by-week workflow. The Schema-enhanced response demonstrates PostgreSQL expertise through specific metrics, common issues from the knowledge base, and RTO calculation formulas, while the without-Schema response focuses on automation without showing technical depth. Both are actionable, but the with-Schema response would more reliably guide an operator toward PostgreSQL best practices with concrete technical details.
Without Schema
7 tool calls
I'll help you assess your PostgreSQL backup strategy. Let me first check what backup setup you currently have in place. I don't see any existing backup configuration in your directory. Let me create a comprehensive PostgreSQL backup assessment toolkit that will help you evaluate all three areas you mentioned. I've created a comprehensive PostgreSQL backup assessment toolkit for you! Here's what I've prepared:

## Three Key Files:

1. **`backup_assessment.sh`** - Automated testing script that:
   - Measures backup performance (pg_dump, compressed, physical backups)
   - Calculates storage costs for different retention strategies
   - Tests actual recovery time (RTO) by performing real restores
   - Provides a health check against best practices
2. **`backup_monitoring.sql`** - Production monitoring queries for:
   - Database growth tracking
   - WAL archiving status
   - Backup window analysis
   - Point-in-time recovery capability checks
   - Storage optimization opportunities
3. **`README.md`** - Complete documentation with:
   - Quick start guide
   - Metric interpretation
   - Best practices checklist
   - Common issues & solutions
   - Automation examples

## How to Use:

```bash
# 1. Set your database connection
export DB_HOST="localhost"
export DB_NAME="your_database"
export DB_USER="postgres"

# 2. Run complete assessment
chmod +x backup_assessment.sh
./backup_assessment.sh all
```

This will give you concrete answers to your three questions:

- **Performance**: Actual throughput (MB/s), duration, and compression ratios
- **Costs**: Storage requirements and monthly costs for S3 tiers
- **RTO**: Real recovery time from an actual restore test

The script rates your RTO as Excellent (<15 min), Good (<1 hr), Acceptable (<4 hrs), or Poor (>4 hrs) based on actual measurements, not estimates. Would you like me to help you customize this for a specific database, or would you like guidance on interpreting the results?
With Schema
10 tool calls (7 Schema)
I'll help you assess your PostgreSQL backup strategy. Let me get expert guidance on backup performance, storage optimization, and RTO validation. Let me get the detailed scenario information along with key metrics and insights: Let me search for more specific backup and recovery documentation:

Based on the expert guidance from Schema.ai's PostgreSQL knowledge base, here's how to assess your backup strategy comprehensively:

## Key Assessment Areas

### 1. **Backup Performance Assessment**

**Primary metric to track:** `postgresql.database.size` - This determines backup duration and is critical for estimating recovery time objectives.

**What to measure:**

- **Backup duration trends** - Track how long backups take over time as your database grows
- **Backup throughput** - Calculate MB/s or GB/hour to understand if performance degrades
- **I/O impact** - Monitor disk I/O and CPU during backup windows to ensure production isn't affected

**Commands to run:**

```sql
-- Get current database size
SELECT pg_database.datname,
       pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Get individual table sizes (largest tables)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
       pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)
                      - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
```

### 2. **Storage Cost Optimization**

**Key factors:**

- **Compression ratios** - Modern backup tools (pgBackRest, Barman) offer compression; measure actual vs uncompressed sizes
- **Incremental vs full backups** - Track the differential between full and incremental backup sizes
- **Retention policy efficiency** - Balance RTO/RPO requirements against storage costs
- **WAL archive growth** - Monitor WAL generation rate for PITR capability

**Calculate storage costs:**

```sql
-- Estimate daily data change rate
SELECT schemaname, tablename,
       n_tup_ins + n_tup_upd + n_tup_del AS daily_changes,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS current_size
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY daily_changes DESC
LIMIT 20;
```

### 3. **Recovery Time Objective (RTO) Validation**

**Critical: Actually test your backups regularly**

**RTO testing checklist:**

1. **Perform actual restore tests** - Don't just assume backups work
   - Full restore to a test environment
   - Point-in-time recovery (PITR) test
   - Measure actual restore time vs your RTO requirement
2. **Calculate realistic restore times:**
   ```
   Estimated Restore Time = (Database Size / Restore Throughput) + WAL Replay Time + Verification Time
   ```
3. **Benchmark restore performance:**
   - Document baseline: "X GB database restores in Y minutes"
   - Test with different restore methods (pg_basebackup, pgBackRest, Barman, cloud snapshots)
   - Account for network transfer if restoring from remote storage
4. **Monitor WAL replay speed** during recovery:
   ```sql
   -- During recovery, check progress
   SELECT pg_last_wal_receive_lsn(),
          pg_last_wal_replay_lsn(),
          pg_last_xact_replay_timestamp();
   ```

### 4. **Practical Assessment Workflow**

**Week 1: Baseline measurement**

- Document current backup duration
- Measure current database size growth rate
- Calculate current storage costs

**Week 2-3: RTO validation**

- Schedule a restore test during a maintenance window
- Measure actual restore time
- Document any gaps between actual RTO and requirements

**Week 4: Optimization**

- If backups are too slow: consider parallel backup tools (pgBackRest with multiple workers)
- If storage costs are high: tune retention policies, enable compression
- If RTO isn't met: consider streaming replication for faster failover, or optimize the restore process

### 5. **Key Metrics to Track Over Time**

Create a monitoring dashboard with:

- **Backup success rate** (should be 100%)
- **Backup duration trend** (watch for degradation)
- **Database size growth** (capacity planning)
- **WAL generation rate** (MB/hour)
- **Last successful restore test date** (should be < 30 days)
- **Actual vs target RTO** (gap analysis)

### 6. **Common Issues to Watch For**

Based on the Schema.ai knowledge base:

- **Long-running transactions** blocking backup completion
- **Disk space exhaustion** from WAL files during backup
- **Network bandwidth** limitations for remote backups
- **I/O contention** impacting production during backup windows

## Recommended Tools

- **pgBackRest** - Fast parallel backup/restore, compression, excellent for large databases
- **Barman** - Enterprise backup management, good for multiple PostgreSQL servers
- **pg_basebackup** - Built-in tool, good for smaller databases
- **Cloud-native tools** - AWS RDS snapshots, GCP Cloud SQL backups (vendor-specific performance characteristics)

## Next Steps

1. Run the database size queries above to establish your baseline
2. Schedule a restore test within the next 2 weeks
3. Document your findings: actual backup time, restore time, storage used
4. Compare against your RTO/RPO requirements
5. Identify and close any gaps

Would you like help setting up monitoring for any of these metrics, or assistance with planning a restore test?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When validating PostgreSQL backup and recovery readiness, start with the foundation: verify that WAL archiving is healthy and that WAL files are not accumulating excessively, since this underpins point-in-time recovery capability. Then assess whether your backup duration fits your maintenance windows, test that you can actually meet your RTO, and finally optimize retention policies and storage costs.

1. Verify WAL archiving is working reliably
First thing I check is pg_stat_archiver metrics — specifically `failed_count` and `last_archived_time`. If `failed_count` is non-zero or increasing, you have a critical problem that prevents point-in-time recovery. Check that `last_archived_time` is within the last few minutes; if it's stale, your archive destination may be unreachable or out of space. Even a single archive failure can create a gap in your recovery chain, so this must be zero under normal operation. The `archive-log-failures` insight will flag this as critical since it directly risks data loss in disaster scenarios.
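The decision logic in this step can be sketched as a small helper, assuming you have already queried `failed_count` and `last_archived_time` from `pg_stat_archiver`. The function name and the five-minute staleness threshold are illustrative choices, not PostgreSQL defaults:

```python
from datetime import datetime, timedelta, timezone

def archiver_health(failed_count, last_archived_time, now,
                    staleness=timedelta(minutes=5)):
    """Classify pg_stat_archiver readings: any archive failure is critical
    because it can gap the PITR recovery chain; a stale last_archived_time
    suggests the archive destination is unreachable or out of space."""
    if failed_count > 0:
        return "critical: archive failures break the recovery chain"
    if now - last_archived_time > staleness:
        return "warning: archiving is stale; check the archive destination"
    return "ok"

now = datetime.now(timezone.utc)
print(archiver_health(0, now - timedelta(minutes=1), now))  # ok
print(archiver_health(3, now - timedelta(minutes=1), now))
```

In practice you would feed this on a schedule from `SELECT failed_count, last_archived_time FROM pg_stat_archiver;` and alert on anything other than "ok".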
2. Check for excessive WAL accumulation
Next, look at `wal_bytes` generation rate and check the physical WAL directory size on disk. If you see many WAL files piling up (hundreds or thousands), it usually means archiving is too slow, a replica is lagging, or archive_command is failing silently. Without `wal_keep_size` or `max_slot_wal_keep_size` limits, this can fill your disk entirely and crash the database. Count files in pg_wal and correlate with `archived_count` — if archives are succeeding but WAL still accumulates, you have a replication lag issue.
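A minimal sketch of that triage, assuming you have a file count for `pg_wal` (e.g. from `SELECT count(*) FROM pg_ls_waldir();`) and know whether archiving is failing. The 200-file threshold is an illustrative placeholder, not a PostgreSQL default:

```python
def wal_accumulation_diagnosis(wal_file_count, archiving_failing,
                               threshold=200):
    """Rough triage of a growing pg_wal directory.
    threshold is an illustrative file count to tune per deployment."""
    if wal_file_count <= threshold:
        return "ok"
    if archiving_failing:
        return "archiving too slow or archive_command failing silently"
    # Archives succeed yet WAL still piles up: something else retains WAL.
    return "likely replication lag or a stalled replication slot retaining WAL"

print(wal_accumulation_diagnosis(40, False))   # ok
print(wal_accumulation_diagnosis(1500, False))
```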
3. Assess backup duration and throughput
Now evaluate whether backups complete within acceptable windows. Compare `backup-streamed` bytes to `postgresql-database-size` and `postgresql-total-size` to understand backup throughput (e.g., if a 500GB database takes 6 hours, you're getting ~23MB/s). If backup duration exceeds your maintenance window or causes replication lag, you need faster storage, parallel workers, or incremental backups. Also check `checkpoint-write-time` — long checkpoint times during base backups can indicate I/O bottlenecks.
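The throughput arithmetic is worth making explicit. This sketch uses decimal units (1 GB = 1000 MB), which reproduces the ~23 MB/s figure for a 500 GB database backed up in 6 hours:

```python
def backup_throughput_mb_s(db_size_gb, duration_hours):
    """Average backup throughput in MB/s, using 1 GB = 1000 MB."""
    return (db_size_gb * 1000) / (duration_hours * 3600)

# 500 GB backed up in 6 hours:
print(round(backup_throughput_mb_s(500, 6), 1))  # 23.1
```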
4. Validate recovery time objectives are achievable
The hard truth is you don't know your RTO until you test it. Calculate theoretical restore time: base backup restoration time (based on `postgresql-database-size` and restore throughput) plus WAL replay time (based on `wal_bytes` generated since backup start). For a 1TB database with 50GB of WAL, even at 100MB/s, that's 3+ hours. If your RTO is 1 hour, you need a different strategy — possibly hot standby, incremental backups, or partitioning. Document actual restore tests to prove the `recovery-time-objective-achievable` insight.
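A sketch of that restore-time formula (base backup restoration plus WAL replay at a single throughput figure, with verification excluded and decimal units assumed): the 1 TB + 50 GB example at 100 MB/s comes out just under three hours before verification and setup time are added, consistent with the "3+ hours" ballpark above.

```python
def estimated_restore_hours(db_size_gb, wal_gb, throughput_mb_s):
    """Base-backup restore plus WAL replay at one throughput figure;
    verification and setup time excluded; 1 GB = 1000 MB."""
    seconds = (db_size_gb + wal_gb) * 1000 / throughput_mb_s
    return seconds / 3600

# 1 TB database plus 50 GB of WAL at 100 MB/s:
print(round(estimated_restore_hours(1000, 50, 100), 2))  # 2.92
```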
5. Review retention policy alignment and storage costs
Calculate actual storage costs: take `postgresql-total-size`, multiply by your retention period (e.g., 30 days of daily backups = 30x database size), then add WAL archive storage. If you're keeping 90 days of backups but compliance only requires 30, you're wasting 2x storage costs. Many teams set retention policies years ago and never revisit them. The `retention-policy-aligned-with-requirements` and `backup-costs-optimized` insights help identify this waste.
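The retention math can be sketched as follows, assuming uncompressed daily full backups plus retained WAL (all figures illustrative). Note how dropping from 90-day to 30-day retention cuts retained storage to a third:

```python
def backup_storage_gb(db_size_gb, retention_days, daily_wal_gb,
                      wal_retention_days):
    """Total retained storage for uncompressed daily full backups plus WAL."""
    return db_size_gb * retention_days + daily_wal_gb * wal_retention_days

# 500 GB database, 20 GB of WAL per day: 90-day vs 30-day retention
print(backup_storage_gb(500, 90, 20, 90))  # 46800
print(backup_storage_gb(500, 30, 20, 30))  # 15600
```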
6. Perform periodic backup validation tests
Finally, schedule regular restore drills — quarterly at minimum. Automate spinning up a test instance, restoring the latest backup, and running validation queries. I've seen too many teams discover corrupted backups only when disaster strikes. Time these drills and compare against your RTO. The `backup-validation-needed` and `backup-and-recovery-readiness` insights track whether you're actually testing your disaster recovery capability, not just assuming it works.

Technologies

Related Insights

backup duration acceptable for workload
Stub insight created from scenario discovery hint: backup duration acceptable for workload
recovery time objective achievable
Stub insight created from scenario discovery hint: recovery time objective achievable
WAL archiving healthy
Stub insight created from scenario discovery hint: WAL archiving healthy
retention policy aligned with requirements
Stub insight created from scenario discovery hint: retention policy aligned with requirements
Archive log failures risk data loss during point-in-time recovery
diagnostic_step, critical
backup validation needed
diagnostic_step
Stub insight created from scenario discovery hint: backup validation needed
backup and recovery readiness
diagnostic_step
Stub insight created from scenario discovery hint: backup and recovery readiness
backup costs optimized
supporting_evidence
Stub insight created from scenario discovery hint: backup costs optimized
Excessive WAL file accumulation fills disk when replication lags or archiving fails
outcome, critical

Relevant Metrics

Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration