Backup and Recovery Performance

Proactive Health

Need to validate the backup strategy and confirm that recovery time objectives are being met.

Prompt: "I need to verify our PostgreSQL backup strategy is working well. How do I assess backup performance, storage costs, and whether we can actually meet our recovery time objectives?"

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens
10,915 vs 3,119
Latency
2.8m vs 1.2m
Turns
13 vs 23
Tool Calls
7 vs 10
Schema Calls
0 vs 7
Response
1,953 chars vs 5,529 chars
Evaluation
Without Schema: B- vs With Schema: A-
The response without Schema creates automated scripts but doesn't show their content inline, offering a toolkit approach that appears generic rather than PostgreSQL-specific. The response with Schema provides significantly deeper technical guidance with actual SQL queries visible (`postgresql.database.size`, pg_stat_user_tables, WAL replay monitoring), specific tool recommendations (pgBackRest, Barman with performance characteristics), and a practical week-by-week workflow. The Schema-enhanced response demonstrates PostgreSQL expertise through specific metrics, common issues from the knowledge base, and RTO calculation formulas, while the without-Schema response focuses on automation without showing technical depth. Both are actionable, but the with-Schema response would more reliably guide an operator toward PostgreSQL best practices with concrete technical details.
Without Schema
7 tool calls
I'll help you assess your PostgreSQL backup strategy. Let me first check what backup setup you currently have in place. I don't see any existing backup configuration in your directory. Let me create a comprehensive PostgreSQL backup assessment toolkit that will help you evaluate all three areas you mentioned. I've created a comprehensive PostgreSQL backup assessment toolkit for you! Here's what I've prepared:

## Three Key Files:

1. **`backup_assessment.sh`** - Automated testing script that:
   - Measures backup performance (pg_dump, compressed, physical backups)
   - Calculates storage costs for different retention strategies
   - Tests actual recovery time (RTO) by performing real restores
   - Provides a health check against best practices
2. **`backup_monitoring.sql`** - Production monitoring queries for:
   - Database growth tracking
   - WAL archiving status
   - Backup window analysis
   - Point-in-time recovery capability checks
   - Storage optimization opportunities
3. **`README.md`** - Complete documentation with:
   - Quick start guide
   - Metric interpretation
   - Best practices checklist
   - Common issues & solutions
   - Automation examples

## How to Use:

```bash
# 1. Set your database connection
export DB_HOST="localhost"
export DB_NAME="your_database"
export DB_USER="postgres"

# 2. Run complete assessment
chmod +x backup_assessment.sh
./backup_assessment.sh all
```

This will give you concrete answers to your three questions:

- **Performance**: Actual throughput (MB/s), duration, and compression ratios
- **Costs**: Storage requirements and monthly costs for S3 tiers
- **RTO**: Real recovery time from an actual restore test

The script rates your RTO as Excellent (<15 min), Good (<1 hr), Acceptable (<4 hrs), or Poor (>4 hrs) based on actual measurements, not estimates. Would you like me to help you customize this for a specific database, or would you like guidance on interpreting the results?
With Schema
10 tool calls (7 Schema)
I'll help you assess your PostgreSQL backup strategy. Let me get expert guidance on backup performance, storage optimization, and RTO validation. Let me get the detailed scenario information along with key metrics and insights: Let me search for more specific backup and recovery documentation:

Based on the expert guidance from Schema.ai's PostgreSQL knowledge base, here's how to assess your backup strategy comprehensively:

## Key Assessment Areas

### 1. **Backup Performance Assessment**

**Primary metric to track:** `postgresql.database.size` - This determines backup duration and is critical for estimating recovery time objectives.

**What to measure:**

- **Backup duration trends** - Track how long backups take over time as your database grows
- **Backup throughput** - Calculate MB/s or GB/hour to understand if performance degrades
- **I/O impact** - Monitor disk I/O and CPU during backup windows to ensure production isn't affected

**Commands to run:**

```sql
-- Get current database size
SELECT pg_database.datname,
       pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Get individual table sizes (largest tables)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size,
       pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)
                      - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;
```

### 2. **Storage Cost Optimization**

**Key factors:**

- **Compression ratios** - Modern backup tools (pgBackRest, Barman) offer compression; measure actual vs uncompressed sizes
- **Incremental vs full backups** - Track the differential between full and incremental backup sizes
- **Retention policy efficiency** - Balance RTO/RPO requirements against storage costs
- **WAL archive growth** - Monitor WAL generation rate for PITR capability

**Calculate storage costs:**

```sql
-- Estimate daily data change rate
SELECT schemaname, tablename,
       n_tup_ins + n_tup_upd + n_tup_del AS daily_changes,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS current_size
FROM pg_stat_user_tables
WHERE n_tup_ins + n_tup_upd + n_tup_del > 0
ORDER BY daily_changes DESC
LIMIT 20;
```

### 3. **Recovery Time Objective (RTO) Validation**

**Critical: Actually test your backups regularly**

**RTO testing checklist:**

1. **Perform actual restore tests** - Don't just assume backups work
   - Full restore to a test environment
   - Point-in-time recovery (PITR) test
   - Measure actual restore time vs your RTO requirement
2. **Calculate realistic restore times:**
   ```
   Estimated Restore Time = (Database Size / Restore Throughput) + WAL Replay Time + Verification Time
   ```
3. **Benchmark restore performance:**
   - Document baseline: "X GB database restores in Y minutes"
   - Test with different restore methods (pg_basebackup, pgBackRest, Barman, cloud snapshots)
   - Account for network transfer if restoring from remote storage
4. **Monitor WAL replay speed** during recovery:
   ```sql
   -- During recovery, check progress
   SELECT pg_last_wal_receive_lsn(),
          pg_last_wal_replay_lsn(),
          pg_last_xact_replay_timestamp();
   ```

### 4. **Practical Assessment Workflow**

**Week 1: Baseline measurement**

- Document current backup duration
- Measure current database size growth rate
- Calculate current storage costs

**Week 2-3: RTO validation**

- Schedule a restore test during a maintenance window
- Measure actual restore time
- Document any gaps between actual RTO and requirements

**Week 4: Optimization**

- If backups are too slow: consider parallel backup tools (pgBackRest with multiple workers)
- If storage costs are high: tune retention policies, enable compression
- If RTO isn't met: consider streaming replication for faster failover, or optimize the restore process

### 5. **Key Metrics to Track Over Time**

Create a monitoring dashboard with:

- **Backup success rate** (should be 100%)
- **Backup duration trend** (watch for degradation)
- **Database size growth** (capacity planning)
- **WAL generation rate** (MB/hour)
- **Last successful restore test date** (should be < 30 days)
- **Actual vs target RTO** (gap analysis)

### 6. **Common Issues to Watch For**

Based on the Schema.ai knowledge base:

- **Long-running transactions** blocking backup completion
- **Disk space exhaustion** from WAL files during backup
- **Network bandwidth** limitations for remote backups
- **I/O contention** impacting production during backup windows

## Recommended Tools

- **pgBackRest** - Fast parallel backup/restore, compression, excellent for large databases
- **Barman** - Enterprise backup management, good for multiple PostgreSQL servers
- **pg_basebackup** - Built-in tool, good for smaller databases
- **Cloud-native tools** - AWS RDS snapshots, GCP Cloud SQL backups (vendor-specific performance characteristics)

## Next Steps

1. Run the database size queries above to establish your baseline
2. Schedule a restore test within the next 2 weeks
3. Document your findings: actual backup time, restore time, storage used
4. Compare against your RTO/RPO requirements
5. Identify and close any gaps

Would you like help setting up monitoring for any of these metrics, or assistance with planning a restore test?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When validating PostgreSQL backup and recovery readiness, start with the foundation: verify that WAL archiving is healthy and that WAL files are not accumulating excessively, since this underpins point-in-time recovery capability. Then assess whether your backup duration fits your maintenance windows, test that you can actually meet your RTO, and finally optimize retention policies and storage costs.

1. Verify WAL archiving is working reliably
First thing I check is pg_stat_archiver metrics — specifically `failed_count` and `last_archived_time`. If `failed_count` is non-zero or increasing, you have a critical problem that prevents point-in-time recovery. Check that `last_archived_time` is within the last few minutes; if it's stale, your archive destination may be unreachable or out of space. Even a single archive failure can create a gap in your recovery chain, so this must be zero under normal operation. The `archive-log-failures` insight will flag this as critical since it directly risks data loss in disaster scenarios.
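The decision logic in this step can be sketched as a small helper, assuming you have already queried `failed_count` and `last_archived_time` from `pg_stat_archiver`. The function name and the five-minute staleness threshold are illustrative choices, not PostgreSQL defaults:

```python
from datetime import datetime, timedelta, timezone

def archiver_health(failed_count, last_archived_time, now,
                    staleness=timedelta(minutes=5)):
    """Classify pg_stat_archiver readings: any archive failure is critical
    because it can gap the PITR recovery chain; a stale last_archived_time
    suggests the archive destination is unreachable or out of space."""
    if failed_count > 0:
        return "critical: archive failures break the recovery chain"
    if now - last_archived_time > staleness:
        return "warning: archiving is stale; check the archive destination"
    return "ok"

now = datetime.now(timezone.utc)
print(archiver_health(0, now - timedelta(minutes=1), now))  # ok
print(archiver_health(3, now - timedelta(minutes=1), now))
```

In practice you would feed this on a schedule from `SELECT failed_count, last_archived_time FROM pg_stat_archiver;` and alert on anything other than "ok".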
2. Check for excessive WAL accumulation
Next, look at `wal_bytes` generation rate and check the physical WAL directory size on disk. If you see many WAL files piling up (hundreds or thousands), it usually means archiving is too slow, a replica is lagging, or archive_command is failing silently. Without `wal_keep_size` or `max_slot_wal_keep_size` limits, this can fill your disk entirely and crash the database. Count files in pg_wal and correlate with `archived_count` — if archives are succeeding but WAL still accumulates, you have a replication lag issue.
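A minimal sketch of that triage, assuming you have a file count for `pg_wal` (e.g. from `SELECT count(*) FROM pg_ls_waldir();`) and know whether archiving is failing. The 200-file threshold is an illustrative placeholder, not a PostgreSQL default:

```python
def wal_accumulation_diagnosis(wal_file_count, archiving_failing,
                               threshold=200):
    """Rough triage of a growing pg_wal directory.
    threshold is an illustrative file count to tune per deployment."""
    if wal_file_count <= threshold:
        return "ok"
    if archiving_failing:
        return "archiving too slow or archive_command failing silently"
    # Archives succeed yet WAL still piles up: something else retains WAL.
    return "likely replication lag or a stalled replication slot retaining WAL"

print(wal_accumulation_diagnosis(40, False))   # ok
print(wal_accumulation_diagnosis(1500, False))
```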
3. Assess backup duration and throughput
Now evaluate whether backups complete within acceptable windows. Compare `backup-streamed` bytes to `postgresql-database-size` and `postgresql-total-size` to understand backup throughput (e.g., if a 500GB database takes 6 hours, you're getting ~23MB/s). If backup duration exceeds your maintenance window or causes replication lag, you need faster storage, parallel workers, or incremental backups. Also check `checkpoint-write-time` — long checkpoint times during base backups can indicate I/O bottlenecks.
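The throughput arithmetic is worth making explicit. This sketch uses decimal units (1 GB = 1000 MB), which reproduces the ~23 MB/s figure for a 500 GB database backed up in 6 hours:

```python
def backup_throughput_mb_s(db_size_gb, duration_hours):
    """Average backup throughput in MB/s, using 1 GB = 1000 MB."""
    return (db_size_gb * 1000) / (duration_hours * 3600)

# 500 GB backed up in 6 hours:
print(round(backup_throughput_mb_s(500, 6), 1))  # 23.1
```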
4. Validate recovery time objectives are achievable
The hard truth is you don't know your RTO until you test it. Calculate theoretical restore time: base backup restoration time (based on `postgresql-database-size` and restore throughput) plus WAL replay time (based on `wal_bytes` generated since backup start). For a 1TB database with 50GB of WAL, even at 100MB/s, that's 3+ hours. If your RTO is 1 hour, you need a different strategy — possibly hot standby, incremental backups, or partitioning. Document actual restore tests to prove the `recovery-time-objective-achievable` insight.
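A sketch of that restore-time formula (base backup restoration plus WAL replay at a single throughput figure, with verification excluded and decimal units assumed): the 1 TB + 50 GB example at 100 MB/s comes out just under three hours before verification and setup time are added, consistent with the "3+ hours" ballpark above.

```python
def estimated_restore_hours(db_size_gb, wal_gb, throughput_mb_s):
    """Base-backup restore plus WAL replay at one throughput figure;
    verification and setup time excluded; 1 GB = 1000 MB."""
    seconds = (db_size_gb + wal_gb) * 1000 / throughput_mb_s
    return seconds / 3600

# 1 TB database plus 50 GB of WAL at 100 MB/s:
print(round(estimated_restore_hours(1000, 50, 100), 2))  # 2.92
```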
5. Review retention policy alignment and storage costs
Calculate actual storage costs: take `postgresql-total-size`, multiply by your retention period (e.g., 30 days of daily backups = 30x database size), then add WAL archive storage. If you're keeping 90 days of backups but compliance only requires 30, you're wasting 2x storage costs. Many teams set retention policies years ago and never revisit them. The `retention-policy-aligned-with-requirements` and `backup-costs-optimized` insights help identify this waste.
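The retention math can be sketched as follows, assuming uncompressed daily full backups plus retained WAL (all figures illustrative). Note how dropping from 90-day to 30-day retention cuts retained storage to a third:

```python
def backup_storage_gb(db_size_gb, retention_days, daily_wal_gb,
                      wal_retention_days):
    """Total retained storage for uncompressed daily full backups plus WAL."""
    return db_size_gb * retention_days + daily_wal_gb * wal_retention_days

# 500 GB database, 20 GB of WAL per day: 90-day vs 30-day retention
print(backup_storage_gb(500, 90, 20, 90))  # 46800
print(backup_storage_gb(500, 30, 20, 30))  # 15600
```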
6. Perform periodic backup validation tests
Finally, schedule regular restore drills — quarterly at minimum. Automate spinning up a test instance, restoring the latest backup, and running validation queries. I've seen too many teams discover corrupted backups only when disaster strikes. Time these drills and compare against your RTO. The `backup-validation-needed` and `backup-and-recovery-readiness` insights track whether you're actually testing your disaster recovery capability, not just assuming it works.

Technologies

Related Insights

backup duration acceptable for workload
Stub insight created from scenario discovery hint: backup duration acceptable for workload
recovery time objective achievable
Stub insight created from scenario discovery hint: recovery time objective achievable
WAL archiving healthy
Stub insight created from scenario discovery hint: WAL archiving healthy
retention policy aligned with requirements
Stub insight created from scenario discovery hint: retention policy aligned with requirements
Archive log failures risk data loss during point-in-time recovery
diagnostic_step, critical
backup validation needed
diagnostic_step
Stub insight created from scenario discovery hint: backup validation needed
backup and recovery readiness
diagnostic_step
Stub insight created from scenario discovery hint: backup and recovery readiness
backup costs optimized
supporting_evidence
Stub insight created from scenario discovery hint: backup costs optimized
Excessive WAL file accumulation fills disk when replication lags or archiving fails
outcome, critical

Relevant Metrics

Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration