WAL Accumulation and Disk Space Crisis

Critical · Incident Response

Write-Ahead Log (WAL) files are accumulating and consuming disk space, risking a disk-full condition.

Prompt: My PostgreSQL instance on RDS is running out of disk space and I see the WAL directory has grown to 200GB. Why isn't WAL archiving or cleanup happening and how do I fix this urgently?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 1,187 vs 1,159
Latency: 27.3s vs 25.0s
Turns: 2 vs 2
Tool Calls: 0 vs 0
Schema Calls: 0 vs 0
Response: 2,846 chars vs 2,906 chars
Evaluation
Without Schema: B+
With Schema: A-
The response with Schema provides more comprehensive diagnostic queries (4 vs 3), including an explicit archive status check via pg_stat_archiver that the response without Schema doesn't include. It also offers more RDS-specific guidance, mentioning CloudWatch logs for archive errors, S3 bucket permissions, and more detailed replication lag monitoring with lag_bytes calculations. Both responses cover the critical causes (replication slots, long-running transactions) with actionable SQL queries and emergency procedures, but the response with Schema provides better prioritization (e.g., '80% of cases') and mentions PostgreSQL 13+ features like max_slot_wal_keep_size that the other misses. The RDS integration is notably stronger in the response with Schema, which would help operators navigate AWS-specific considerations more effectively.
Without Schema
WAL files accumulating on RDS PostgreSQL is typically caused by **replication slots** or **archiving failures**. Here's how to diagnose and fix it urgently:

## Immediate Diagnosis

Connect to your RDS instance and run these queries:

```sql
-- Check for inactive/lagging replication slots (most common cause)
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;

-- Check for long-running transactions
SELECT pid, usename, state, state_change,
       now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Check current WAL usage
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS current_wal_location;
```

## Common Causes & Fixes

### 1. **Inactive Replication Slots** (Most Common)

If you see slots with `active = false` or large `retained_wal`:

```sql
-- Drop inactive/old slots (CAUTION: ensure they're not needed)
SELECT pg_drop_replication_slot('slot_name_here');
```

**Why this happens**: Logical replication slots (e.g., from CDC tools like Debezium, pglogical, or abandoned read replicas) prevent WAL cleanup until the consumer catches up.

### 2. **Lagging Read Replicas**

Check replica lag in AWS Console or:

```sql
-- On replica:
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

**Fix**: If a replica is hopelessly behind, consider rebuilding it or temporarily removing it.

### 3. **Archive Command Failures** (RDS-specific)

RDS manages archiving automatically, but check:
- Backup retention window in RDS settings
- Any errors in RDS logs (CloudWatch Logs)

### 4. **Prevented by `wal_keep_size`/`wal_keep_segments`**

Check your parameter group:

```sql
SHOW wal_keep_size;      -- PostgreSQL 13+
SHOW wal_keep_segments;  -- PostgreSQL 12 and earlier
```

## Immediate Emergency Actions

If you're at >90% disk:

1. **Increase storage immediately** (RDS allows online resize):
```bash
aws rds modify-db-instance \
  --db-instance-identifier your-db \
  --allocated-storage NEW_SIZE \
  --apply-immediately
```
2. **Drop unused replication slots** (after verifying they're not needed)
3. **Force checkpoint** (limited help, but worth trying):
```sql
CHECKPOINT;
```
4. **Terminate long-running queries** (if found):
```sql
SELECT pg_terminate_backend(pid);
```

## Prevention

- **Monitor replication slots**: Set up CloudWatch alarms on `OldestReplicationSlotLag`
- **Set `max_slot_wal_keep_size`** (PostgreSQL 13+) to prevent unbounded WAL growth
- **Regular slot cleanup**: Remove slots from decommissioned consumers
- **Monitor disk space**: Alert at 70-80% usage

What do those diagnostic queries show? That will tell us exactly which issue you're hitting.
With Schema
I'll help you diagnose and fix this WAL accumulation issue urgently. Here are the most common causes and solutions for RDS PostgreSQL:

## **Immediate Diagnostics**

Run these queries to identify the root cause:

```sql
-- 1. Check for inactive replication slots (MOST COMMON CAUSE)
SELECT slot_name, slot_type, database, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;

-- 2. Check for long-running transactions
SELECT pid, usename, application_name, state,
       now() - xact_start AS transaction_age,
       now() - state_change AS state_age
FROM pg_stat_activity
WHERE state != 'idle' AND xact_start IS NOT NULL
ORDER BY xact_start;

-- 3. Check archive status
SELECT archived_count, failed_count, stats_reset
FROM pg_stat_archiver;

-- 4. Check current WAL usage
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_written;
```

## **Most Likely Causes & Fixes**

### **1. Inactive Replication Slots** (80% of cases)

If you see inactive slots in the first query:

```sql
-- DROP inactive slots immediately
SELECT pg_drop_replication_slot('slot_name_here');
```

**RDS-specific:** Check for Read Replicas that were deleted but left slots behind.

### **2. Long-Running Transactions**

If you see transactions running for hours/days:

```sql
-- Terminate blocking transaction
SELECT pg_terminate_backend(pid_here);
```

### **3. Archive Command Failures**

Check RDS logs in CloudWatch for archive errors. Common fix:
- Verify S3 bucket permissions (if using custom archiving)
- Check `archive_timeout` parameter

### **4. Replication Lag**

```sql
-- Check lag on standby/read replicas
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

If lag is huge, consider temporarily removing the replica.

## **Emergency Space Recovery**

If you need immediate space:

1. **Increase RDS storage** (fastest, safest):
```bash
aws rds modify-db-instance \
  --db-instance-identifier your-db \
  --allocated-storage NEW_SIZE \
  --apply-immediately
```
2. **Force checkpoint**:
```sql
CHECKPOINT;
```
3. **Adjust WAL retention** (RDS parameter group):
   - Set `rds.logical_replication = 0` (if not needed)
   - Lower `wal_keep_size` or `wal_keep_segments`

## **RDS-Specific Commands**

Check current WAL parameters:

```sql
SHOW wal_keep_size;  -- or wal_keep_segments for older versions
SHOW max_wal_size;
SHOW archive_mode;
```

## **Prevent Future Issues**

1. Monitor replication slot lag regularly
2. Set up CloudWatch alarms for:
   - Free storage space < 20%
   - Oldest replication slot lag > threshold
3. Use `max_slot_wal_keep_size` to prevent runaway growth (PostgreSQL 13+)

**What's your first diagnostic query result?** That will tell us exactly which fix to apply.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When WAL files accumulate dangerously on RDS PostgreSQL, the most common culprit is inactive replication slots preventing cleanup — check those first. Next, verify your archiving process isn't silently failing by examining pg_stat_archiver metrics. Only after ruling out slots and archiving should you investigate replication lag or configuration issues like wal_keep_size.

1. Check for inactive or stuck replication slots
This is the #1 cause of WAL accumulation in my experience. Run `SELECT slot_name, active, catalog_xmin, restart_lsn FROM pg_replication_slots;` and look for slots where `active = false` or where `catalog_xmin` is set (indicates a logical replication slot). Inactive slots prevent WAL cleanup indefinitely — a single forgotten slot can consume hundreds of GB. If you find unused slots, drop them immediately with `SELECT pg_drop_replication_slot('slot_name');` and you should see disk space recover within minutes as PostgreSQL recycles old WAL files.
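The slot check above is easy to automate. A minimal sketch of that logic, with hypothetical helper names and an illustrative 1 GiB threshold (adjust to your workload); the input rows would come from joining `pg_replication_slots` with `pg_wal_lsn_diff()` on the primary:

```python
def slots_to_investigate(rows, retained_threshold_bytes=1 << 30):
    """Flag replication slots likely blocking WAL cleanup.

    rows: iterable of (slot_name, active, retained_bytes) tuples,
    where retained_bytes is pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn).
    Returns slot names that are inactive or retain more WAL than the threshold.
    """
    flagged = []
    for slot_name, active, retained_bytes in rows:
        if not active or retained_bytes > retained_threshold_bytes:
            flagged.append(slot_name)
    return flagged
```

For example, an abandoned CDC slot retaining 200 GiB would be flagged while a healthy, nearly caught-up replica slot would not: `slots_to_investigate([("debezium_slot", False, 200 * (1 << 30)), ("replica_a", True, 4 << 20)])` returns `["debezium_slot"]`.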
2. Verify WAL archiving is functioning
Query `pg_stat_archiver` and compare `archived_count` versus `failed_count`. If `failed_count` is climbing, your archive command is failing silently and WAL files are piling up because PostgreSQL won't delete them until they're successfully archived. On RDS, check CloudWatch logs for archive-related errors and verify your S3 bucket permissions, disk space on the archive destination, or network connectivity. A failing archive command is a point-in-time recovery risk as well as a disk space crisis.
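"`failed_count` is climbing" means comparing two samples taken some minutes apart, since a nonzero historical count alone proves nothing. A sketch of that comparison (function name is illustrative; each sample is an `(archived_count, failed_count)` pair read from `pg_stat_archiver`):

```python
def archiver_failing(sample_before, sample_after):
    """Return True if archive failures increased between two samples
    of pg_stat_archiver, i.e. the archive command is currently failing.

    Each sample is a tuple: (archived_count, failed_count).
    """
    failed_delta = sample_after[1] - sample_before[1]
    return failed_delta > 0
```

So `archiver_failing((100, 5), (100, 9))` is `True` (four new failures, no new successes), while `archiver_failing((100, 5), (140, 5))` is `False` (archiving progressed cleanly).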
3. Measure replication lag from primary to replicas
Check `replication_lag_seconds` for all replicas — if any replica is hours or days behind, the primary must retain WAL files until the replica catches up. On the primary, query `SELECT client_addr, state, sync_state, replay_lag FROM pg_stat_replication;` to see lag per replica. If a replica is stuck or lagging badly (e.g., >30 minutes), you may need to rebuild it from a fresh snapshot rather than wait for it to catch up, or temporarily remove it from replication to allow WAL cleanup.
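The lag arithmetic can also be done client-side if you have raw LSN strings rather than `pg_wal_lsn_diff` output. An LSN like `16/B374D848` encodes a 64-bit byte position as two 32-bit hex halves; a sketch of the conversion (helper names are illustrative):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string like '16/B374D848' to an absolute
    byte position: the part before '/' is the high 32 bits, the part
    after is the low 32 bits."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Replication lag in bytes between what the primary has sent
    and what the replica has replayed."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)
```

For example, `lag_bytes("16/B374D848", "16/B3740000")` is `55368` bytes, and `lsn_to_bytes("1/0")` is exactly 2^32.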
4. Review wal_keep_size configuration
Check `SHOW wal_keep_size;` — if this is set to a large value (e.g., 50GB or more), PostgreSQL is explicitly retaining that much WAL even when it's not needed. On RDS, this is sometimes set conservatively high to protect replicas during brief network issues, but if you're not actually using streaming replication or your replicas are healthy, you can reduce it. Compare `wal_keep_size` to your actual `wal_bytes` generation rate to see if it's appropriate for your workload.
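Comparing `wal_keep_size` to your generation rate is simple arithmetic: sample `wal_bytes` twice, derive MB/hour, and see how many hours of headroom the setting actually buys. A sketch with illustrative function names:

```python
def wal_rate_mb_per_hour(wal_bytes_delta: int, hours: float) -> float:
    """WAL generation rate, from the growth in wal_bytes over an interval."""
    return wal_bytes_delta / (1024 ** 2) / hours

def keep_size_coverage_hours(wal_keep_size_mb: float, rate_mb_per_hour: float) -> float:
    """How many hours of WAL generation wal_keep_size can absorb before
    it stops protecting replicas and just wastes disk."""
    return wal_keep_size_mb / rate_mb_per_hour
```

E.g., 600 MiB of WAL written over 2 hours is a 300 MB/hour rate, and a 4096 MB `wal_keep_size` then covers 8 hours at a sustained 512 MB/hour; if your replicas never lag more than minutes, a setting that large is mostly dead weight.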
5. Take immediate action if disk is critically full
If disk usage is above 90%, you're in emergency territory — once disk is 100% full, PostgreSQL can't write WAL and will shut down or reject transactions. On RDS, you can't manually delete WAL files, but you can temporarily increase storage (RDS allows online resize) to buy time while you fix the root cause identified in steps 1-4. Monitor `postgresql.wal_size` closely and expect it to drop significantly once you remove inactive slots or fix archiving failures.
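The 90% trigger above is straightforward to wire into an automated check fed by allocated storage and an RDS `FreeStorageSpace`-style metric. A sketch (function names and the default threshold are illustrative):

```python
def disk_usage_pct(allocated_bytes: int, free_bytes: int) -> float:
    """Percentage of allocated storage currently in use."""
    return 100.0 * (allocated_bytes - free_bytes) / allocated_bytes

def needs_emergency_resize(allocated_bytes: int, free_bytes: int,
                           threshold_pct: float = 90.0) -> bool:
    """True if usage has crossed the emergency threshold and storage
    should be grown while the root cause is being fixed."""
    return disk_usage_pct(allocated_bytes, free_bytes) >= threshold_pct
```

With 8 GB free out of 100 GB allocated, usage is 92% and the check fires; at 20 GB free it does not.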

Related Insights

replication slot preventing WAL cleanup
Stub insight created from scenario discovery hint: replication slot preventing WAL cleanup
wal_keep_size set too high
Stub insight created from scenario discovery hint: wal_keep_size set too high
WAL accumulation causing disk space crisis
diagnostic_step
Stub insight created from scenario discovery hint: WAL accumulation causing disk space crisis
archive command failing
prerequisite
Stub insight created from scenario discovery hint: archive command failing
Inactive replication slots prevent autovacuum from removing dead tuples cluster-wide
prerequisite · critical
Disk full prevents WAL file writes causing transaction failures and server shutdown
outcome · critical
disk full risk
threshold_indicator
Stub insight created from scenario discovery hint: disk full risk
WAL archiving healthy
supporting_evidence
Stub insight created from scenario discovery hint: WAL archiving healthy
WAL Accumulation and Disk Space Crisis
diagnostic_step
Stub insight created from scenario discovery hint: WAL Accumulation and Disk Space Crisis
Archive log failures risk data loss during point-in-time recovery
prerequisite · critical
Excessive WAL file accumulation fills disk when replication lags or archiving fails
prerequisite · critical

Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog
PostgreSQL Prometheus
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)