Disk space exhaustion from WAL files

Critical | Incident Response

Write-Ahead Log directory growing uncontrollably due to archiving failures or replication issues, threatening database availability

Prompt: My PostgreSQL pg_wal directory is consuming 80% of available disk space and growing. I have replication configured—how do I safely clean this up without breaking replication or causing data loss?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens
1,273 vs 1,355
Latency
25.4s vs 24.6s
Turns
2 vs 2
Tool Calls
0 vs 0
Schema Calls
0 vs 0
Response
3,427 chars vs 3,755 chars
Evaluation
Without Schema
B+
vs
With Schema
A-
Both responses provide comprehensive diagnostic queries and actionable cleanup procedures. The response with Schema includes a more explicit emergency procedure section with bash commands for critical disk space situations, adds the pg_ls_waldir() diagnostic query for direct WAL size checking, demonstrates the pg_replication_slot_advance() technique, and provides a monitoring view for ongoing prevention. The response without Schema has a clearer consolidated safety warnings section and is well-organized, but lacks the emergency handling depth and monitoring code. Both are technically accurate, with the Schema-enhanced version offering marginally better completeness in crisis scenarios and long-term prevention.
Without Schema
I'll help you safely address this PostgreSQL WAL growth issue. Let me provide a diagnostic and cleanup approach that protects your replication setup.

## First, let's diagnose the root cause:

**1. Check replication slot status:**

```sql
SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

**2. Check replication lag:**

```sql
-- On primary:
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sending_lag,
       pg_wal_lsn_diff(sent_lsn, write_lsn) AS receiving_lag,
       pg_wal_lsn_diff(write_lsn, flush_lsn) AS flushing_lag
FROM pg_stat_replication;
```

**3. Check archive status (if archiving is configured):**

```sql
SELECT archived_count, failed_count, last_failed_time, last_failed_wal
FROM pg_stat_archiver;
```

## Common causes and safe fixes:

### **Cause 1: Inactive/Dead Replication Slot**

If you find an inactive slot that's retaining WAL:

```sql
-- ONLY if the slot is truly dead and won't be used:
SELECT pg_drop_replication_slot('slot_name');
```

⚠️ **Warning**: Only drop slots for replicas you've permanently removed.

### **Cause 2: Replication Lag**

If replicas are falling behind:

- **Check network connectivity** to replicas
- **Check replica performance** (CPU, disk I/O)
- **Consider temporarily removing slow replicas** if critical

### **Cause 3: Archive Command Failures**

If `failed_count` is high:

```bash
# Check PostgreSQL logs for archive failures
tail -f /var/log/postgresql/postgresql-*.log | grep archive
```

- Fix the archive destination (permissions, disk space, connectivity)
- Once fixed, PostgreSQL will automatically clean up archived WAL

### **Cause 4: Configuration Issues**

Check these settings in `postgresql.conf`:

```sql
SHOW wal_keep_size; -- or wal_keep_segments on older versions
SHOW max_slot_wal_keep_size;
SHOW archive_mode;
SHOW archive_command;
```

## Safe cleanup steps:

**Option A: Let PostgreSQL clean up naturally (SAFEST)**

```sql
-- Once you've fixed the root cause, force a checkpoint:
CHECKPOINT;
-- PostgreSQL will automatically remove old WAL files
-- that are no longer needed
```

**Option B: Temporarily remove problematic replica (if critical)**

```sql
-- 1. Stop the replica's PostgreSQL service
-- 2. Drop its replication slot on the primary
SELECT pg_drop_replication_slot('replica_slot_name');
-- 3. After cleanup, rebuild the replica using pg_basebackup
```

**Option C: Increase disk space temporarily** while you investigate (if possible)

## What NOT to do:

❌ **Never manually delete files from pg_wal** - this causes corruption
❌ **Never drop active replication slots** - breaks replication
❌ **Never use `pg_resetwal`** unless recovering from catastrophic failure

## Immediate action plan:

1. **Run the diagnostic queries above** to identify the cause
2. **Check if any slots are inactive** (`active = false`) for dead replicas
3. **Verify archive_command is working** (if archiving is enabled)
4. **Force a checkpoint** after fixing issues: `CHECKPOINT;`
5. **Monitor WAL size** over the next hour to confirm cleanup

Would you like me to help you create a script to diagnose your specific situation? I can write a comprehensive check script that you can run against your PostgreSQL instance to identify the exact cause.
With Schema
I'll help you safely address this PostgreSQL WAL accumulation issue. This is commonly caused by replication slots retaining WAL files for lagging replicas. Let me guide you through diagnosis and safe cleanup.

## First, diagnose the root cause

Run these diagnostic queries to understand what's happening:

```sql
-- Check replication slot status and how far behind they are
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Check current WAL directory size
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();

-- Check if archiving is working (if enabled)
SELECT archived_count, failed_count, last_failed_time FROM pg_stat_archiver;

-- Check replication lag
SELECT client_addr, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag
FROM pg_stat_replication;
```

## Safe cleanup steps

**Option 1: If you have inactive/stale replication slots** (most common cause)

```sql
-- List all replication slots
SELECT * FROM pg_replication_slots;

-- Drop inactive slots that are no longer needed
-- ⚠️ ONLY if you're sure the standby is decommissioned
SELECT pg_drop_replication_slot('slot_name_here');
```

**Option 2: If replicas are lagging but still needed**

```bash
# On the standby server(s), check if replication is actually running
# If the standby is down or stuck, restart it

# Check PostgreSQL logs on standby for errors
tail -f /var/log/postgresql/postgresql-*.log

# Restart the standby if needed
sudo systemctl restart postgresql
```

**Option 3: Temporarily advance stuck slots** (use with caution)

```sql
-- Only if a replica is permanently unavailable and you need immediate relief
-- This will cause that replica to need resyncing
SELECT pg_replication_slot_advance('slot_name', pg_current_wal_lsn());
```

**Option 4: Adjust WAL retention settings**

```sql
-- Check current settings
SHOW wal_keep_size; -- or wal_keep_segments in older versions
SHOW max_wal_size;

-- Reduce retention if set too high (reloadable; no restart needed)
ALTER SYSTEM SET wal_keep_size = '2GB'; -- adjust as needed
SELECT pg_reload_conf();
```

## Emergency procedure if disk is critically full

If you're at immediate risk of running out of disk space:

```bash
# 1. Limit new connections (max_connections requires a restart, so use a
#    per-database connection limit, which takes effect immediately)
sudo -u postgres psql -c "ALTER DATABASE your_database CONNECTION LIMIT 10;"

# 2. Identify the problematic slot
sudo -u postgres psql -c "SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots ORDER BY restart_lsn;"

# 3. Drop the inactive slot OR advance it
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('problem_slot');"

# 4. Force a checkpoint to clean up WAL
sudo -u postgres psql -c "CHECKPOINT;"

# 5. Restore connections
sudo -u postgres psql -c "ALTER DATABASE your_database CONNECTION LIMIT -1;"
```

## Prevention going forward

Add monitoring:

```sql
-- Create a monitoring view to alert when WAL grows large
CREATE OR REPLACE VIEW wal_monitoring AS
SELECT
  (SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir()) AS current_wal_size,
  (SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
   FROM pg_replication_slots
   WHERE restart_lsn IS NOT NULL
   ORDER BY restart_lsn LIMIT 1) AS oldest_slot_retention;
```

**What information would be most helpful?**

- What does your `pg_replication_slots` query show?
- Are your replicas actively connected?
- Do you have archiving enabled (`archive_mode = on`)?

This will help me give you more specific guidance for your situation.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When investigating uncontrolled WAL directory growth with replication, start by checking for orphaned or inactive replication slots — the most common cause of runaway WAL retention. Then verify archive command health and replication lag before reviewing WAL retention settings. Only after identifying the root cause should you attempt cleanup to avoid breaking replication or losing data.

1. Check for orphaned or inactive replication slots
Query `pg_replication_slots` to identify slots with `active = false` or where `restart_lsn` is far behind current WAL position — these force PostgreSQL to retain WAL indefinitely. Check `replication_slot_wal_status` for 'reserved' or 'extended' status and `replication_slot_safe_wal_size` to see how much WAL is being held. In my experience, orphaned slots from old replicas or failed logical replication consumers are the #1 cause of 100+ GB WAL bloat. Drop any slots that are no longer needed before doing anything else.
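This step can be sketched as a single query (the `wal_status` and `safe_wal_size` columns assume PostgreSQL 13 or later; drop them on older versions):

```sql
-- Find slots holding back WAL: inactive slots, or slots far behind current WAL
SELECT slot_name,
       slot_type,
       active,
       wal_status,                      -- 'reserved', 'extended', 'unreserved', or 'lost' (PG 13+)
       pg_size_pretty(safe_wal_size) AS safe_wal_size,  -- headroom before the slot is lost (PG 13+)
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;
```

Any slot showing `active = false` with a large `retained_wal` is a cleanup candidate; confirm its consumer is truly decommissioned before dropping it.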
2. Verify archive command health
Check `pg_stat_archiver` for failures — if `postgresql.archiver.failed_count` is rising or `last_failed_time` is recent while `last_archived_time` is stale, your archive process is broken and WAL cannot be cleaned up. Compare `postgresql.archiver.archived_count` growth rate to WAL generation rate (`postgresql.wal.bytes`) to confirm archiving is keeping pace. Archive failures are the second most common cause of WAL bloat — verify your archive destination has space, proper permissions, and network connectivity before attempting any cleanup.
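The archiver health check above maps directly onto the standard `pg_stat_archiver` view; a minimal sketch:

```sql
-- Archiving is broken if failures are recent while successes are stale
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time,
       last_failed_time > last_archived_time AS archiving_currently_failing
FROM pg_stat_archiver;
```

If `archiving_currently_failing` is true, fix the archive destination before any cleanup; stuck WAL cannot be recycled until the failed segment archives successfully.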
3. Check replication lag on all standby servers
Query `pg_stat_replication` to check `replay_lag` and `write_lag` for all connected replicas — if any replica is hours or days behind, PostgreSQL must retain WAL until it catches up. Check `postgresql.wal.age` to see how old your oldest WAL file is — if it's days old, you have a lagging or stuck replica. A single slow replica with `wal_sender_timeout` set too high can cause the entire WAL directory to grow until disk is exhausted.
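The lag check can be expressed directly against `pg_stat_replication` (the `write_lag`/`flush_lag`/`replay_lag` interval columns assume PostgreSQL 10+):

```sql
-- Per-replica lag, both in bytes of WAL and wall-clock time
SELECT application_name,
       client_addr,
       state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_byte_lag,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication
ORDER BY replay_lsn;
```

A replica that appears here but with hours of `replay_lag` is alive but slow; a replica missing from this view entirely (while its slot remains in `pg_replication_slots`) is disconnected and is the likelier cause of retention.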
4. Quantify WAL directory bloat
Run `du -sh /var/lib/postgresql/*/pg_wal/` to get total WAL directory size and `ls /var/lib/postgresql/*/pg_wal/ | wc -l` to count files — healthy systems typically keep 20-100 WAL files (3-16 GB at default 16 MB per file), while bloated systems can accumulate thousands of files consuming hundreds of GB. Compare `postgresql.wal_size` to your database size (`postgresql.database.size`) — if WAL is 5-10x larger than your actual database, you definitely have a retention problem from slots, archiving, or replication lag identified in previous steps.
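The same measurement can be taken from inside the database without filesystem access (a sketch; `pg_ls_waldir()` requires superuser or the `pg_monitor` role):

```sql
-- WAL file count and total size, straight from the server
SELECT count(*)                  AS wal_file_count,
       pg_size_pretty(sum(size)) AS wal_total_size
FROM pg_ls_waldir();

-- Total database size, for the WAL-to-data ratio comparison
SELECT pg_size_pretty(sum(pg_database_size(datname))) AS total_db_size
FROM pg_database;
```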
5. Review WAL retention configuration
Check your `wal_keep_size` (or `wal_keep_segments` in older versions) and `max_wal_size` settings in `postgresql.conf` — if `wal_keep_size` is set to an enormous value or unlimited, adjust it to something reasonable like 5-10 GB to prevent future bloat. For replication slots specifically, consider setting `max_slot_wal_keep_size` to limit retention even when slots are inactive, which will cause replication to fail rather than filling your disk. This setting provides a critical safety valve when replication consumers disappear unexpectedly.
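A sketch of the safety-valve configuration described above (the sizes are illustrative; tune them to your disk headroom):

```sql
-- Cap WAL retained by slots: a slot exceeding this is invalidated
-- rather than filling the disk (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

-- Keep a bounded amount of WAL for streaming replicas without slots
ALTER SYSTEM SET wal_keep_size = '5GB';

-- Both settings are reloadable; no restart required
SELECT pg_reload_conf();
```

Note the trade-off: with `max_slot_wal_keep_size` set, a disappeared consumer breaks its own replication (`wal_status = 'lost'`) instead of taking down the primary, which is usually the right failure mode.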
6. Plan safe cleanup after fixing root cause
Only after fixing the root cause (dropping orphaned slots, fixing archive command, or resolving replication lag) should you attempt cleanup — PostgreSQL will automatically remove old WAL files during checkpoints once nothing is holding them. If you're above 90% disk and need emergency space, you can manually remove old WAL files, but ONLY ones older than the oldest `restart_lsn` in `pg_replication_slots` and the last archived WAL shown in `pg_stat_archiver`. Never delete WAL files while the underlying problem remains unfixed, or you'll just fill the disk again within hours.
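Once the root cause is fixed, cleanup itself is simple; a sketch, with verification queries to confirm nothing is still pinning old WAL:

```sql
-- Verify nothing is still holding WAL back
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
SELECT last_archived_wal, last_failed_wal FROM pg_stat_archiver;

-- Trigger a checkpoint so the server recycles or removes old WAL segments
CHECKPOINT;

-- Confirm the WAL directory is shrinking
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```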

Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL CloudWatch Metrics