Disk space exhaustion from WAL files
critical · Incident Response
Write-Ahead Log directory growing uncontrollably due to archiving failures or replication issues, threatening database availability
Prompt: “My PostgreSQL pg_wal directory is consuming 80% of available disk space and growing. I have replication configured—how do I safely clean this up without breaking replication or causing data loss?”
With vs Without Schema
Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.
Metrics (without Schema vs with Schema):
Tokens: 1,801 vs 2,518
Latency: 33.1s vs 55.4s
Turns: 5 vs 22
Tool Calls: 1 vs 10
Schema Calls: 0 vs 6
Response: 3,331 chars vs 3,102 chars
Evaluation
Without Schema: B+ vs With Schema: A-
The response without Schema provides comprehensive coverage of multiple potential causes (inactive slots, replication lag, archive failures, wal_keep_size settings) with extensive diagnostic queries. However, it's somewhat broad and doesn't clearly prioritize the most common culprit. The response with Schema is more focused and strategic, immediately homing in on replication slots as the primary cause and providing a clearer decision framework with explicit verification criteria. It also cites official PostgreSQL documentation sources and mentions newer preventive features like max_slot_wal_keep_size. Both responses require diagnostic output to proceed, but the Schema-augmented response provides a more streamlined path to resolution with authoritative backing.
Without Schema
1 tool call
I'll help you safely address this pg_wal growth issue. This is typically caused by WAL files not being cleaned up due to replication lag or archiving issues. Let's diagnose and fix it carefully.
Let me provide the diagnostic commands and cleanup steps directly:
## **Step 1: Diagnose the Root Cause**
Run these commands to identify why WAL is accumulating:
```bash
# Check replication status
psql -U postgres -c "SELECT * FROM pg_stat_replication;"
# Check replication slots (CRITICAL - this is often the culprit)
psql -U postgres -c "SELECT slot_name, slot_type, active, restart_lsn,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;"
# Check archive status
psql -U postgres -c "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal
FROM pg_stat_archiver;"
# Check WAL settings
psql -U postgres -c "SHOW wal_keep_size; SHOW max_wal_size; SHOW archive_command;"
```
## **Common Causes & Solutions:**
### **1. Inactive Replication Slot (Most Common)**
If you see an **inactive** replication slot with large `retained_wal`:
```sql
-- Find inactive slots and how much WAL each one retains
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active;
-- If the replica is permanently gone, DROP the slot:
SELECT pg_drop_replication_slot('slot_name_here');
```
### **2. Replication Lag**
If replicas are active but lagging:
```sql
-- Check lag on each replica
SELECT client_addr, state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
**Fix**: Investigate why replicas are slow (network, disk I/O, long queries)
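If replay lag traces back to long queries on the standby, a quick check run on the replica can surface them. This is only a sketch; the 5-minute threshold is illustrative:
```sql
-- Run on the standby: queries running longer than an illustrative 5-minute threshold
SELECT pid,
       state,
       now() - query_start AS runtime,
       left(query, 80)     AS query_snippet
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
```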
### **3. Archive Command Failures**
If `failed_count` is high in `pg_stat_archiver`:
- Check `last_failed_wal` for error details
- Verify archive destination has space
- Test archive_command manually
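One way to test the archive path by hand is to mimic what your archive_command does. This sketch assumes a plain `cp` to a local mount; every path and segment name is a placeholder, and your real archive_command may use rsync, object storage, or a backup tool instead:
```bash
# Illustrative only: paths, the segment name, and the destination are placeholders.
# 1. Confirm the archive destination is reachable and has free space
df -h /mnt/wal_archive

# 2. Copy one existing WAL segment by hand as the postgres OS user
sudo -u postgres cp /var/lib/postgresql/data/pg_wal/000000010000000000000042 \
  /mnt/wal_archive/manual_test && echo "destination is writable"

# 3. Remove the test copy so it is not mistaken for a real archived segment
sudo -u postgres rm /mnt/wal_archive/manual_test
```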
### **4. wal_keep_size Too High**
If set excessively high:
```sql
-- Reduce it (requires reload, not restart)
ALTER SYSTEM SET wal_keep_size = '2GB';
SELECT pg_reload_conf();
```
## **Step 2: Safe Cleanup**
**DO NOT manually delete files from pg_wal!** PostgreSQL manages this automatically once the root cause is fixed.
**Safe approach:**
1. **Fix the root cause** (drop inactive slots, fix archive issues)
2. **Force a checkpoint** to trigger cleanup:
```sql
CHECKPOINT;
```
3. **Wait 5-10 minutes** - PostgreSQL will auto-remove old WAL files
4. **Verify**:
```bash
watch -n 10 'du -sh /var/lib/postgresql/data/pg_wal'
```
## **Emergency: If Disk is 95%+ Full**
Only if critically full and you've confirmed no active replicas need the WAL:
```sql
-- Temporarily reduce wal_keep_size
ALTER SYSTEM SET wal_keep_size = '512MB';
SELECT pg_reload_conf();
CHECKPOINT;
```
**Monitor** your replicas closely after this to ensure they don't break.
## **Prevention:**
```sql
-- Set up slot-based replication with max_slot_wal_keep_size (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
-- Monitor with alerts
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
FROM pg_replication_slots WHERE NOT active;
```
**What's the output of the diagnostic queries?** That will tell us exactly what's causing your WAL accumulation, and I can provide specific next steps.
With Schema
10 tool calls · 6 Schema calls
I'll get you expert guidance on safely cleaning up your PostgreSQL WAL directory while preserving replication.
Your pg_wal bloat is almost certainly caused by **inactive or lagging replication slots** preventing WAL file cleanup. Here's how to safely resolve this:
## Immediate Diagnostics
**1. Check replication slot status:**
```sql
SELECT slot_name,
slot_type,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
age(xmin) AS xmin_age,
age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```
This shows which slots are retaining WAL and how much disk space they're holding.
**2. Cross-reference with active replication:**
```sql
SELECT pid, client_hostname, state, backend_xmin
FROM pg_stat_replication
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
```
## Safe Cleanup Strategy
**If you find inactive slots (active = false):**
These are slots for replicas that are disconnected or no longer exist. They're the primary cause of WAL bloat.
⚠️ **Before dropping a slot**, verify:
- Is the replica permanently gone? → Safe to drop
- Is it temporarily down but will reconnect? → **Do NOT drop** (replica will need full rebuild)
- Is it for a downstream system you no longer use? → Safe to drop
**To drop an inactive slot:**
```sql
SELECT pg_drop_replication_slot('slot_name_here');
```
**After dropping slots**, WAL files will be cleaned up automatically during the next checkpoint cycle.
**If slots are active but lagging significantly:**
The replica is connected but falling behind. Check:
- Network issues between primary and replica
- Replica resource constraints (CPU, disk I/O)
- Large transactions blocking replication apply
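To localise which of these is the bottleneck, the time-based lag columns in `pg_stat_replication` (PostgreSQL 10+) give a rough split: `write_lag` roughly tracks the network, `flush_lag` the standby's disk, and `replay_lag` the apply process. A sketch:
```sql
-- Roughly localise the bottleneck per standby (PostgreSQL 10+)
SELECT application_name,
       client_addr,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication
ORDER BY replay_lag DESC NULLS LAST;
```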
## Preventive Configuration
Set limits to auto-invalidate problematic slots (PostgreSQL 13+):
```sql
-- Limit WAL retention to 100GB per slot
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();
```
This prevents any single slot from consuming unlimited disk space.
## Emergency Disk Space Recovery
If you're critically low on disk and need immediate relief:
**DO NOT** manually delete files from pg_wal — this causes corruption.
Instead:
1. Drop inactive slots first (safest)
2. Consider temporarily dropping less-critical active slots (requires replica rebuild)
3. Force a checkpoint: `CHECKPOINT;`
4. Monitor: `SELECT * FROM pg_ls_waldir() ORDER BY modification DESC;`
## Critical Warning
Dropping an active replication slot **breaks that replica's connection**. The replica will need to be rebuilt from a fresh base backup using `pg_basebackup`. Only drop active slots if you're prepared to rebuild.
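For reference, a standby rebuild typically looks like the sketch below; the hostname, paths, replication role, and slot name are placeholders for your environment, and a PostgreSQL 12+ layout (where `-R` writes `standby.signal` and connection settings) is assumed:
```bash
# Hypothetical standby rebuild; every name and path below is a placeholder.
systemctl stop postgresql
mv /var/lib/postgresql/data /var/lib/postgresql/data.old   # keep until the new standby is verified
pg_basebackup -h primary.example.com -U replicator \
  -D /var/lib/postgresql/data \
  -X stream -R -P \
  --slot=replica1_slot --create-slot
chown -R postgres:postgres /var/lib/postgresql/data
systemctl start postgresql
```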
**Source documentation:**
- [PostgreSQL Routine Vacuuming](https://www.postgresql.org/docs/current/routine-vacuuming.html)
- [Debugging autovacuum problems](https://www.citusdata.com/blog/2022/07/28/debugging-postgres-autovacuum-problems-13-tips)
Run the diagnostic queries first and share the output if you need help deciding which slots are safe to drop.
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing WAL directory bloat in PostgreSQL with replication, start by identifying inactive or orphaned replication slots — the most common cause of unbounded WAL retention. Then check whether archiving is failing or active replicas are severely lagged. Only after understanding the root cause can you safely clean up WAL files without breaking replication or losing data.
1. Check for inactive or orphaned replication slots
Query `pg_replication_slots` to identify slots that are inactive or have consumers that never reconnected. Check `replication_slot_wal_status` — if it shows 'reserved' or 'extended', that slot is holding WAL files. Look at `replication_slot_restart_delay_bytes` to see how much WAL each slot is retaining. In my experience, this is the #1 cause of WAL bloat: a slot created for a replica or logical subscription that was decommissioned without dropping the slot, causing WAL to accumulate indefinitely (sometimes 900+ GB as seen in the `wal-directory-bloat-orphaned-slots` insight).
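A query along these lines covers this step. The `replication_slot_wal_status` metric named above presumably corresponds to the `wal_status` column of `pg_replication_slots`; the sketch assumes PostgreSQL 13+ for `wal_status` and `safe_wal_size`:
```sql
-- Slots ordered by how much WAL they pin (wal_status / safe_wal_size need PostgreSQL 13+)
SELECT slot_name,
       slot_type,
       active,
       wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```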
2. Check archive command status and failures
Query `pg_stat_archiver` and compare `archived_count` versus `failed_count` — if `failed_count` is climbing, your archive command is failing and WAL files cannot be cleaned up. Check `last_archived_time` and `last_failed_time` to see when the problem started. Common causes include insufficient storage at the archive destination, permission issues, or network failures. As noted in the `archive-log-failures` insight, this directly prevents WAL cleanup and risks point-in-time recovery capabilities.
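A minimal version of that check (assuming superuser or pg_monitor access):
```sql
-- A growing failed_count with a stale last_archived_time points at a broken archive_command
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;

SHOW archive_command;   -- the command PostgreSQL is actually running
```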
3. Measure replication lag on active slots
For any slots that ARE active (state = 'active' in `pg_replication_slots`), check `replication_lag_seconds` to see if replicas are keeping up with the primary. If lag is measured in hours or days, WAL files must be retained until the replica catches up. Cross-reference with `postgresql.wal.age` to understand how far behind the oldest retained WAL file is. If replication is healthy but WAL still accumulates, you likely have an inactive slot or archiving issue.
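One way to tie active slots to their connected standbys is sketched below; `replication_lag_seconds` is a monitoring-metric name, and the closest catalog equivalent used here is `replay_lag` from `pg_stat_replication` (PostgreSQL 10+):
```sql
-- WAL retained by each active slot, plus the connected standby's time-based lag
SELECT s.slot_name,
       s.active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)) AS wal_retained,
       r.application_name,
       r.replay_lag
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
WHERE s.active
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn) DESC;
```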
4. Assess immediate disk pressure and WAL generation rate
Check current disk usage with `df -h` and WAL directory size with `du -sh /var/lib/postgresql/*/pg_wal/`. Look at `postgresql.wal.bytes` to understand your WAL generation rate — if you're generating 10GB/hour and have 20% disk remaining, you have ~2 hours to act. Count files in pg_wal to understand the backlog. The `disk-full-prevents-wal-writes` insight warns that once disk is full, PostgreSQL cannot write new WAL and will shut down, so calculate your runway carefully.
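A rough shell-level version of this check, with paths matching the example above (adjust to your data directory):
```bash
# Paths are placeholders; adjust the glob to your installation's data directory.
df -h /var/lib/postgresql                    # overall headroom on the volume
du -sh /var/lib/postgresql/*/pg_wal/         # current size of the WAL directory
ls /var/lib/postgresql/*/pg_wal/ | wc -l     # rough segment count (includes archive_status)

# Rough runway: take two WAL-size readings an hour apart and divide the free space
# reported by df by that hourly growth.
```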
5. Drop inactive slots or fix archiving before manual cleanup
Based on findings, take targeted action: if you found inactive slots (step 1), drop them immediately with `SELECT pg_drop_replication_slot('slot_name')` — WAL cleanup will happen automatically within minutes. If archiving is failing (step 2), fix the archive_command and verify successful archiving before proceeding. Never manually delete WAL files if you have active replication slots or archiving enabled — you risk breaking replication per the `wal-deletion-breaks-replication` insight. Only if you have no replication and no archiving should you consider manual WAL deletion, and even then, use `pg_archivecleanup` rather than `rm`.
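If, and only if, you are in the no-replication, no-archiving case, `pg_archivecleanup` is the safer tool for manual removal. In this sketch the data directory and the keep-file name are placeholders; the keep file must come from your own `pg_controldata` output:
```bash
# LAST RESORT: only with no replication slots and archiving disabled.
# 1. Find the oldest segment the server still needs (latest checkpoint's REDO WAL file)
pg_controldata /var/lib/postgresql/data | grep "REDO WAL file"

# 2. Dry run: -n only prints the segments that would be removed
pg_archivecleanup -n /var/lib/postgresql/data/pg_wal 000000010000000000000010

# 3. If the dry-run output looks right, remove them for real
pg_archivecleanup /var/lib/postgresql/data/pg_wal 000000010000000000000010
```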
6. Set preventive limits to avoid recurrence
After resolving the immediate crisis, configure `max_slot_wal_keep_size` (PostgreSQL 13+) to limit how much WAL a single slot can retain before being invalidated — this prevents future runaway WAL bloat. Set this to a value that gives replicas reasonable time to catch up (e.g., 50GB for a system generating 10GB WAL/day) but prevents disk exhaustion. Monitor `replication_slot_safe_wal_size` going forward to catch slots approaching their limit. As seen in the `wal-accumulation-replication-lag` insight, setting this limit is critical to preventing disk-full scenarios.
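A sketch of that configuration plus a follow-up monitoring query (PostgreSQL 13+; the 50GB cap is an example figure, not a recommendation):
```sql
-- Size the cap from your WAL generation rate and how long replicas may be offline.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- Ongoing check: slots close to invalidation show a shrinking safe_wal_size
SELECT slot_name,
       wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
WHERE wal_status IN ('extended', 'unreserved');
```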
Technologies
PostgreSQL
Related Insights
Excessive WAL file accumulation fills disk when replication lags or archiving fails
prerequisite · critical
Archive log failures risk data loss during point-in-time recovery
prerequisite · critical
WAL directory bloats to terabytes when replication consumers never reconnect
critical
Disk full errors from uncontrolled log file and media growth
critical
Logical replication slot retained 180 GB of WAL causing disk pressure
critical
WAL accumulation causing disk space crisis
diagnostic_step
Stub insight created from scenario discovery hint: WAL accumulation causing disk space crisis
archive command failing
prerequisite
Stub insight created from scenario discovery hint: archive command failing
Inactive replication slots prevent autovacuum from removing dead tuples cluster-wide
prerequisite · critical
Disk full prevents WAL file writes causing transaction failures and server shutdown
outcome · critical
replication slot preventing WAL cleanup
prerequisite
Stub insight created from scenario discovery hint: replication slot preventing WAL cleanup
WAL segment deletion causes replication lag and requires replica rebuild
supporting_evidence · critical
Disk space exhaustion from WAL files
diagnostic_step
Stub insight created from scenario discovery hint: Disk space exhaustion from WAL files
Relevant Metrics
postgresql.wal.bytes · postgresql.archiver.archived_count · postgresql.archiver.failed_count · postgresql.wal.age · replication_slot_wal_status · postgresql.database.size · last_archived_time · last_failed_time · postgresql.wal.records · wal_bytes (primary_diagnostic) · postgresql.wal_size (primary_diagnostic) · failed_count (threshold_indicator) · archived_count (monitor) · replication_slot_safe_wal_size (secondary_diagnostic) · replication_lag_seconds (secondary_diagnostic) · replication_slot_restart_delay_bytes (secondary_diagnostic)
Monitoring Interfaces
PostgreSQL Native