Disk space exhaustion from WAL files

criticalIncident Response

Write-Ahead Log directory growing uncontrolled due to archiving failures or replication issues, threatening database availability

Prompt: “My PostgreSQL pg_wal directory is consuming 80% of available disk space and growing. I have replication configured—how do I safely clean this up without breaking replication or causing data loss?”

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens

1,801vs2,518

Latency

33.1smsvs55.4sms

Turns

5vs22

Tool Calls

1vs10

Schema Calls

0vs6

Response

3,331 charsvs3,102 chars

Evaluation

Without Schema

B+

With Schema

A-

The response without Schema provides comprehensive coverage of multiple potential causes (inactive slots, replication lag, archive failures, wal_keep_size settings) with extensive diagnostic queries. However, it's somewhat broad and doesn't clearly prioritize the most common culprit. The response with Schema is more focused and strategic, immediately homing in on replication slots as the primary cause and providing a clearer decision framework with explicit verification criteria. It also cites official PostgreSQL documentation sources and mentions newer preventive features like max_slot_wal_keep_size. Both responses require diagnostic output to proceed, but the Schema-augmented response provides a more streamlined path to resolution with authoritative backing.

Without Schema

1 tool calls

I'll help you safely address this pg_wal growth issue. This is typically caused by WAL files not being cleaned up due to replication lag or archiving issues. Let's diagnose and fix it carefully.

## First, let's diagnose the issue:
Let me provide you with the diagnostic commands and cleanup steps directly:

## **Step 1: Diagnose the Root Cause**

Run these commands to identify why WAL is accumulating:

```bash
# Check replication status
psql -U postgres -c "SELECT * FROM pg_stat_replication;"

# Check replication slots (CRITICAL - this is often the culprit)
psql -U postgres -c "SELECT slot_name, slot_type, active, restart_lsn, 
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;"

# Check archive status
psql -U postgres -c "SELECT archived_count, failed_count, last_archived_wal, last_failed_wal 
    FROM pg_stat_archiver;"

# Check WAL settings
psql -U postgres -c "SHOW wal_keep_size; SHOW max_wal_size; SHOW archive_command;"
```

## **Common Causes & Solutions:**

### **1. Inactive Replication Slot (Most Common)**
If you see an **inactive** replication slot with large `retained_wal`:

```sql
-- Find inactive slots
SELECT slot_name, active, retained_wal FROM pg_replication_slots WHERE NOT active;

-- If the replica is permanently gone, DROP the slot:
SELECT pg_drop_replication_slot('slot_name_here');
```

### **2. Replication Lag**
If replicas are active but lagging:

```sql
-- Check lag on each replica
SELECT client_addr, state, 
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
    FROM pg_stat_replication;
```

**Fix**: Investigate why replicas are slow (network, disk I/O, long queries)

### **3. Archive Command Failures**
If `failed_count` is high in `pg_stat_archiver`:

- Check `last_failed_wal` for error details
- Verify archive destination has space
- Test archive_command manually

### **4. wal_keep_size Too High**
If set excessively high:

```sql
-- Reduce it (requires reload, not restart)
ALTER SYSTEM SET wal_keep_size = '2GB';
SELECT pg_reload_conf();
```

## **Step 2: Safe Cleanup**

**DO NOT manually delete files from pg_wal!** PostgreSQL manages this automatically once the root cause is fixed.

**Safe approach:**

1. **Fix the root cause** (drop inactive slots, fix archive issues)
2. **Force a checkpoint** to trigger cleanup:
   ```sql
   CHECKPOINT;
   ```
3. **Wait 5-10 minutes** - PostgreSQL will auto-remove old WAL files
4. **Verify**:
   ```bash
   watch -n 10 'du -sh /var/lib/postgresql/data/pg_wal'
   ```

## **Emergency: If Disk is 95%+ Full**

Only if critically full and you've confirmed no active replicas need the WAL:

```sql
-- Temporarily reduce wal_keep_size
ALTER SYSTEM SET wal_keep_size = '512MB';
SELECT pg_reload_conf();
CHECKPOINT;
```

**Monitor** your replicas closely after this to ensure they don't break.

## **Prevention:**

```sql
-- Set up slot-based replication with max_slot_wal_keep_size (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';

-- Monitor with alerts
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) 
    FROM pg_replication_slots WHERE NOT active;
```

**What's the output of the diagnostic queries?** That will tell us exactly what's causing your WAL accumulation, and I can provide specific next steps.

With Schema

10 tool calls6 Schema

I'll get you expert guidance on safely cleaning up your PostgreSQL WAL directory while preserving replication.
Your pg_wal bloat is almost certainly caused by **inactive or lagging replication slots** preventing WAL file cleanup. Here's how to safely resolve this:

## Immediate Diagnostics

**1. Check replication slot status:**
```sql
SELECT slot_name, 
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       age(xmin) AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

This shows which slots are retaining WAL and how much disk space they're holding.

**2. Cross-reference with active replication:**
```sql
SELECT pid, client_hostname, state, backend_xmin
FROM pg_stat_replication
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;
```

## Safe Cleanup Strategy

**If you find inactive slots (active = false):**

These are slots for replicas that are disconnected or no longer exist. They're the primary cause of WAL bloat.

⚠️ **Before dropping a slot**, verify:
- Is the replica permanently gone? → Safe to drop
- Is it temporarily down but will reconnect? → **Do NOT drop** (replica will need full rebuild)
- Is it for a downstream system you no longer use? → Safe to drop

**To drop an inactive slot:**
```sql
SELECT pg_drop_replication_slot('slot_name_here');
```

**After dropping slots**, WAL files will be cleaned up automatically during the next checkpoint cycle.

**If slots are active but lagging significantly:**

The replica is connected but falling behind. Check:
- Network issues between primary and replica
- Replica resource constraints (CPU, disk I/O)
- Large transactions blocking replication apply

## Preventive Configuration

Set limits to auto-invalidate problematic slots (PostgreSQL 13+):

```sql
-- Limit WAL retention to 100GB per slot
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();
```

This prevents any single slot from consuming unlimited disk space.

## Emergency Disk Space Recovery

If you're critically low on disk and need immediate relief:

**DO NOT** manually delete files from pg_wal — this causes corruption.

Instead:
1. Drop inactive slots first (safest)
2. Consider temporarily dropping less-critical active slots (requires replica rebuild)
3. Force a checkpoint: `CHECKPOINT;`
4. Monitor: `SELECT pg_ls_waldir() ORDER BY modification DESC;`

## Critical Warning

Dropping an active replication slot **breaks that replica's connection**. The replica will need to be rebuilt from a fresh base backup using `pg_basebackup`. Only drop active slots if you're prepared to rebuild.

**Source documentation:**
- [PostgreSQL Routine Vacuuming](https://www.postgresql.org/docs/current/routine-vacuuming.html)
- [Debugging autovacuum problems](https://www.citusdata.com/blog/2022/07/28/debugging-postgres-autovacuum-problems-13-tips)

Run the diagnostic queries first and share the output if you need help deciding which slots are safe to drop.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing WAL directory bloat in PostgreSQL with replication, start by identifying inactive or orphaned replication slots — the most common cause of unbounded WAL retention. Then check whether archiving is failing or active replicas are severely lagged. Only after understanding the root cause can you safely clean up WAL files without breaking replication or losing data.

1Check for inactive or orphaned replication slots

Query `pg_replication_slots` to identify slots that are inactive or have consumers that never reconnected. Check `replication_slot_wal_status` — if it shows 'reserved' or 'extended', that slot is holding WAL files. Look at `replication_slot_restart_delay_bytes` to see how much WAL each slot is retaining. In my experience, this is the #1 cause of WAL bloat: a slot created for a replica or logical subscription that was decommissioned without dropping the slot, causing WAL to accumulate indefinitely (sometimes 900+ GB as seen in the `wal-directory-bloat-orphaned-slots` insight).

WAL directory bloats to terabytes when replication consumers never reconnect Logical replication slot retained 180 GB of WAL causing disk pressure Inactive replication slots prevent autovacuum from removing dead tuples cluster-wide

2Check archive command status and failures

Query `pg_stat_archiver` and compare `archived_count` versus `failed_count` — if `failed_count` is climbing, your archive command is failing and WAL files cannot be cleaned up. Check `last_archived_time` and `last_failed_time` to see when the problem started. Common causes include insufficient storage at the archive destination, permission issues, or network failures. As noted in the `archive-log-failures` insight, this directly prevents WAL cleanup and risks point-in-time recovery capabilities.

Archive log failures risk data loss during point-in-time recovery Excessive WAL file accumulation fills disk when replication lags or archiving fails

3Measure replication lag on active slots

For any slots that ARE active (state = 'active' in `pg_replication_slots`), check `replication_lag_seconds` to see if replicas are keeping up with the primary. If lag is measured in hours or days, WAL files must be retained until the replica catches up. Cross-reference with `postgresql.wal.age` to understand how far behind the oldest retained WAL file is. If replication is healthy but WAL still accumulates, you likely have an inactive slot or archiving issue.

Excessive WAL file accumulation fills disk when replication lags or archiving fails WAL segment deletion causes replication lag and requires replica rebuild

4Assess immediate disk pressure and WAL generation rate

Check current disk usage with `df -h` and WAL directory size with `du -sh /var/lib/postgresql/*/pg_wal/`. Look at `postgresql.wal.bytes` to understand your WAL generation rate — if you're generating 10GB/hour and have 20% disk remaining, you have ~2 hours to act. Count files in pg_wal to understand the backlog. The `disk-full-prevents-wal-writes` insight warns that once disk is full, PostgreSQL cannot write new WAL and will shut down, so calculate your runway carefully.

Disk full prevents WAL file writes causing transaction failures and server shutdown Disk full errors from uncontrolled log file and media growth

5Drop inactive slots or fix archiving before manual cleanup

Based on findings, take targeted action: if you found inactive slots (step 1), drop them immediately with `SELECT pg_drop_replication_slot('slot_name')` — WAL cleanup will happen automatically within minutes. If archiving is failing (step 2), fix the archive_command and verify successful archiving before proceeding. Never manually delete WAL files if you have active replication slots or archiving enabled — you risk breaking replication per the `wal-deletion-breaks-replication` insight. Only if you have no replication and no archiving should you consider manual WAL deletion, and even then, use `pg_archivecleanup` rather than `rm`.

WAL directory bloats to terabytes when replication consumers never reconnect WAL segment deletion causes replication lag and requires replica rebuild Archive log failures risk data loss during point-in-time recovery

6Set preventive limits to avoid recurrence

After resolving the immediate crisis, configure `max_slot_wal_keep_size` (PostgreSQL 13+) to limit how much WAL a single slot can retain before being invalidated — this prevents future runaway WAL bloat. Set this to a value that gives replicas reasonable time to catch up (e.g., 50GB for a system generating 10GB WAL/day) but prevents disk exhaustion. Monitor `replication_slot_safe_wal_size` going forward to catch slots approaching their limit. As seen in the `wal-accumulation-replication-lag` insight, setting this limit is critical to preventing disk-full scenarios.

Excessive WAL file accumulation fills disk when replication lags or archiving fails Logical replication slot retained 180 GB of WAL causing disk pressure