WAL Accumulation and Disk Space Crisis

Critical · Incident Response

Write-Ahead Log (WAL) files are accumulating and consuming disk space, risking a disk-full condition.

Prompt: My PostgreSQL instance on RDS is running out of disk space and I see the WAL directory has grown to 200GB. Why isn't WAL archiving or cleanup happening and how do I fix this urgently?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 1,187 vs 1,159
Latency: 27.3s vs 25.0s
Turns: 2 vs 2
Tool Calls: 0 vs 0
Schema Calls: 0 vs 0
Response: 2,846 chars vs 2,906 chars
Evaluation
Without Schema: B+
With Schema: A-
The response with Schema provides more comprehensive diagnostic queries (4 vs 3), including an explicit archive status check via pg_stat_archiver that the response without Schema doesn't include. It also offers more RDS-specific guidance, mentioning CloudWatch logs for archive errors, S3 bucket permissions, and more detailed replication lag monitoring with lag_bytes calculations. Both responses cover the critical causes (replication slots, long-running transactions) with actionable SQL queries and emergency procedures, but the response with Schema provides better prioritization (e.g., '80% of cases') and mentions PostgreSQL 13+ features like max_slot_wal_keep_size that the other misses. The RDS integration is notably stronger in the response with Schema, which would help operators navigate AWS-specific considerations more effectively.
Without Schema
WAL files accumulating on RDS PostgreSQL is typically caused by **replication slots** or **archiving failures**. Here's how to diagnose and fix it urgently:

## Immediate Diagnosis

Connect to your RDS instance and run these queries:

```sql
-- Check for inactive/lagging replication slots (most common cause)
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;

-- Check for long-running transactions
SELECT pid, usename, state, state_change,
       now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Check current WAL usage
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS current_wal_location;
```

## Common Causes & Fixes

### 1. **Inactive Replication Slots** (Most Common)

If you see slots with `active = false` or large `retained_wal`:

```sql
-- Drop inactive/old slots (CAUTION: ensure they're not needed)
SELECT pg_drop_replication_slot('slot_name_here');
```

**Why this happens**: Logical replication slots (e.g., from CDC tools like Debezium, pglogical, or abandoned read replicas) prevent WAL cleanup until the consumer catches up.

### 2. **Lagging Read Replicas**

Check replica lag in AWS Console or:

```sql
-- On replica:
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
```

**Fix**: If a replica is hopelessly behind, consider rebuilding it or temporarily removing it.

### 3. **Archive Command Failures** (RDS-specific)

RDS manages archiving automatically, but check:
- Backup retention window in RDS settings
- Any errors in RDS logs (CloudWatch Logs)

### 4. **Prevented by `wal_keep_size`/`wal_keep_segments`**

Check your parameter group:

```sql
SHOW wal_keep_size;      -- PostgreSQL 13+
SHOW wal_keep_segments;  -- PostgreSQL 12 and earlier
```

## Immediate Emergency Actions

If you're at >90% disk:

1. **Increase storage immediately** (RDS allows online resize):
```bash
aws rds modify-db-instance \
  --db-instance-identifier your-db \
  --allocated-storage NEW_SIZE \
  --apply-immediately
```
2. **Drop unused replication slots** (after verifying they're not needed)
3. **Force checkpoint** (limited help, but worth trying):
```sql
CHECKPOINT;
```
4. **Terminate long-running queries** (if found):
```sql
SELECT pg_terminate_backend(pid);
```

## Prevention

- **Monitor replication slots**: Set up CloudWatch alarms on `OldestReplicationSlotLag`
- **Set `max_slot_wal_keep_size`** (PostgreSQL 13+) to prevent unbounded WAL growth
- **Regular slot cleanup**: Remove slots from decommissioned consumers
- **Monitor disk space**: Alert at 70-80% usage

What do those diagnostic queries show? That will tell us exactly which issue you're hitting.
With Schema
I'll help you diagnose and fix this WAL accumulation issue urgently. Here are the most common causes and solutions for RDS PostgreSQL:

## **Immediate Diagnostics**

Run these queries to identify the root cause:

```sql
-- 1. Check for inactive replication slots (MOST COMMON CAUSE)
SELECT slot_name, slot_type, database, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;

-- 2. Check for long-running transactions
SELECT pid, usename, application_name, state,
       now() - xact_start AS transaction_age,
       now() - state_change AS state_age
FROM pg_stat_activity
WHERE state != 'idle' AND xact_start IS NOT NULL
ORDER BY xact_start;

-- 3. Check archive status
SELECT archived_count, failed_count, stats_reset
FROM pg_stat_archiver;

-- 4. Check current WAL usage
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_written;
```

## **Most Likely Causes & Fixes**

### **1. Inactive Replication Slots** (80% of cases)

If you see inactive slots in the first query:

```sql
-- DROP inactive slots immediately
SELECT pg_drop_replication_slot('slot_name_here');
```

**RDS-specific:** Check for Read Replicas that were deleted but left slots behind.

### **2. Long-Running Transactions**

If you see transactions running for hours/days:

```sql
-- Terminate blocking transaction
SELECT pg_terminate_backend(pid_here);
```

### **3. Archive Command Failures**

Check RDS logs in CloudWatch for archive errors. Common fix:
- Verify S3 bucket permissions (if using custom archiving)
- Check `archive_timeout` parameter

### **4. Replication Lag**

```sql
-- Check lag on standby/read replicas
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

If lag is huge, consider temporarily removing the replica.

## **Emergency Space Recovery**

If you need immediate space:

1. **Increase RDS storage** (fastest, safest):
```bash
aws rds modify-db-instance \
  --db-instance-identifier your-db \
  --allocated-storage NEW_SIZE \
  --apply-immediately
```
2. **Force checkpoint**:
```sql
CHECKPOINT;
```
3. **Adjust WAL retention** (RDS parameter group):
   - Set `rds.logical_replication = 0` (if not needed)
   - Lower `wal_keep_size` or `wal_keep_segments`

## **RDS-Specific Commands**

Check current WAL parameters:

```sql
SHOW wal_keep_size;  -- or wal_keep_segments for older versions
SHOW max_wal_size;
SHOW archive_mode;
```

## **Prevent Future Issues**

1. Monitor replication slot lag regularly
2. Set up CloudWatch alarms for:
   - Free storage space < 20%
   - Oldest replication slot lag > threshold
3. Use `max_slot_wal_keep_size` to prevent runaway growth (PostgreSQL 13+)

**What's your first diagnostic query result?** That will tell us exactly which fix to apply.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When WAL files accumulate dangerously on RDS PostgreSQL, the most common culprit is inactive replication slots preventing cleanup — check those first. Next, verify your archiving process isn't silently failing by examining pg_stat_archiver metrics. Only after ruling out slots and archiving should you investigate replication lag or configuration issues like wal_keep_size.

1. Check for inactive or stuck replication slots
This is the #1 cause of WAL accumulation in my experience. Run `SELECT slot_name, active, catalog_xmin, restart_lsn FROM pg_replication_slots;` and look for slots where `active = false` or where `catalog_xmin` is set (indicates a logical replication slot). Inactive slots prevent WAL cleanup indefinitely — a single forgotten slot can consume hundreds of GB. If you find unused slots, drop them immediately with `SELECT pg_drop_replication_slot('slot_name');` and you should see disk space recover within minutes as PostgreSQL recycles old WAL files.
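The slot check above is easy to automate. A minimal sketch of that logic, with hypothetical helper names and an illustrative 1 GiB threshold (adjust to your workload); the input rows would come from joining `pg_replication_slots` with `pg_wal_lsn_diff()` on the primary:

```python
def slots_to_investigate(rows, retained_threshold_bytes=1 << 30):
    """Flag replication slots likely blocking WAL cleanup.

    rows: iterable of (slot_name, active, retained_bytes) tuples,
    where retained_bytes is pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn).
    Returns slot names that are inactive or retain more WAL than the threshold.
    """
    flagged = []
    for slot_name, active, retained_bytes in rows:
        if not active or retained_bytes > retained_threshold_bytes:
            flagged.append(slot_name)
    return flagged
```

For example, an abandoned CDC slot retaining 200 GiB would be flagged while a healthy, nearly caught-up replica slot would not: `slots_to_investigate([("debezium_slot", False, 200 * (1 << 30)), ("replica_a", True, 4 << 20)])` returns `["debezium_slot"]`.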
2. Verify WAL archiving is functioning
Query `pg_stat_archiver` and compare `archived_count` versus `failed_count`. If `failed_count` is climbing, your archive command is failing silently and WAL files are piling up because PostgreSQL won't delete them until they're successfully archived. On RDS, check CloudWatch logs for archive-related errors and verify your S3 bucket permissions, disk space on the archive destination, or network connectivity. A failing archive command is a point-in-time recovery risk as well as a disk space crisis.
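"`failed_count` is climbing" means comparing two samples taken some minutes apart, since a nonzero historical count alone proves nothing. A sketch of that comparison (function name is illustrative; each sample is an `(archived_count, failed_count)` pair read from `pg_stat_archiver`):

```python
def archiver_failing(sample_before, sample_after):
    """Return True if archive failures increased between two samples
    of pg_stat_archiver, i.e. the archive command is currently failing.

    Each sample is a tuple: (archived_count, failed_count).
    """
    failed_delta = sample_after[1] - sample_before[1]
    return failed_delta > 0
```

So `archiver_failing((100, 5), (100, 9))` is `True` (four new failures, no new successes), while `archiver_failing((100, 5), (140, 5))` is `False` (archiving progressed cleanly).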
3. Measure replication lag from primary to replicas
Check `replication_lag_seconds` for all replicas — if any replica is hours or days behind, the primary must retain WAL files until the replica catches up. On the primary, query `SELECT client_addr, state, sync_state, replay_lag FROM pg_stat_replication;` to see lag per replica. If a replica is stuck or lagging badly (e.g., >30 minutes), you may need to rebuild it from a fresh snapshot rather than wait for it to catch up, or temporarily remove it from replication to allow WAL cleanup.
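The lag arithmetic can also be done client-side if you have raw LSN strings rather than `pg_wal_lsn_diff` output. An LSN like `16/B374D848` encodes a 64-bit byte position as two 32-bit hex halves; a sketch of the conversion (helper names are illustrative):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string like '16/B374D848' to an absolute
    byte position: the part before '/' is the high 32 bits, the part
    after is the low 32 bits."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Replication lag in bytes between what the primary has sent
    and what the replica has replayed."""
    return lsn_to_bytes(sent_lsn) - lsn_to_bytes(replay_lsn)
```

For example, `lag_bytes("16/B374D848", "16/B3740000")` is `55368` bytes, and `lsn_to_bytes("1/0")` is exactly 2^32.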
4. Review wal_keep_size configuration
Check `SHOW wal_keep_size;` — if this is set to a large value (e.g., 50GB or more), PostgreSQL is explicitly retaining that much WAL even when it's not needed. On RDS, this is sometimes set conservatively high to protect replicas during brief network issues, but if you're not actually using streaming replication or your replicas are healthy, you can reduce it. Compare `wal_keep_size` to your actual `wal_bytes` generation rate to see if it's appropriate for your workload.
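Comparing `wal_keep_size` to your generation rate is simple arithmetic: sample `wal_bytes` twice, derive MB/hour, and see how many hours of headroom the setting actually buys. A sketch with illustrative function names:

```python
def wal_rate_mb_per_hour(wal_bytes_delta: int, hours: float) -> float:
    """WAL generation rate, from the growth in wal_bytes over an interval."""
    return wal_bytes_delta / (1024 ** 2) / hours

def keep_size_coverage_hours(wal_keep_size_mb: float, rate_mb_per_hour: float) -> float:
    """How many hours of WAL generation wal_keep_size can absorb before
    it stops protecting replicas and just wastes disk."""
    return wal_keep_size_mb / rate_mb_per_hour
```

E.g., 600 MiB of WAL written over 2 hours is a 300 MB/hour rate, and a 4096 MB `wal_keep_size` then covers 8 hours at a sustained 512 MB/hour; if your replicas never lag more than minutes, a setting that large is mostly dead weight.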
5. Take immediate action if disk is critically full
If disk usage is above 90%, you're in emergency territory — once disk is 100% full, PostgreSQL can't write WAL and will shut down or reject transactions. On RDS, you can't manually delete WAL files, but you can temporarily increase storage (RDS allows online resize) to buy time while you fix the root cause identified in steps 1-4. Monitor `postgresql.wal_size` closely and expect it to drop significantly once you remove inactive slots or fix archiving failures.
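The 90% trigger above is straightforward to wire into an automated check fed by allocated storage and an RDS `FreeStorageSpace`-style metric. A sketch (function names and the default threshold are illustrative):

```python
def disk_usage_pct(allocated_bytes: int, free_bytes: int) -> float:
    """Percentage of allocated storage currently in use."""
    return 100.0 * (allocated_bytes - free_bytes) / allocated_bytes

def needs_emergency_resize(allocated_bytes: int, free_bytes: int,
                           threshold_pct: float = 90.0) -> bool:
    """True if usage has crossed the emergency threshold and storage
    should be grown while the root cause is being fixed."""
    return disk_usage_pct(allocated_bytes, free_bytes) >= threshold_pct
```

With 8 GB free out of 100 GB allocated, usage is 92% and the check fires; at 20 GB free it does not.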

Related Insights

replication slot preventing WAL cleanup
Stub insight created from scenario discovery hint: replication slot preventing WAL cleanup
wal_keep_size set too high
Stub insight created from scenario discovery hint: wal_keep_size set too high
WAL accumulation causing disk space crisis
diagnostic_step
Stub insight created from scenario discovery hint: WAL accumulation causing disk space crisis
archive command failing
prerequisite
Stub insight created from scenario discovery hint: archive command failing
Inactive replication slots prevent autovacuum from removing dead tuples cluster-wide
prerequisite · critical
Disk full prevents WAL file writes causing transaction failures and server shutdown
outcome · critical
disk full risk
threshold_indicator
Stub insight created from scenario discovery hint: disk full risk
WAL archiving healthy
supporting_evidence
Stub insight created from scenario discovery hint: WAL archiving healthy
WAL Accumulation and Disk Space Crisis
diagnostic_step
Stub insight created from scenario discovery hint: WAL Accumulation and Disk Space Crisis
Archive log failures risk data loss during point-in-time recovery
prerequisite · critical
Excessive WAL file accumulation fills disk when replication lags or archiving fails
prerequisite · critical

Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog
PostgreSQL Prometheus
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)