Replication Slot Bloat and Cleanup

Incident Response

Managing PostgreSQL replication slots that accumulate WAL files and prevent vacuum from progressing, requiring careful cleanup without breaking replication.

Prompt: My PostgreSQL disk is filling up with WAL files and I see an inactive replication slot that hasn't been used in days. It's preventing vacuum from cleaning up old data. Can I safely drop this slot or will it break replication?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

| Metric | Without Schema | With Schema |
|---|---|---|
| Tokens | 928 | 699 |
| Latency | 19.2s | 14.2s |
| Turns | 2 | 2 |
| Tool Calls | 0 | 0 |
| Schema Calls | 0 | 0 |
| Response | 2,588 chars | 1,742 chars |

Evaluation

Without Schema: A- vs With Schema: B+
Both responses provide technically accurate and actionable guidance for dropping inactive replication slots. The response without Schema is notably more comprehensive, covering important edge cases like forcing slot drops when they're still marked active and showing how to terminate backend connections if needed. The response with Schema includes a helpful CHECKPOINT command to accelerate WAL cleanup but is less thorough overall in handling potential complications. Both correctly identify the safety conditions and consequences, though the without-Schema response provides more defensive steps that would be particularly valuable in production scenarios.
Without Schema
I'll help you safely handle this inactive replication slot. Let me break down the situation and provide a safe approach.

## Understanding the Issue

An inactive replication slot prevents PostgreSQL from:

- Removing old WAL files (causing disk space issues)
- Advancing the `xmin` horizon (blocking VACUUM from reclaiming dead tuples)

## Before Dropping - Check the Slot Status

First, let's examine your replication slots:

```sql
-- Check all replication slots
SELECT slot_name, slot_type, database, active, active_pid,
       restart_lsn, confirmed_flush_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Check WAL disk usage
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```

## Is It Safe to Drop?

**YES, it's safe to drop if:**

- `active = false` (the slot isn't currently connected)
- You've confirmed no replica is using it or plans to reconnect
- The slot has been inactive for days (as you mentioned)

**The consequence of dropping:**

- If a replica was using this slot and tries to reconnect, it will fail
- The replica will need to be rebuilt from a fresh backup or use `pg_rewind`

## How to Safely Drop the Slot

```sql
-- Drop the inactive slot
SELECT pg_drop_replication_slot('slot_name_here');
```

If you get an error that the slot is still active, you can force it:

```sql
-- Terminate the connection first (if needed)
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name = 'slot_name_here' AND active_pid IS NOT NULL;

-- Then drop it
SELECT pg_drop_replication_slot('slot_name_here');
```

## After Dropping - Verify Cleanup

```sql
-- Confirm the slot is gone
SELECT * FROM pg_replication_slots;

-- PostgreSQL will automatically clean up old WAL files
-- Check disk space after a few minutes
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```

## If You Need to Keep Replication

If this replica might come back, consider:

1. **Set `max_slot_wal_keep_size`** (PostgreSQL 13+) to prevent runaway WAL growth:

```sql
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();
```

2. **Use temporary slots** for replicas that might disconnect frequently
3. **Rebuild the replica** instead of keeping the slot indefinitely

**Bottom line:** If the slot has been inactive for days and you don't recognize it or know it's abandoned, **drop it**. Your disk space issue will resolve immediately.

Would you like me to help you write a monitoring query or script to prevent this from happening again?
With Schema
I can help you assess whether it's safe to drop that replication slot. Let me guide you through checking the situation and safely handling it.

## Check the replication slot status

First, let's identify the inactive slot and confirm it's truly unused:

```sql
-- View all replication slots and their status
SELECT slot_name, slot_type, active, restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       pg_current_wal_lsn() - restart_lsn AS wal_lag_bytes
FROM pg_replication_slots;
```

## When it's safe to drop

**You can safely drop the slot if:**

- `active = false` (it's inactive)
- You've confirmed no standby/subscriber is using it
- You've verified it's not needed for point-in-time recovery
- The application/replica that used it is truly decommissioned

**Warning signs to investigate first:**

- If you have a standby server that's temporarily down but will return
- If the slot is for logical replication and a subscriber will reconnect
- If it's part of a backup/DR strategy

## Drop the slot

```sql
-- Drop the inactive slot
SELECT pg_drop_replication_slot('slot_name_here');
```

This immediately frees up WAL files for cleanup.

## Verify cleanup

After dropping:

```sql
-- Check WAL disk usage
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();

-- Trigger checkpoint to accelerate cleanup
CHECKPOINT;
```

## Prevent future issues

Consider setting a safety limit:

```sql
-- Set max WAL size retained per slot (PostgreSQL 13+)
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
```

**Do you want me to help you draft the specific commands to check your slot, or do you need help identifying which replica/application was using it?**

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When investigating replication slot bloat and WAL accumulation, start by identifying inactive slots and quantifying their WAL retention impact. Next, verify whether the slot's consumer still exists and check if the slot is blocking vacuum from advancing. Finally, assess transaction wraparound risk to understand urgency. Most cases involve orphaned slots from decommissioned replicas that can be safely dropped.

1. Identify inactive replication slots and their activity status
Query pg_replication_slots to see which slots exist and check `replication_slot_is_active` to identify inactive ones. A slot showing active=false for days is a red flag: it means no consumer has connected recently. This is the most common cause of WAL bloat, as seen in the `inactive-replication-slots-block-autovacuum` pattern, where slots from decommissioned servers or failed logical replication setups sit idle indefinitely. If a slot hasn't been active in days and you don't recognize the slot name, it's likely orphaned.
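A minimal form of the query behind this check, run directly on the primary (the `active` column is what the metric above is derived from):

```sql
-- List replication slots that have no currently connected consumer
SELECT slot_name, slot_type, database, active, active_pid, restart_lsn
FROM pg_replication_slots
WHERE NOT active;
```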
2. Quantify WAL retention caused by the suspect slot
Check `replication_slot_restart_delay_bytes` to see how much WAL is being retained specifically for this slot; values in the hundreds of GB or higher indicate serious bloat. Also look at `replication_slot_safe_wal_size` to understand whether the slot is approaching or has exceeded safe limits. The `wal-directory-bloat-orphaned-slots` insight shows cases where inactive slots caused 900+ GB of WAL accumulation in the pg_wal directory, consuming most of the disk space while active databases used only ~70 GB. If restart_delay is massive and the slot is inactive, you've found your culprit.
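A sketch of the equivalent catalog query, run on the primary; note that `wal_status` and `safe_wal_size` exist only on PostgreSQL 13+, and `safe_wal_size` is NULL when `max_slot_wal_keep_size` is unset (-1):

```sql
-- Per-slot WAL retention, largest offender first
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       wal_status,                                      -- reserved / extended / unreserved / lost (PG 13+)
       pg_size_pretty(safe_wal_size) AS safe_wal_left   -- NULL if max_slot_wal_keep_size = -1
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```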
3. Verify whether the slot's consumer still exists and will reconnect
Cross-reference the slot name with your infrastructure inventory: does the replica or logical replication subscriber still exist? Check pg_stat_replication to see whether the slot has any active connections now or had any in recent monitoring history. The `wal-directory-bloat-orphaned-slots` insight describes slots waiting for consumers that will never return, often from decommissioned servers. If the server was decommissioned, the subscription was removed, or the replica was rebuilt without preserving the slot name, it's safe to drop. If you're unsure, coordinate with your team to confirm before proceeding.
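One way to confirm nothing is attached right now is to join the slot against pg_stat_replication; `suspect_slot` below is a placeholder for your slot name:

```sql
-- Is any walsender currently holding this slot?
SELECT s.slot_name, s.active, s.active_pid,
       r.application_name, r.client_addr, r.state
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
WHERE s.slot_name = 'suspect_slot';
```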
4. Check if the slot is preventing vacuum from advancing
Query pg_replication_slots and look at age(xmin) or age(catalog_xmin): large values (millions of transactions) mean the slot is holding back vacuum cluster-wide. The `old-replication-slots-prevent-vacuum` insight warns that this prevents VACUUM from advancing relfrozenxid, which can lead to transaction ID wraparound. The `inactive-replication-slots-block-autovacuum` pattern shows slots causing autovacuum to run continuously at high CPU (80%+ during business hours) without actually removing dead tuples. If age values are in the millions and the slot is inactive, dropping it will immediately unblock vacuum.
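The ages can be read straight off pg_replication_slots; as a rule of thumb, physical slots set xmin only when the standby runs with hot_standby_feedback, while logical slots pin catalog_xmin:

```sql
-- How far each slot is holding back the vacuum/freeze horizon, worst first
SELECT slot_name, active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
ORDER BY GREATEST(age(xmin), age(catalog_xmin)) DESC NULLS LAST;
```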
5. Assess transaction wraparound risk to understand urgency
Check `database-wraparound-age-datfrozenxid` across all databases: if any database shows an age approaching 2 billion transactions (the hard wraparound limit at which PostgreSQL stops accepting new writes), you're in a critical situation. The `old-replication-slots-prevent-vacuum` insight explains that slots preventing vacuum from advancing relfrozenxid can lead to wraparound, which will force PostgreSQL into emergency read-only mode. If wraparound age is over 1.5 billion and climbing, this is a drop-everything-now situation. Drop the orphaned slot immediately and monitor vacuum progress to ensure datfrozenxid starts advancing again.
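The per-database age behind that metric comes from pg_database; for reference, anti-wraparound autovacuum kicks in at autovacuum_freeze_max_age (200 million by default), well before the 2-billion hard limit:

```sql
-- Transaction ID age per database, oldest first
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```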

Technologies

PostgreSQL

Related Insights

inactive-replication-slots-block-autovacuum
wal-directory-bloat-orphaned-slots
old-replication-slots-prevent-vacuum

Relevant Metrics

replication_slot_is_active
replication_slot_restart_delay_bytes
replication_slot_safe_wal_size
database-wraparound-age-datfrozenxid

Monitoring Interfaces

PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration