Replication Slot Bloat and Cleanup

Incident Response

Managing PostgreSQL replication slots that accumulate WAL files and prevent vacuum from progressing, requiring careful cleanup without breaking replication.

Prompt: My PostgreSQL disk is filling up with WAL files and I see an inactive replication slot that hasn't been used in days. It's preventing vacuum from cleaning up old data. Can I safely drop this slot or will it break replication?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When investigating replication slot bloat and WAL accumulation, start by identifying inactive slots and quantifying their WAL retention impact. Next, verify whether the slot's consumer still exists and check if the slot is blocking vacuum from advancing. Finally, assess transaction wraparound risk to understand urgency. Most cases involve orphaned slots from decommissioned replicas that can be safely dropped.

1. Identify inactive replication slots and their activity status
Query pg_replication_slots and check the `active` column (surfaced in monitoring as `replication_slot_is_active`) to identify inactive slots. A slot showing active = false for days is a red flag: it means no consumer has connected recently. This is the most common cause of WAL bloat, as seen in the `inactive-replication-slots-block-autovacuum` pattern, where slots from decommissioned servers or failed logical replication setups sit idle indefinitely. If a slot hasn't been active in days and you don't recognize the slot name, it's likely orphaned.
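This check can be run directly against the catalog; in pg_replication_slots, the `active` column carries the same information as the `replication_slot_is_active` metric:

```sql
-- List all replication slots; active = false means no consumer
-- is currently connected to the slot
SELECT slot_name,
       slot_type,
       active,
       active_pid,
       restart_lsn
FROM pg_replication_slots
ORDER BY active, slot_name;
```

Slots with `active = false` and a `restart_lsn` far behind the current WAL position are the ones to investigate in the next step.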
2. Quantify WAL retention caused by the suspect slot
Check `replication_slot_restart_delay_bytes` to see how much WAL is being retained specifically for this slot—values in the hundreds of GB or higher indicate serious bloat. Also look at `replication_slot_safe_wal_size` to understand if the slot is approaching or has exceeded safe limits. The `wal-directory-bloat-orphaned-slots` insight shows cases where inactive slots caused 900+ GB of WAL accumulation in pg_wal directory, consuming most of the disk space while active databases used only ~70 GB. If restart_delay is massive and the slot is inactive, you've found your culprit.
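The byte figure behind `replication_slot_restart_delay_bytes` can be computed directly as the distance between the current WAL write position and each slot's restart_lsn. A sketch (the `wal_status` and `safe_wal_size` columns require PostgreSQL 13 or later):

```sql
-- WAL retained per slot, plus how close each slot is to the
-- max_slot_wal_keep_size limit (safe_wal_size, PostgreSQL 13+)
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal,
       wal_status,
       pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

On a standby, substitute pg_last_wal_receive_lsn() for pg_current_wal_lsn().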
3. Verify whether the slot's consumer still exists and will reconnect
Cross-reference your slot name with your infrastructure inventory—does the replica or logical replication subscriber still exist? Check pg_stat_replication to see if the slot has any active connections now or in monitoring history. The `wal-directory-bloat-orphaned-slots` insight describes slots waiting for consumers that will never return, often from decommissioned servers. If the server was decommissioned, the subscription was removed, or the replica was rebuilt without preserving the slot name, it's safe to drop. If you're unsure, coordinate with your team to confirm before proceeding.
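One way to cross-check from the database side is to join slots to their walsender connections; an orphaned slot will have no matching row in pg_stat_replication:

```sql
-- A slot with a live consumer is joined to its walsender via active_pid;
-- orphaned slots show NULLs on the right-hand side
SELECT s.slot_name,
       s.active,
       r.application_name,
       r.client_addr,
       r.state
FROM pg_replication_slots s
LEFT JOIN pg_stat_replication r ON r.pid = s.active_pid
ORDER BY s.slot_name;
```

This only shows current connections; a consumer that reconnects intermittently will look orphaned between sessions, which is why the infrastructure-inventory check matters too.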
4. Check if the slot is preventing vacuum from advancing
Query pg_replication_slots and look at age(xmin) or age(catalog_xmin)—large values (millions of transactions) mean the slot is holding back vacuum cluster-wide. The `old-replication-slots-prevent-vacuum` insight warns that this prevents VACUUM from advancing relfrozenxid, which can lead to transaction ID wraparound. The `inactive-replication-slots-block-autovacuum` pattern shows slots causing autovacuum to run continuously at high CPU (80%+ during business hours) without actually removing dead tuples. If age values are in the millions and the slot is inactive, dropping it will immediately unblock vacuum.
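The xmin-horizon check as a query; large age() values mean the slot is pinning the oldest transaction vacuum must preserve, cluster-wide:

```sql
-- age() in the millions means the slot is holding back vacuum;
-- GREATEST ignores NULLs, so slots with only one horizon still sort correctly
SELECT slot_name,
       active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots
WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
ORDER BY GREATEST(age(xmin), age(catalog_xmin)) DESC NULLS LAST;
```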
5. Assess transaction wraparound risk to understand urgency
Check `database-wraparound-age-datfrozenxid` across all databases: if any database shows age approaching 2 billion transactions (the hard limit at which PostgreSQL stops assigning new transaction IDs), you're in a critical situation. Note that emergency autovacuum triggers much earlier, at autovacuum_freeze_max_age (200 million transactions by default); ages far beyond that mean those emergency vacuums have been unable to advance the frozen horizon. The `old-replication-slots-prevent-vacuum` insight explains that slots preventing vacuum from advancing relfrozenxid can lead to wraparound, which will force PostgreSQL into emergency read-only mode. If wraparound age is over 1.5 billion and climbing, this is a drop-everything-now situation: drop the orphaned slot immediately and monitor vacuum progress to ensure datfrozenxid starts advancing again.
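A sketch of the urgency check and the eventual cleanup; `orphaned_slot_name` below is a placeholder for the slot you've confirmed is dead:

```sql
-- Wraparound headroom per database; the hard limit is roughly 2 billion
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- Once the slot is confirmed orphaned, drop it.
-- This fails if the slot is in use, which is a useful safety check:
-- an active consumer cannot be dropped out from under by accident.
SELECT pg_drop_replication_slot('orphaned_slot_name');
```

After dropping, re-run the age(datfrozenxid) query periodically (and let autovacuum run, or VACUUM the oldest databases manually) to confirm the age is falling.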

Monitoring Interfaces

PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration