Replication lag threatening data consistency

Critical Incident Response

Primary-replica replication lag increasing, risking stale reads and potential data loss during failover

Prompt: My PostgreSQL read replicas are falling behind the primary by 30 seconds and climbing. Help me diagnose if this is a resource bottleneck, network issue, or replication slot problem before it causes an outage.

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 7,911 vs 6,796
Latency: 12.2m vs 1.8m
Turns: 13 vs 12
Tool Calls: 6 vs 6
Schema Calls: 0 vs 0
Response: 2,654 chars vs 2,942 chars
Evaluation

Without Schema: B+ vs With Schema: A-
Both responses provide actionable diagnostic queries covering resource, network, and replication slot issues. The response without Schema offers solid LSN-based diagnostics but recommends disabling hot_standby_feedback, which typically worsens recovery conflicts rather than helping. The response with Schema uses more modern PostgreSQL features (write_lag/flush_lag/replay_lag columns), introduces the clearer "backlog" diagnostic concept, correctly recommends enabling hot_standby_feedback to reduce conflicts, and includes more comprehensive network diagnostics (mtr for packet loss) plus emergency disk space considerations. The with-Schema response is also better structured with clear A/B/C sections and an emergency action protocol, making it easier to follow under pressure.
Without Schema
6 tool calls
I'll help you diagnose this PostgreSQL replication lag issue quickly. Let me first check if you have any relevant database configuration or monitoring scripts in your current directory, then provide you with diagnostic queries. I'll create a comprehensive diagnostic script to help you identify the root cause. This is time-sensitive, so let's gather all the key metrics quickly. I've created three diagnostic tools for you, but since this is urgent, here are the **immediate steps** you should take right now:

## 🚨 RUN THESE QUERIES IMMEDIATELY

### On your **REPLICA** database:

```sql
-- Check current lag and where the bottleneck is
SELECT NOW() - pg_last_xact_replay_timestamp() AS replication_lag,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) / 1024 / 1024 AS receive_replay_delta_mb;
```

### On your **PRIMARY** database:

```sql
-- Check replication status and bottleneck location
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, write_lsn) / 1024 / 1024 AS pending_write_mb,
       pg_wal_lsn_diff(write_lsn, flush_lsn) / 1024 / 1024 AS pending_flush_mb,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) / 1024 / 1024 AS pending_replay_mb
FROM pg_stat_replication;
```

## 📊 HOW TO INTERPRET RESULTS

**If `pending_replay_mb` is HIGH (>100MB):**
- **RESOURCE BOTTLENECK** - Replica CPU/disk can't keep up
- Check replica CPU: `top` or `htop`
- Check disk I/O: `iostat -x 2 5`
- **Quick fix:** Kill long-running queries on replica:

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid != pg_backend_pid()
  AND query_start < NOW() - INTERVAL '5 minutes';
```

**If `pending_write_mb` or `pending_flush_mb` is HIGH:**
- **NETWORK ISSUE** - Data not reaching replica
- Test network: `ping <primary_ip>`
- Check PostgreSQL logs for connection errors

**If replica replication slot shows issues:**

```sql
-- On primary
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS retained_wal_mb
FROM pg_replication_slots;
```

- If `active = false` or `retained_wal_mb` > 5GB: **REPLICATION SLOT PROBLEM**

## ⚡ IMMEDIATE FIXES

**Most common cause (Resource Bottleneck):**

```sql
-- On replica: Disable hot_standby_feedback temporarily
ALTER SYSTEM SET hot_standby_feedback = off;
SELECT pg_reload_conf();
```

**Check for blocking queries:**

```sql
-- On replica
SELECT pid, state, wait_event_type, NOW() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start
LIMIT 10;
```

Run these diagnostic commands and share the results - I can then pinpoint the exact issue and provide specific fixes!
With Schema
6 tool calls
I'll help you diagnose this replication lag issue. Let me start by checking your environment and then provide diagnostic queries to identify the root cause. Given the urgency of your replication lag issue, let me give you immediate diagnostic steps and solutions:

## **Quick Diagnosis - Run These Now:**

### **On PRIMARY server:**

```sql
-- 1. Check current lag and identify bottleneck type
SELECT client_addr, application_name, state,
       write_lag, flush_lag, replay_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag_size
FROM pg_stat_replication;
```

### **On REPLICA server:**

```sql
-- 2. Identify if it's network vs. apply bottleneck
SELECT pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_size_pretty(pg_wal_lsn_diff(
           pg_last_wal_receive_lsn(),
           pg_last_wal_replay_lsn()
       )) AS backlog;
```

**Interpret the results:**
- **If `backlog` is large (>100MB)**: **CPU/Disk I/O bottleneck on replica** - WAL is arriving but not being applied fast enough
- **If `backlog` is small/zero**: **Network issue** - WAL isn't reaching the replica fast enough

---

## **Three Root Causes & Solutions:**

### **A. Network Bottleneck** (backlog = 0 on replica)

Check bandwidth saturation between primary and replica:

```bash
# On replica, test latency to primary
ping -c 20 <primary-ip>
# Check for packet loss
mtr -c 100 <primary-ip>
```

**Immediate fixes:**

```sql
-- On primary, enable WAL compression
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
```

### **B. CPU/Disk Bottleneck on Replica** (backlog > 0)

```bash
# Check disk I/O wait
iostat -x 1 5
# Check CPU usage
top -b -n 1 | grep postgres
```

**Check for recovery conflicts:**

```sql
-- On replica
SELECT * FROM pg_stat_database_conflicts;
```

**Immediate fixes:**

```sql
-- On replica, enable hot standby feedback to reduce conflicts
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```

### **C. Replication Slot Problem**

```sql
-- On primary, check slot status
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
```

**If slot is inactive:**
- Replica crashed/disconnected - check replica PostgreSQL logs
- WAL accumulating on primary disk - monitor disk space urgently

---

## **Emergency Action (if lag >5 min):**

```sql
-- On primary: Check WAL generation rate
SELECT pg_current_wal_lsn();
-- Wait 10 seconds, then run again to calculate bytes/sec

-- If primary disk filling up, check space:
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
```

**Critical decision point:**
- If primary disk <10% free: You may need to drop the replication slot and rebuild replica
- Otherwise: Focus on fixing replica bottleneck

Run these diagnostics and share the results - I'll help you pinpoint the exact cause and solution.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing replication lag, start by quantifying the lag type (network vs replay bottleneck) using the different lag metrics, then systematically check replica resources, blocking queries, and network throughput. The most common culprits are under-provisioned replicas or long-running queries blocking WAL replay, but you'll also want to verify WAL generation rate and check for replication slot issues before they cause disk exhaustion.

1. Quantify and classify the lag type
Check `replay_lag`, `flush_lag`, and `write_lag` in pg_stat_replication to pinpoint where the bottleneck lives. If `replay_lag` is significantly higher than `write_lag` and `flush_lag`, the replica is struggling to apply WAL (a CPU/disk issue). If `write_lag` is elevated, you're looking at a network problem. Per the `replication-lag-stale-data` insight, alert when `replay_lag` exceeds 5 seconds sustained for 2 minutes; treat 30+ seconds as critical. Also check `postgresql.replication.replay_lsn_delay`, which is critical when it exceeds 100MB.
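The decision rule above can be sketched as a small helper. This is a minimal illustration, not part of Schema's playbook: the 5-second alert floor follows the insight, while the 2x replay-vs-transfer ratio is an assumption to tune.

```python
from datetime import timedelta

def classify_lag(write_lag, flush_lag, replay_lag):
    """Classify the bottleneck from pg_stat_replication's three lag
    columns (timedeltas; NULL/None when the stream is idle)."""
    write_lag = write_lag or timedelta(0)
    flush_lag = flush_lag or timedelta(0)
    replay_lag = replay_lag or timedelta(0)
    alert_floor = timedelta(seconds=5)  # per the insight's alert threshold
    # Replay lag far above transfer lag: WAL arrives but applies slowly.
    if replay_lag > alert_floor and replay_lag > max(write_lag, flush_lag) * 2:
        return "replica CPU/disk (replay) bottleneck"
    # Elevated write lag: WAL is not reaching the standby quickly.
    if write_lag > alert_floor:
        return "network (write) bottleneck"
    if replay_lag > alert_floor:
        return "mixed/indeterminate: check resources and network"
    return "lag within normal bounds"
```

Feed it the three columns from a `pg_stat_replication` row to get a first-pass classification before digging into host metrics.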
2. Check replica server resource utilization
This is the most common cause of replication lag. Check CPU utilization, disk I/O wait times, and available memory on the replica server. The `replication-lag` insight explicitly calls out under-provisioned replicas as a frequent cause — sustained lag above a few megabytes under normal load indicates the replica hardware can't keep up with primary write volume. If CPU is saturated or disk I/O shows high wait times, you need to scale up the replica.
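A sketch of this triage logic follows; the 85% CPU, 20% iowait, and 512 MB thresholds are illustrative assumptions, not PostgreSQL defaults, so tune them for your hardware.

```python
def replica_resource_verdict(cpu_pct, iowait_pct, mem_available_mb):
    """Heuristic triage of replica host metrics (thresholds are assumptions)."""
    findings = []
    if cpu_pct >= 85:
        # WAL replay runs in a single recovery process, so per-core speed matters.
        findings.append("CPU saturated: recovery process starved, scale up the replica")
    if iowait_pct >= 20:
        findings.append("high I/O wait: WAL replay is disk-bound, consider faster storage")
    if mem_available_mb < 512:
        findings.append("memory pressure: page cache too small for the working set")
    return findings or ["host looks healthy; check blocking queries and network next"]
```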
3. Identify long-running queries blocking WAL replay
Query pg_stat_activity on the replica to find queries that have been running for extended periods — these can block WAL replay and cause lag to spike. The `replication-lag-stale-data` insight specifically mentions this pattern. Look for queries with state='active' and query_start timestamps more than a few minutes old. If you find blocking queries, consider setting hot_standby_feedback=on or adjusting max_standby_streaming_delay to balance read availability with replication performance.
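The filter described here can be expressed as a few lines over rows fetched from `pg_stat_activity` (represented below as dicts keyed by column name; the 5-minute threshold is an assumption):

```python
from datetime import datetime, timedelta

def replay_blockers(rows, now, threshold=timedelta(minutes=5)):
    """Return pids of active sessions older than `threshold` -- the
    candidates that can hold back WAL replay on a hot standby."""
    return [r["pid"] for r in rows
            if r["state"] == "active"
            and r["query_start"] is not None
            and now - r["query_start"] > threshold]
```

Review the returned pids before terminating anything: killing a reporting query is cheaper than a failover, but it is still a user-visible action.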
4. Verify network bandwidth and throughput
Check network saturation between primary and replica, especially if `write-lag` is elevated. Monitor network throughput during peak write periods — if you're hitting bandwidth limits, the replica can't receive WAL fast enough. The `replication-lag-standby` insight suggests adjusting max_wal_senders and wal_keep_segments on primary to optimize network utilization. Also verify there are no firewall or routing issues causing packet loss or high latency between nodes.
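The bandwidth check reduces to comparing WAL byte rate against usable link capacity. A minimal sketch, where the 60% safety fraction is an assumption to leave headroom for query and replication-protocol overhead:

```python
def link_has_headroom(wal_bytes_per_sec, link_mbit_per_sec, safety_fraction=0.6):
    """True if the WAL stream fits within `safety_fraction` of the
    primary->replica link (Mbit/s converted to bytes/s)."""
    usable_bytes_per_sec = link_mbit_per_sec * 1_000_000 / 8 * safety_fraction
    return wal_bytes_per_sec <= usable_bytes_per_sec
```

For example, a 16 MB/s WAL stream fits comfortably on a 1 Gbit link but saturates a 100 Mbit one.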
5. Check WAL generation rate and LSN progression
Compare `sent_lsn` on the primary with `replay_lsn` on the replica using pg_stat_replication and calculate the byte difference with pg_wal_lsn_diff(sent_lsn, replay_lsn). If the primary is generating WAL at an unsustainable rate (e.g., from bulk operations or inefficient queries), the replica may never catch up. The `replication-lag-cpu-saturation` insight describes a case where bad query plans caused excessive WAL generation; check pg_stat_statements on the primary for queries with unusually high WAL bytes written.
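If you only have raw `pg_lsn` strings (e.g., from two spaced samples of `pg_current_wal_lsn()`), the arithmetic can be done client-side. A sketch, using the documented pg_lsn encoding of a 32-bit high half and 32-bit low half separated by a slash:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Decode a pg_lsn like '16/B374D848' into an absolute WAL byte
    position: hex high 32 bits before the slash, low 32 bits after."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_rate(lsn_earlier: str, lsn_later: str, interval_sec: float) -> float:
    """WAL generation rate in bytes/sec from two pg_current_wal_lsn() samples."""
    return (lsn_to_bytes(lsn_later) - lsn_to_bytes(lsn_earlier)) / interval_sec
```

If this rate is persistently above what the replica can replay (step 2) or the link can carry (step 4), the lag will grow without bound until the workload changes.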
6. Check for WAL accumulation and replication slot problems
Run `du -sh /var/lib/postgresql/*/main/pg_wal/` and `ls /var/lib/postgresql/*/main/pg_wal/ | wc -l` to check for excessive WAL file accumulation. The `wal-accumulation-replication-lag` insight warns that without limits, WAL files can fill the disk entirely during prolonged lag. Query pg_replication_slots to verify your slot is active and not falling further behind. Set max_slot_wal_keep_size to prevent unbounded growth, and investigate whether the replica is actually consuming from the slot.
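The slot triage above can be sketched as a single decision function over one `pg_replication_slots` row, with `retained_bytes` computed as pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn). The 50%-of-free-disk threshold is an assumption:

```python
def slot_risk(active, retained_bytes, free_disk_bytes, max_slot_wal_keep_bytes=None):
    """Triage one replication slot for disk-exhaustion risk."""
    if not active:
        return "inactive slot: consumer is down and WAL is accumulating"
    if max_slot_wal_keep_bytes is not None and retained_bytes >= max_slot_wal_keep_bytes:
        # Past this point PostgreSQL may invalidate the slot entirely.
        return "slot at max_slot_wal_keep_size: plan a replica rebuild"
    if retained_bytes > free_disk_bytes // 2:
        return "retained WAL exceeds half of free disk: act before exhaustion"
    return "slot healthy"
```

Setting max_slot_wal_keep_size caps retention at the cost of possibly invalidating a badly lagging slot, which trades a replica rebuild for primary availability.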

Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL CloudWatch Metrics
PostgreSQL Prometheus Exporter (pg_exporter)