Replication lag threatening data consistency

Critical Incident Response

Primary-replica replication lag increasing, risking stale reads and potential data loss during failover

Prompt: My PostgreSQL read replicas are falling behind the primary by 30 seconds and climbing. Help me diagnose if this is a resource bottleneck, network issue, or replication slot problem before it causes an outage.

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 7,911 vs 6,796
Latency: 12.2m vs 1.8m
Turns: 13 vs 12
Tool Calls: 6 vs 6
Schema Calls: 0 vs 0
Response: 2,654 chars vs 2,942 chars
Evaluation

Without Schema: B+ vs With Schema: A-
Both responses provide actionable diagnostic queries covering resource, network, and replication slot issues. The response without Schema offers solid LSN-based diagnostics but recommends disabling hot_standby_feedback, which typically worsens recovery conflicts rather than helping. The response with Schema uses more modern PostgreSQL features (write_lag/flush_lag/replay_lag columns), introduces the clearer "backlog" diagnostic concept, correctly recommends enabling hot_standby_feedback to reduce conflicts, and includes more comprehensive network diagnostics (mtr for packet loss) plus emergency disk space considerations. The with-Schema response is also better structured with clear A/B/C sections and an emergency action protocol, making it easier to follow under pressure.
Without Schema
6 tool calls
I'll help you diagnose this PostgreSQL replication lag issue quickly. Let me first check if you have any relevant database configuration or monitoring scripts in your current directory, then provide you with diagnostic queries. I'll create a comprehensive diagnostic script to help you identify the root cause. This is time-sensitive, so let's gather all the key metrics quickly. I've created three diagnostic tools for you, but since this is urgent, here are the **immediate steps** you should take right now:

## 🚨 RUN THESE QUERIES IMMEDIATELY

### On your **REPLICA** database:

```sql
-- Check current lag and where the bottleneck is
SELECT NOW() - pg_last_xact_replay_timestamp() AS replication_lag,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) / 1024 / 1024 AS receive_replay_delta_mb;
```

### On your **PRIMARY** database:

```sql
-- Check replication status and bottleneck location
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, write_lsn) / 1024 / 1024 AS pending_write_mb,
       pg_wal_lsn_diff(write_lsn, flush_lsn) / 1024 / 1024 AS pending_flush_mb,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) / 1024 / 1024 AS pending_replay_mb
FROM pg_stat_replication;
```

## 📊 HOW TO INTERPRET RESULTS

**If `pending_replay_mb` is HIGH (>100MB):**
- **RESOURCE BOTTLENECK** - Replica CPU/disk can't keep up
- Check replica CPU: `top` or `htop`
- Check disk I/O: `iostat -x 2 5`
- **Quick fix:** Kill long-running queries on replica:

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND pid != pg_backend_pid()
  AND query_start < NOW() - INTERVAL '5 minutes';
```

**If `pending_write_mb` or `pending_flush_mb` is HIGH:**
- **NETWORK ISSUE** - Data not reaching replica
- Test network: `ping <primary_ip>`
- Check PostgreSQL logs for connection errors

**If replica replication slot shows issues:**

```sql
-- On primary
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS retained_wal_mb
FROM pg_replication_slots;
```

- If `active = false` or `retained_wal_mb` > 5GB: **REPLICATION SLOT PROBLEM**

## ⚡ IMMEDIATE FIXES

**Most common cause (Resource Bottleneck):**

```sql
-- On replica: Disable hot_standby_feedback temporarily
ALTER SYSTEM SET hot_standby_feedback = off;
SELECT pg_reload_conf();
```

**Check for blocking queries:**

```sql
-- On replica
SELECT pid, state, wait_event_type, NOW() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start
LIMIT 10;
```

Run these diagnostic commands and share the results - I can then pinpoint the exact issue and provide specific fixes!
With Schema
6 tool calls
I'll help you diagnose this replication lag issue. Let me start by checking your environment and then provide diagnostic queries to identify the root cause. Given the urgency of your replication lag issue, let me give you immediate diagnostic steps and solutions:

## **Quick Diagnosis - Run These Now:**

### **On PRIMARY server:**

```sql
-- 1. Check current lag and identify bottleneck type
SELECT client_addr, application_name, state,
       write_lag, flush_lag, replay_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag_size
FROM pg_stat_replication;
```

### **On REPLICA server:**

```sql
-- 2. Identify if it's network vs. apply bottleneck
SELECT pg_last_wal_receive_lsn() AS received,
       pg_last_wal_replay_lsn() AS replayed,
       pg_size_pretty(pg_wal_lsn_diff(
           pg_last_wal_receive_lsn(),
           pg_last_wal_replay_lsn()
       )) AS backlog;
```

**Interpret the results:**
- **If `backlog` is large (>100MB)**: **CPU/Disk I/O bottleneck on replica** - WAL is arriving but not being applied fast enough
- **If `backlog` is small/zero**: **Network issue** - WAL isn't reaching the replica fast enough

---

## **Three Root Causes & Solutions:**

### **A. Network Bottleneck** (backlog = 0 on replica)

Check bandwidth saturation between primary and replica:

```bash
# On replica, test latency to primary
ping -c 20 <primary-ip>
# Check for packet loss
mtr -c 100 <primary-ip>
```

**Immediate fixes:**

```sql
-- On primary, enable WAL compression
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
```

### **B. CPU/Disk Bottleneck on Replica** (backlog > 0)

```bash
# Check disk I/O wait
iostat -x 1 5
# Check CPU usage
top -b -n 1 | grep postgres
```

**Check for recovery conflicts:**

```sql
-- On replica
SELECT * FROM pg_stat_database_conflicts;
```

**Immediate fixes:**

```sql
-- On replica, enable hot standby feedback to reduce conflicts
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```

### **C. Replication Slot Problem**

```sql
-- On primary, check slot status
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
```

**If slot is inactive:**
- Replica crashed/disconnected - check replica PostgreSQL logs
- WAL accumulating on primary disk - monitor disk space urgently

---

## **Emergency Action (if lag >5 min):**

```sql
-- On primary: Check WAL generation rate
SELECT pg_current_wal_lsn();
-- Wait 10 seconds, then run again to calculate bytes/sec

-- If primary disk filling up, check space:
SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();
```

**Critical decision point:**
- If primary disk <10% free: You may need to drop the replication slot and rebuild replica
- Otherwise: Focus on fixing replica bottleneck

Run these diagnostics and share the results - I'll help you pinpoint the exact cause and solution.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing replication lag, start by quantifying the lag type (network vs replay bottleneck) using the different lag metrics, then systematically check replica resources, blocking queries, and network throughput. The most common culprits are under-provisioned replicas or long-running queries blocking WAL replay, but you'll also want to verify WAL generation rate and check for replication slot issues before they cause disk exhaustion.

1. Quantify and classify the lag type
Check `replay_lag`, `flush_lag`, and `write_lag` in pg_stat_replication to pinpoint where the bottleneck lives. If `replay_lag` is significantly higher than `write_lag` and `flush_lag`, the replica is struggling to apply WAL (a CPU/disk issue). If `write_lag` is elevated, you're looking at a network problem. Per the `replication-lag-stale-data` insight, alert when `replay_lag` exceeds 5 seconds sustained for 2 minutes; treat 30+ seconds as critical. Also check `postgresql.replication.replay_lsn_delay`, which is critical when it exceeds 100MB.
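The decision rule above can be sketched as a small helper. This is a minimal illustration, not part of Schema's playbook: the 5-second alert floor follows the insight, while the 2x replay-vs-transfer ratio is an assumption to tune.

```python
from datetime import timedelta

def classify_lag(write_lag, flush_lag, replay_lag):
    """Classify the bottleneck from pg_stat_replication's three lag
    columns (timedeltas; NULL/None when the stream is idle)."""
    write_lag = write_lag or timedelta(0)
    flush_lag = flush_lag or timedelta(0)
    replay_lag = replay_lag or timedelta(0)
    alert_floor = timedelta(seconds=5)  # per the insight's alert threshold
    # Replay lag far above transfer lag: WAL arrives but applies slowly.
    if replay_lag > alert_floor and replay_lag > max(write_lag, flush_lag) * 2:
        return "replica CPU/disk (replay) bottleneck"
    # Elevated write lag: WAL is not reaching the standby quickly.
    if write_lag > alert_floor:
        return "network (write) bottleneck"
    if replay_lag > alert_floor:
        return "mixed/indeterminate: check resources and network"
    return "lag within normal bounds"
```

Feed it the three columns from a `pg_stat_replication` row to get a first-pass classification before digging into host metrics.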
2. Check replica server resource utilization
This is the most common cause of replication lag. Check CPU utilization, disk I/O wait times, and available memory on the replica server. The `replication-lag` insight explicitly calls out under-provisioned replicas as a frequent cause — sustained lag above a few megabytes under normal load indicates the replica hardware can't keep up with primary write volume. If CPU is saturated or disk I/O shows high wait times, you need to scale up the replica.
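A sketch of this triage logic follows; the 85% CPU, 20% iowait, and 512 MB thresholds are illustrative assumptions, not PostgreSQL defaults, so tune them for your hardware.

```python
def replica_resource_verdict(cpu_pct, iowait_pct, mem_available_mb):
    """Heuristic triage of replica host metrics (thresholds are assumptions)."""
    findings = []
    if cpu_pct >= 85:
        # WAL replay runs in a single recovery process, so per-core speed matters.
        findings.append("CPU saturated: recovery process starved, scale up the replica")
    if iowait_pct >= 20:
        findings.append("high I/O wait: WAL replay is disk-bound, consider faster storage")
    if mem_available_mb < 512:
        findings.append("memory pressure: page cache too small for the working set")
    return findings or ["host looks healthy; check blocking queries and network next"]
```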
3. Identify long-running queries blocking WAL replay
Query pg_stat_activity on the replica to find queries that have been running for extended periods — these can block WAL replay and cause lag to spike. The `replication-lag-stale-data` insight specifically mentions this pattern. Look for queries with state='active' and query_start timestamps more than a few minutes old. If you find blocking queries, consider setting hot_standby_feedback=on or adjusting max_standby_streaming_delay to balance read availability with replication performance.
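The filter described here can be expressed as a few lines over rows fetched from `pg_stat_activity` (represented below as dicts keyed by column name; the 5-minute threshold is an assumption):

```python
from datetime import datetime, timedelta

def replay_blockers(rows, now, threshold=timedelta(minutes=5)):
    """Return pids of active sessions older than `threshold` -- the
    candidates that can hold back WAL replay on a hot standby."""
    return [r["pid"] for r in rows
            if r["state"] == "active"
            and r["query_start"] is not None
            and now - r["query_start"] > threshold]
```

Review the returned pids before terminating anything: killing a reporting query is cheaper than a failover, but it is still a user-visible action.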
4. Verify network bandwidth and throughput
Check network saturation between primary and replica, especially if `write-lag` is elevated. Monitor network throughput during peak write periods — if you're hitting bandwidth limits, the replica can't receive WAL fast enough. The `replication-lag-standby` insight suggests adjusting max_wal_senders and wal_keep_segments on primary to optimize network utilization. Also verify there are no firewall or routing issues causing packet loss or high latency between nodes.
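The bandwidth check reduces to comparing WAL byte rate against usable link capacity. A minimal sketch, where the 60% safety fraction is an assumption to leave headroom for query and replication-protocol overhead:

```python
def link_has_headroom(wal_bytes_per_sec, link_mbit_per_sec, safety_fraction=0.6):
    """True if the WAL stream fits within `safety_fraction` of the
    primary->replica link (Mbit/s converted to bytes/s)."""
    usable_bytes_per_sec = link_mbit_per_sec * 1_000_000 / 8 * safety_fraction
    return wal_bytes_per_sec <= usable_bytes_per_sec
```

For example, a 16 MB/s WAL stream fits comfortably on a 1 Gbit link but saturates a 100 Mbit one.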
5. Check WAL generation rate and LSN progression
Compare `sent_lsn` on the primary with `replay_lsn` on the replica using pg_stat_replication and calculate the byte difference with pg_wal_lsn_diff(sent_lsn, replay_lsn). If the primary is generating WAL at an unsustainable rate (e.g., from bulk operations or inefficient queries), the replica may never catch up. The `replication-lag-cpu-saturation` insight describes a case where bad query plans caused excessive WAL generation; check pg_stat_statements on the primary for queries with unusually high WAL bytes written.
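If you only have raw `pg_lsn` strings (e.g., from two spaced samples of `pg_current_wal_lsn()`), the arithmetic can be done client-side. A sketch, using the documented pg_lsn encoding of a 32-bit high half and 32-bit low half separated by a slash:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Decode a pg_lsn like '16/B374D848' into an absolute WAL byte
    position: hex high 32 bits before the slash, low 32 bits after."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_rate(lsn_earlier: str, lsn_later: str, interval_sec: float) -> float:
    """WAL generation rate in bytes/sec from two pg_current_wal_lsn() samples."""
    return (lsn_to_bytes(lsn_later) - lsn_to_bytes(lsn_earlier)) / interval_sec
```

If this rate is persistently above what the replica can replay (step 2) or the link can carry (step 4), the lag will grow without bound until the workload changes.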
6. Check for WAL accumulation and replication slot problems
Run `du -sh /var/lib/postgresql/*/main/pg_wal/` and `ls /var/lib/postgresql/*/main/pg_wal/ | wc -l` to check for excessive WAL file accumulation. The `wal-accumulation-replication-lag` insight warns that without limits, WAL files can fill the disk entirely during prolonged lag. Query pg_replication_slots to verify your slot is active and not falling further behind. Set max_slot_wal_keep_size to prevent unbounded growth, and investigate whether the replica is actually consuming from the slot.
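The slot triage above can be sketched as a single decision function over one `pg_replication_slots` row, with `retained_bytes` computed as pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn). The 50%-of-free-disk threshold is an assumption:

```python
def slot_risk(active, retained_bytes, free_disk_bytes, max_slot_wal_keep_bytes=None):
    """Triage one replication slot for disk-exhaustion risk."""
    if not active:
        return "inactive slot: consumer is down and WAL is accumulating"
    if max_slot_wal_keep_bytes is not None and retained_bytes >= max_slot_wal_keep_bytes:
        # Past this point PostgreSQL may invalidate the slot entirely.
        return "slot at max_slot_wal_keep_size: plan a replica rebuild"
    if retained_bytes > free_disk_bytes // 2:
        return "retained WAL exceeds half of free disk: act before exhaustion"
    return "slot healthy"
```

Setting max_slot_wal_keep_size caps retention at the cost of possibly invalidating a badly lagging slot, which trades a replica rebuild for primary availability.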

Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL CloudWatch Metrics
PostgreSQL Prometheus Exporter (pg_exporter)