Replication Lag Crisis
Critical · Incident Response
Production read replica is falling behind master, causing stale data reads and potential application failures.
Prompt: “My PostgreSQL read replica on RDS is showing 45 seconds of replication lag and climbing. What's causing this and how do I fix it before it impacts users?”
With vs Without Schema
Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.
Tokens
1,013 vs 1,102
Latency
23.7s vs 23.0s
Turns
2 vs 2
Tool Calls
0 vs 0
Schema Calls
0 vs 0
Response
2,391 chars vs 3,074 chars
Evaluation
Without Schema
A-
vs
With Schema
A-
Both responses provide comprehensive, technically accurate guidance with specific SQL queries, CloudWatch metrics, and concrete parameter changes. The response with Schema.ai access shows minimal differentiation, organizing causes 'in order of likelihood' and adding an AWS CLI monitoring command, but otherwise covers identical diagnostic territory with the same depth. Neither response demonstrates significant advantage—both would reliably put the operator on the right track with immediately actionable steps. The knowledge graph access does not appear to have added substantial value for this well-documented PostgreSQL replication lag scenario.
Without Schema
I'll help you diagnose and fix the replication lag. Here's a systematic approach:
## Immediate Diagnosis
**Check these key metrics in CloudWatch:**
1. **WriteIOPS and WriteThroughput on primary** - Is write load spiking?
2. **CPUUtilization on replica** - Is it maxed out?
3. **ReadIOPS on replica** - Is it disk I/O bound?
4. **NetworkReceiveThroughput** - Network bottleneck between primary and replica?
**Check for blocking queries on the replica:**
```sql
-- Connect to your replica and run:
SELECT pid, usename, query_start, state, query
FROM pg_stat_activity
WHERE state != 'idle'
AND backend_type = 'client backend'
ORDER BY query_start;
```
**Check replication status:**
```sql
-- On primary:
SELECT client_addr, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
## Common Causes & Fixes
**1. Long-running queries on replica blocking replication**
- **Quick fix:** Kill the blocking queries: `SELECT pg_terminate_backend(pid);`
- **Long-term:** Increase `max_standby_streaming_delay` (default 30s)
**2. Replica undersized vs primary**
- **Fix:** Scale up the replica instance class to match or exceed primary
**3. High write volume on primary**
- **Quick mitigation:** Temporarily route read traffic to primary or reduce write load
- **Long-term:** Scale up both instances, add more read replicas to distribute load
**4. Replication conflict (query cancellations)**
- **Fix:** Set `hot_standby_feedback = on` on replica (prevents vacuum conflicts)
- **Warning:** Can cause bloat on primary if replica has long queries
**5. Disk I/O bottleneck**
- **Fix:** Upgrade to gp3 volumes or increase IOPS allocation
## Immediate Actions
1. **Check current parameter value:**
```sql
SHOW max_standby_streaming_delay;
```
2. **If you see query conflicts in logs, increase delay:**
- Modify parameter group: set `max_standby_streaming_delay = 300000` (5 min)
- Apply immediately (may require reboot depending on parameter)
3. **Enable hot_standby_feedback if not already:**
- Set `hot_standby_feedback = 1` in parameter group
4. **Monitor and adjust:**
- Watch CloudWatch metric `ReplicaLag`
- Check RDS logs for replication conflicts
What specific symptoms are you seeing? Are there any long-running analytical queries on the replica, or is this a sudden spike in write traffic?
With Schema
I'll help you diagnose and fix this replication lag issue. Here are the most common causes and fixes:
## Immediate Diagnostic Steps
**Check CloudWatch metrics:**
- `ReplicaLag` - what you're seeing
- `CPUUtilization` - high CPU on replica?
- `WriteIOPS` / `ReadIOPS` - I/O bottleneck?
- `NetworkReceiveThroughput` - network saturation?
- `DatabaseConnections` - connection exhaustion?
**Common causes in order of likelihood:**
### 1. **Long-running queries on the replica** (most common)
```sql
-- Check for blocking queries on the replica
SELECT pid, usename, state, query_start,
now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```
**Fix:** Kill long-running SELECT queries that are blocking WAL replay
### 2. **Write-heavy workload on primary**
- Replica can't keep up with write volume
- **Quick fix:** Scale up replica instance class (more CPU/IOPS)
- **Better fix:** Reduce write load or add read-only connection pooling
### 3. **Network bottleneck**
- Cross-AZ replication consuming bandwidth
- **Fix:** Move replica to same AZ as primary (if acceptable for DR)
### 4. **Disk I/O saturation on replica**
```bash
# Check RDS Performance Insights for I/O waits
```
**Fix:**
- Increase provisioned IOPS (if using io1/io2)
- Upgrade to larger instance with better I/O baseline
- Enable `hot_standby_feedback` to reduce conflicts
### 5. **Vacuum on primary creating WAL bloat**
- Large vacuum operations generate massive WAL
- **Fix:** Tune autovacuum settings, consider vacuumdb during maintenance windows
## Immediate Actions (Priority Order)
1. **Kill blocking queries on replica:**
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND backend_type = 'client backend'  -- don't terminate the WAL receiver or system workers
  AND query_start < now() - interval '5 minutes';
```
2. **Enable hot_standby_feedback** (if not already):
```sql
-- On replica, edit parameter group:
hot_standby_feedback = on
```
This prevents query conflicts but can cause bloat on primary.
3. **Scale up replica instance** if CPU/IOPS maxed:
- AWS Console → RDS → Modify → Choose larger instance class
- Apply immediately for urgent situations
4. **Redirect read traffic** away from laggy replica temporarily
5. **Check for missing indexes:**
```sql
-- If replica is doing table scans that primary isn't
SELECT schemaname, tablename, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_scan DESC LIMIT 10;
```
## What to monitor now
```bash
# Watch lag in real-time
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=your-replica-id \
--start-time 2026-03-31T00:00:00Z \
--end-time 2026-03-31T23:59:59Z \
--period 60 \
--statistics Average
```
**What's the current state?** Can you share:
- CloudWatch CPU/IOPS metrics for the replica?
- Output of the `pg_stat_activity` query?
- Is this lag steady at 45s or still climbing?
This will help narrow down whether it's query conflicts, resource exhaustion, or write volume.
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating replication lag on PostgreSQL RDS read replicas, start by examining the lag type (write vs flush vs replay) to understand where the bottleneck is, then check replica resource constraints (CPU and I/O) which are the most common causes. After ruling out resource saturation, look for blocking queries on the replica, excessive WAL generation on the primary, and network or configuration issues.
1. Identify which stage of replication is lagging
Check `replay_lag`, `flush_lag`, and `write_lag` metrics to understand where the bottleneck is. If `write_lag` is high but `flush_lag` and `replay_lag` are low, it's a network issue. If `replay_lag` is significantly higher than `write_lag`, the replica can't apply WAL fast enough — this points to CPU/I/O constraints or blocking queries on the replica. On RDS, use CloudWatch metrics for ReplicaLag and query `pg_stat_replication` on the primary to see these values.
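A minimal sketch of these checks, assuming PostgreSQL 10+ (where the per-stage lag columns exist) and a monitoring role that can read `pg_stat_replication`:

```sql
-- On the primary: per-stage lag for each attached replica
SELECT application_name, client_addr,
       write_lag, flush_lag, replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the replica: how far behind replay is in wall-clock terms
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```

If `replay_lag_bytes` is large while `write_lag` is near zero, the bottleneck is on the replica, not the network.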
2. Check replica CPU and disk I/O utilization
High CPU or saturated disk I/O on the replica is the most common cause of replication lag — the replica simply can't keep up with the primary's write volume. In RDS CloudWatch, check CPUUtilization (alert if >80% sustained) and ReadIOPS/WriteIOPS against your instance limits. If CPU is pegged, the replica may be undersized for the workload. I've seen cases where inefficient queries running on the replica (allowed via hot standby) compete with WAL replay for CPU cycles.
3. Find long-running queries blocking WAL replay on the replica
Query `SELECT pid, usename, state, query_start, state_change, query FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;` on the replica to find queries running for minutes or hours. Long-running SELECT queries can block WAL replay when the replay process needs to modify rows being read, causing `replay_lag` to climb while `write_lag` stays low. If you find blocking queries, consider setting `max_standby_streaming_delay` lower (default is 30s on RDS) or terminating the offending queries.
4. Check WAL generation rate on the primary
If the primary is generating WAL faster than the replica can consume it, lag will accumulate. Monitor `postgresql.wal.bytes` rate on the primary — sudden spikes from bulk inserts, updates without indexes, or missing autovacuum can overwhelm replication. Compare the WAL generation rate to the replica's network throughput and disk write capacity. If WAL generation is consistently too high for your replica instance class, you may need to upgrade the replica or reduce write volume on the primary.
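One rough way to measure this on the primary is to sample the WAL insert position twice and diff the LSNs; the LSN literals below are hypothetical placeholders, not values from this incident:

```sql
-- First sample: note the timestamp and LSN
SELECT now(), pg_current_wal_lsn();
-- ...wait ~60 seconds, then sample again...
SELECT now(), pg_current_wal_lsn();
-- Bytes of WAL generated between the two samples (hypothetical LSNs):
SELECT pg_wal_lsn_diff('2A/5F000000', '2A/3B000000') AS wal_bytes_per_minute;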
5. Verify network bandwidth and latency between primary and replica
Check RDS CloudWatch NetworkReceiveThroughput on the replica — if it's approaching instance limits (varies by instance type), you're saturating the network. Also check `sent_lsn` vs `flush_lsn` in `pg_stat_replication` to see how much data is in flight. High `postgresql.replication.data_delay` indicates network issues. In multi-AZ setups, cross-region replication can hit bandwidth limits or suffer from packet loss during network congestion.
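The in-flight check can be sketched as follows (run on the primary; assumes the PostgreSQL 10+ column names `sent_lsn`/`flush_lsn`/`replay_lsn`):

```sql
-- WAL sent to the replica but not yet flushed there = bytes in flight on the network
SELECT application_name,
       pg_wal_lsn_diff(sent_lsn, flush_lsn)    AS inflight_bytes,
       pg_wal_lsn_diff(flush_lsn, replay_lsn)  AS flushed_not_replayed_bytes
FROM pg_stat_replication;
```

A growing `inflight_bytes` points at the network; a growing `flushed_not_replayed_bytes` points back at replay on the replica.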
6. Review checkpoint settings and hot_standby_feedback configuration
If restartpoints on the replica are too infrequent, WAL replay stalls while dirty pages are flushed, and lag accumulates. Check RDS parameter group settings for `checkpoint_timeout` and `max_wal_size`. Also verify whether `hot_standby_feedback` is enabled: when on, it prevents the primary from vacuuming rows the replica might still be reading, which can cause table bloat on the primary and, ironically, slow down replication. I typically disable it unless you have specific long-running analytical queries on replicas.
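To confirm what the replica is actually running with (on RDS these values are changed via the parameter group; `SHOW` only reads them):

```sql
SHOW checkpoint_timeout;    -- upper bound on time between restartpoints on the replica
SHOW max_wal_size;          -- WAL volume allowed between checkpoints
SHOW hot_standby_feedback;  -- 'on' trades primary-side bloat for fewer replay conflicts
```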
Technologies
Related Insights
long-running transactions blocking replay
prerequisite
Stub insight created from scenario discovery hint: long-running transactions blocking replay
Network Bandwidth Saturation
warning
Network send/receive rates approaching hardware limits cause query slowdowns, especially for distributed queries or replication in multi-node clusters.
checkpoint tuning needed
Stub insight created from scenario discovery hint: checkpoint tuning needed
hot standby feedback causing bloat
Stub insight created from scenario discovery hint: hot standby feedback causing bloat
replication lag exceeding acceptable threshold
diagnostic_step
Stub insight created from scenario discovery hint: replication lag exceeding acceptable threshold
Replication lag causes replica data staleness and read inconsistencies
outcome · warning
Replication lag triggered by CPU saturation from bad query plans
prerequisite · critical
replica undersized for workload
prerequisite
Stub insight created from scenario discovery hint: replica undersized for workload
WAL generation rate exceeding checkpoint capacity
prerequisite
Stub insight created from scenario discovery hint: WAL generation rate exceeding checkpoint capacity
network latency or packet loss
prerequisite
Stub insight created from scenario discovery hint: network latency or packet loss
Replication lag causes stale reads on replicas
outcome · warning
Streaming replication lag between primary and replica nodes increases, causing replicas to serve stale data. High lag indicates replica cannot keep up with primary write volume, network issues, or resource constraints on replica.
Relevant Metrics
Monitoring Interfaces
PostgreSQL Native