Without Schema tool calls
I'll help you diagnose this PostgreSQL replication lag issue quickly. Since this is time-sensitive, here are the **immediate steps** you should take right now:
## 🚨 RUN THESE QUERIES IMMEDIATELY
### On your **REPLICA** database:
```sql
-- Check current lag and where the bottleneck is
-- (NOTE: replication_lag reads artificially high on an idle primary,
-- since there are no new transactions to replay)
SELECT
    NOW() - pg_last_xact_replay_timestamp() AS replication_lag,
    pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) / 1024 / 1024 AS receive_replay_delta_mb;
```
### On your **PRIMARY** database:
```sql
-- Check replication status and bottleneck location
SELECT
client_addr,
state,
pg_wal_lsn_diff(sent_lsn, write_lsn) / 1024 / 1024 AS pending_write_mb,
pg_wal_lsn_diff(write_lsn, flush_lsn) / 1024 / 1024 AS pending_flush_mb,
pg_wal_lsn_diff(flush_lsn, replay_lsn) / 1024 / 1024 AS pending_replay_mb
FROM pg_stat_replication;
```
## 📊 HOW TO INTERPRET RESULTS
**If `pending_replay_mb` is HIGH (>100MB):**
- **RESOURCE BOTTLENECK** - Replica CPU/disk can't keep up
- Check replica CPU: `top` or `htop`
- Check disk I/O: `iostat -x 2 5`
- **Quick fix:** Kill long-running queries on replica:
```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state != 'idle'
  AND backend_type = 'client backend'  -- never terminate the WAL receiver or startup process
  AND pid != pg_backend_pid()
  AND query_start < NOW() - INTERVAL '5 minutes';
```
**If `pending_write_mb` or `pending_flush_mb` is HIGH:**
- **NETWORK ISSUE** - Data not reaching replica
- Test network: `ping <primary_ip>`
- Check PostgreSQL logs for connection errors
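If you want to script the network check, the packet-loss percentage can be pulled out of `ping`'s summary line. A minimal sketch (the `PRIMARY_IP` value is a placeholder, not your actual address):

```shell
#!/usr/bin/env bash
# Sketch: extract the packet-loss percentage from a ping summary line.
# PRIMARY_IP is a placeholder - substitute your primary's address.

parse_loss() {
  # $1 = the "... X% packet loss ..." summary line printed by ping
  echo "$1" | sed -n 's/.*[ ,]\([0-9.][0-9.]*\)% packet loss.*/\1/p'
}

# Usage (uncomment to probe the primary):
# PRIMARY_IP=10.0.0.1
# summary=$(ping -c 20 -q "$PRIMARY_IP" | grep 'packet loss')
# echo "loss: $(parse_loss "$summary")%"
```

Anything above 0% loss on a datacenter link is worth escalating to your network team.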
**If replica replication slot shows issues:**
```sql
-- On primary
SELECT slot_name, active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024 AS retained_wal_mb
FROM pg_replication_slots;
```
- If `active = false` or `retained_wal_mb` > 5000 (≈5 GB): **REPLICATION SLOT PROBLEM** - an inactive slot forces the primary to retain WAL until the disk fills
## ⚡ IMMEDIATE FIXES
**Most common cause (replay paused by queries on the replica):**
```sql
-- On replica: cancel conflicting queries sooner so replay can catch up
-- (the default is 30s; never set this to -1 during an incident)
ALTER SYSTEM SET max_standby_streaming_delay = '10s';
SELECT pg_reload_conf();
```
Note: disabling `hot_standby_feedback` does not speed up replay - it only lets the primary vacuum sooner, at the cost of more query cancellations on the replica.
**Check for blocking queries:**
```sql
-- On replica
SELECT pid, state, wait_event_type, NOW() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start LIMIT 10;
```
Run these diagnostic commands and share the results - I can then pinpoint the exact issue and provide specific fixes!
I'll help you diagnose this replication lag issue. Given the urgency, here are immediate diagnostic steps and solutions:
## **Quick Diagnosis - Run These Now:**
### **On PRIMARY server:**
```sql
-- 1. Check current lag and identify bottleneck type
SELECT
client_addr,
application_name,
state,
write_lag,
flush_lag,
replay_lag,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag_size
FROM pg_stat_replication;
```
### **On REPLICA server:**
```sql
-- 2. Identify if it's network vs. apply bottleneck
SELECT
pg_last_wal_receive_lsn() AS received,
pg_last_wal_replay_lsn() AS replayed,
pg_size_pretty(pg_wal_lsn_diff(
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()
)) AS backlog;
```
**Interpret the results:**
- **If `backlog` is large (>100 MB)**: **CPU/disk I/O bottleneck on the replica** - WAL is arriving but not being applied fast enough
- **If `backlog` is small/zero while the primary still reports lag**: **network issue** - WAL isn't reaching the replica fast enough
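To track that backlog numerically over time, note that a `pg_lsn` value is just a 64-bit byte position written as two hex words (`HIGH/LOW`), so the byte difference can be computed in plain bash. A sketch (the example LSNs are made up):

```shell
#!/usr/bin/env bash
# Sketch: compute the byte distance between two pg_lsn values ("HIGH/LOW" in hex).
# HIGH is the upper 32 bits of the 64-bit WAL position, LOW the lower 32.

lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

backlog_bytes() {
  # $1 = received LSN, $2 = replayed LSN
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}

# Example with made-up LSNs:
backlog_bytes "0/16B3748" "0/16B2000"   # prints 5960
```

A backlog that grows between samples confirms replay is falling further behind, not just momentarily lagging.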
---
## **Three Root Causes & Solutions:**
### **A. Network Bottleneck** (backlog = 0 on replica)
Check bandwidth saturation between primary and replica:
```bash
# On replica, test latency to primary
ping -c 20 <primary-ip>
# Check for packet loss
mtr -c 100 <primary-ip>
```
**Immediate fixes:**
```sql
-- On primary, enable WAL compression
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
```
### **B. CPU/Disk Bottleneck on Replica** (backlog > 0)
```bash
# Check disk I/O wait
iostat -x 1 5
# Check CPU usage
top -b -n 1 | grep postgres
```
**Check for recovery conflicts:**
```sql
-- On replica
SELECT * FROM pg_stat_database_conflicts;
```
**Immediate fixes:**
```sql
-- On replica: enable hot standby feedback to reduce recovery conflicts
-- (trade-off: the primary delays vacuum while replica queries run, so watch for bloat)
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```
### **C. Replication Slot Problem**
```sql
-- On primary, check slot status
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
```
**If slot is inactive:**
- Replica crashed/disconnected - check replica PostgreSQL logs
- WAL accumulating on primary disk - monitor disk space urgently
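That disk-space watch can be automated with a small shell check. A sketch, assuming GNU `df` and a conventional data-directory layout - `PGWAL_DIR` is an assumed path, so point it at your actual `pg_wal` directory:

```shell
#!/usr/bin/env bash
# Sketch: warn when the filesystem holding pg_wal runs low on space.
# PGWAL_DIR is an assumption - adjust to your data directory layout.
PGWAL_DIR=${PGWAL_DIR:-/var/lib/postgresql/data/pg_wal}
MIN_FREE_PCT=10

free_pct() {
  # $1 = percent of the filesystem used (integer, as reported by df)
  echo $(( 100 - $1 ))
}

if [ -d "$PGWAL_DIR" ]; then
  used=$(df --output=pcent "$PGWAL_DIR" | tail -1 | tr -dc '0-9')
  if [ "$(free_pct "$used")" -lt "$MIN_FREE_PCT" ]; then
    echo "WAL disk critically low: ${used}% used"
  fi
fi
```

Run it from cron every minute or two while the incident is open.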
---
## **Emergency Action (if lag >5 min):**
```sql
-- On primary: Check WAL generation rate
SELECT pg_current_wal_lsn();
-- Wait 10 seconds, then run again to calculate bytes/sec
-- If primary disk filling up, check space:
SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
```
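The bytes/sec arithmetic from those two samples can be scripted: decode each LSN from its `HIGH/LOW` hex form and divide the difference by the sampling interval. A sketch (the two sample LSNs below are made up):

```shell
#!/usr/bin/env bash
# Sketch: WAL generation rate from two pg_current_wal_lsn() samples.
# An LSN "HIGH/LOW" encodes a 64-bit byte position in the WAL stream.

lsn_to_bytes() {
  local hi=${1%%/*} lo=${1##*/}
  echo $(( 16#$hi * 4294967296 + 16#$lo ))
}

wal_rate_bps() {
  # $1/$2 = first/second LSN sample, $3 = seconds between samples
  echo $(( ( $(lsn_to_bytes "$2") - $(lsn_to_bytes "$1") ) / $3 ))
}

# Made-up samples taken 10 seconds apart:
wal_rate_bps "0/5000000" "0/5A00000" 10   # prints 1048576 (1 MiB/s)
```

Compare this rate against your replica's sustained disk write throughput: if the primary generates WAL faster than the replica can apply it, no configuration change will close the gap.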
**Critical decision point:**
- If the primary disk is under ~10% free: you may need to drop the replication slot so WAL can be recycled, then rebuild the replica (e.g. with `pg_basebackup`) - a last resort, since the replica must be re-seeded from scratch
- Otherwise: focus on fixing the replica bottleneck
Run these diagnostics and share the results - I'll help you pinpoint the exact cause and solution.