Slow Replica Performance Investigation

Incident Response

Read replica is slower than expected for queries, even without replication lag.

Prompt: Our PostgreSQL read replica has zero replication lag but queries are running 3x slower than on the primary. What could cause this performance difference and how do I diagnose it?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens
944 vs 974
Latency
24.2s vs 21.3s
Turns
2 vs 2
Tool Calls
0 vs 0
Schema Calls
0 vs 0
Response
2,246 chars vs 2,492 chars
Evaluation
Without Schema
B+
vs
With Schema
A-
Both responses cover fundamental causes like outdated statistics, configuration differences, and hardware variations, with similar diagnostic SQL queries. The response with Schema adds critical replica-specific considerations that the other misses: hot standby feedback issues, standby conflict monitoring via pg_stat_database_conflicts, and the important detail that replicas don't auto-analyze by default. It also provides more structured guidance with clearer action labels and explicitly notes that ANALYZE is safe on read-only replicas, reducing operator hesitation. The response without Schema offers solid general PostgreSQL performance advice but treats the replica scenario more generically, missing some replica-specific diagnostic paths that could be root causes.
Without Schema
Several factors can cause a read replica to perform slower than the primary despite having zero replication lag. Here's what to investigate:

## Common Causes

**1. Missing or Outdated Statistics**
- The replica's query planner may have stale statistics
- Check when `ANALYZE` last ran on both servers

**2. Configuration Differences**
- `shared_buffers`, `work_mem`, `effective_cache_size` may differ
- `random_page_cost` and other planner settings might not match

**3. Index Bloat or Missing Indexes**
- Indexes might be more bloated on the replica
- Indexes could be missing entirely if created after replication was set up

**4. Hardware/Resource Differences**
- Different disk I/O performance (especially if using slower disks)
- Less RAM allocated
- CPU differences

**5. Cache State**
- Primary has "hot" caches from repeated queries
- Replica caches may be cold or serving different query patterns

## Diagnostic Steps

```sql
-- 1. Compare query plans on both servers
EXPLAIN (ANALYZE, BUFFERS) SELECT ... your slow query ...;

-- 2. Check statistics freshness
SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'your_schema'
ORDER BY last_analyze;

-- 3. Compare key configuration settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'effective_io_concurrency')
ORDER BY name;

-- 4. Check index usage
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE schemaname = 'your_schema'
ORDER BY idx_scan DESC;

-- 5. Look for missing indexes
-- Compare \di output or pg_indexes on both servers
```

```bash
# 6. Check system-level differences
# CPU, memory, disk I/O
iostat -x 1
free -h
top
```

## Quick Fixes to Try

1. **Run ANALYZE**: `ANALYZE;` on the replica (or specific tables)
2. **Match configuration**: Copy relevant settings from primary's `postgresql.conf`
3. **Warm up caches**: Run common queries to populate buffer cache
4. **Check for missing indexes**: Compare index definitions between primary and replica

What type of queries are slow? OLTP (simple lookups) or OLAP (complex aggregations)? That can help narrow down the root cause.
With Schema
This is a common but often surprising issue. Here are the most likely causes and how to diagnose them:

## Common Causes

**1. Outdated Statistics / Missing Indexes**
- Replicas don't auto-analyze by default since they're read-only
- Query planner makes poor decisions without current statistics

**2. Configuration Differences**
- Different `work_mem`, `shared_buffers`, `effective_cache_size`
- Different planner settings (e.g., `random_page_cost`)

**3. Hardware/Resource Differences**
- Less RAM, slower disks, or fewer CPU cores on replica
- Shared hardware with other workloads

**4. Hot Standby Feedback Issues**
- Queries blocked by WAL replay conflicts
- Snapshot conflicts causing query cancellations/retries

## Diagnostic Steps

**1. Compare Query Plans**

```sql
-- On both primary and replica
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
```

Look for different plans (seq scan vs index scan), different row estimates, or buffer hit differences.

**2. Check Statistics Freshness**

```sql
-- On replica
SELECT schemaname, relname, last_analyze, last_autoanalyze,
       n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'your_schema'
ORDER BY n_mod_since_analyze DESC;
```

**Fix:** Manually analyze on replica (safe even in read-only mode):

```sql
ANALYZE;
-- Or
ANALYZE specific_table;
```

**3. Compare Configurations**

```sql
-- Run on both servers
SELECT name, setting, source
FROM pg_settings
WHERE source != 'default'
  AND name IN (
    'shared_buffers', 'work_mem', 'maintenance_work_mem',
    'effective_cache_size', 'random_page_cost', 'seq_page_cost'
  )
ORDER BY name;
```

**4. Check for Standby Conflicts**

```sql
-- On replica
SELECT * FROM pg_stat_database_conflicts WHERE datname = 'your_db';
```

**5. Monitor Resource Usage**

```bash
# Check CPU, memory, disk I/O on both servers
htop
iostat -x 1
```

**6. Compare Cache Hit Ratios**

```sql
SELECT sum(heap_blks_read) AS heap_read,
       sum(heap_blks_hit) AS heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
```

## Quick Fixes to Try

1. **Run ANALYZE on replica** (most common fix)
2. **Set `hot_standby_feedback = on`** on replica (in postgresql.conf) to reduce conflicts
3. **Match work_mem settings** between primary and replica
4. **Check if indexes exist on replica** (they should, but verify)

What type of queries are slow - analytical queries, OLTP, or specific problematic ones? That would help narrow down the diagnosis.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When a read replica shows no replication lag but queries run significantly slower than on the primary, start with configuration mismatches — this is the most common cause. Next, check buffer cache performance since cold caches or insufficient memory will degrade query speed dramatically. Finally, compare query plans and check for hot standby conflicts that might be forcing different execution strategies.

1. Compare PostgreSQL configuration between primary and replica
First thing I check is whether the replica has the same `shared_buffers`, `work_mem`, `effective_cache_size`, and `random_page_cost` settings as the primary. A replica with 2GB `shared_buffers` when the primary has 8GB will perform dramatically worse even with identical hardware. Use `SHOW ALL` on both instances and diff the output, focusing on memory and planner settings. This is the most common cause of replica performance gaps.
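A minimal sketch of this comparison; the setting list below is an illustrative subset, not exhaustive:

```sql
-- Run on both primary and replica, then diff the two result sets.
-- Focus on memory and planner cost settings.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'maintenance_work_mem')
ORDER BY name;
```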
2. Check buffer cache hit ratios on both systems
Compare `buffer_hit` ratio between primary and replica — it should be >99% on both. If the replica's cache hit ratio is significantly lower, you've found your problem. Check `buffercache_summary_buffer_used` to see how much of shared_buffers is actually utilized. A cold cache on the replica (common after restarts or with different query patterns) will cause sequential scans to take 75+ seconds instead of sub-second on the primary.
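One way to compute the hit ratio described above, aggregated across user tables (the `nullif` guards against division by zero on a freshly started server):

```sql
-- Fraction of heap block reads served from shared_buffers.
-- Compare the result on primary and replica; both should be > 99%.
SELECT round(100.0 * sum(heap_blks_hit)
             / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2)
       AS cache_hit_pct
FROM pg_statio_user_tables;
```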
3. Run EXPLAIN ANALYZE for the same queries on both instances
Execute `EXPLAIN (ANALYZE, BUFFERS)` for your slow queries on both primary and replica and compare the plans side-by-side. Look for different join strategies, seq scans vs index scans, or dramatically different row estimates. Check the 'Buffers: shared read' values — high values indicate cache misses forcing disk I/O. Different plans often stem from stale statistics or different planner cost parameters.
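A sketch of the side-by-side check, using a hypothetical `orders` table as a stand-in for your actual slow query:

```sql
-- Run identically on primary and replica; compare plan nodes,
-- row estimates, and the 'Buffers: shared hit/read' counters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;  -- placeholder query
```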
4. Verify statistics are current on the replica
Check `stats_reset` time on both systems — if the replica's stats were recently reset or autovacuum isn't running properly, the planner lacks accurate data to choose optimal query plans. Run `SELECT schemaname, relname, last_autoanalyze, last_analyze FROM pg_stat_user_tables` to see when tables were last analyzed. If critical tables haven't been analyzed recently, run `ANALYZE` manually and retest query performance.
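The staleness check above can be written so the tables with the most un-analyzed churn surface first:

```sql
-- Tables with the most modifications since their last analyze come first.
SELECT schemaname, relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
```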
5. Check for hot standby query conflicts
Query `conflicts` metric to see if the replica is canceling queries due to WAL replay conflicts. High conflict counts mean queries are being interrupted for cleanup operations, forcing reruns and inflating total query time. If conflicts are frequent, consider increasing `max_standby_streaming_delay` on the replica or reducing `vacuum_defer_cleanup_age` on the primary to give queries more breathing room.
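The conflict counters live in `pg_stat_database_conflicts`, which is only populated on a standby; all-zero counters rule this cause out:

```sql
-- Run on the replica. Nonzero counters mean queries were canceled
-- to let WAL replay proceed.
SELECT datname, confl_snapshot, confl_lock, confl_tablespace,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```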
6. Compare hardware capacity and utilization
If configuration and query plans match but performance still lags, the replica may simply be undersized. Compare CPU count, RAM, and disk I/O capacity (IOPS/throughput) between primary and replica. Monitor CPU utilization and disk I/O wait during query execution — if the replica is at 100% CPU or showing high I/O wait while the primary isn't, you need to scale up the replica to match the primary's resources.

Technologies

Related Insights

hot standby conflicts canceling queries
Stub insight created from scenario discovery hint: hot standby conflicts canceling queries
statistics out of date
Stub insight created from scenario discovery hint: statistics out of date
replica query performance degraded
diagnostic_step
Stub insight created from scenario discovery hint: replica query performance degraded
replica undersized compared to primary
prerequisite
Stub insight created from scenario discovery hint: replica undersized compared to primary
configuration mismatch between primary and replica
prerequisite
Stub insight created from scenario discovery hint: configuration mismatch between primary and replica
different query plans on replica
prerequisite
Stub insight created from scenario discovery hint: different query plans on replica
Cache hit ratio below 99% indicates insufficient shared_buffers memory
supporting_evidence (warning)
Cold buffer cache causes dramatic sequential scan performance degradation
prerequisite (warning)

Relevant Metrics

Monitoring Interfaces

PostgreSQL Datadog
PostgreSQL CloudWatch Metrics
PostgreSQL Prometheus
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)