Slow Replica Performance Investigation

Incident Response

Read replica is slower than expected for queries, even without replication lag.

Prompt: Our PostgreSQL read replica has zero replication lag but queries are running 3x slower than on the primary. What could cause this performance difference and how do I diagnose it?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens
944 vs 974
Latency
24.2s vs 21.3s
Turns
2 vs 2
Tool Calls
0 vs 0
Schema Calls
0 vs 0
Response
2,246 chars vs 2,492 chars
Evaluation
Without Schema
B+
vs
With Schema
A-
Both responses cover fundamental causes like outdated statistics, configuration differences, and hardware variations, with similar diagnostic SQL queries. The response with Schema adds critical replica-specific considerations that the other misses: hot standby feedback issues, standby conflict monitoring via pg_stat_database_conflicts, and the important detail that replicas don't auto-analyze by default. It also provides more structured guidance with clearer action labels and explicitly notes that ANALYZE is safe on read-only replicas, reducing operator hesitation. The response without Schema offers solid general PostgreSQL performance advice but treats the replica scenario more generically, missing some replica-specific diagnostic paths that could be root causes.
Without Schema
Several factors can cause a read replica to perform slower than the primary despite having zero replication lag. Here's what to investigate:

## Common Causes

**1. Missing or Outdated Statistics**
- The replica's query planner may have stale statistics
- Check when `ANALYZE` last ran on both servers

**2. Configuration Differences**
- `shared_buffers`, `work_mem`, `effective_cache_size` may differ
- `random_page_cost` and other planner settings might not match

**3. Index Bloat or Missing Indexes**
- Indexes might be more bloated on the replica
- Indexes could be missing entirely if created after replication was set up

**4. Hardware/Resource Differences**
- Different disk I/O performance (especially if using slower disks)
- Less RAM allocated
- CPU differences

**5. Cache State**
- Primary has "hot" caches from repeated queries
- Replica caches may be cold or serving different query patterns

## Diagnostic Steps

```sql
-- 1. Compare query plans on both servers
EXPLAIN (ANALYZE, BUFFERS) SELECT ... your slow query ...;

-- 2. Check statistics freshness
SELECT schemaname, relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE schemaname = 'your_schema'
ORDER BY last_analyze;

-- 3. Compare key configuration settings
SELECT name, setting
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'effective_io_concurrency')
ORDER BY name;

-- 4. Check index usage
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE schemaname = 'your_schema'
ORDER BY idx_scan DESC;

-- 5. Look for missing indexes
-- Compare \di output or pg_indexes on both servers
```

```bash
# 6. Check system-level differences
# CPU, memory, disk I/O
iostat -x 1
free -h
top
```

## Quick Fixes to Try

1. **Run ANALYZE**: `ANALYZE;` on the replica (or specific tables)
2. **Match configuration**: Copy relevant settings from primary's `postgresql.conf`
3. **Warm up caches**: Run common queries to populate buffer cache
4. **Check for missing indexes**: Compare index definitions between primary and replica

What type of queries are slow? OLTP (simple lookups) or OLAP (complex aggregations)? That can help narrow down the root cause.
With Schema
This is a common but often surprising issue. Here are the most likely causes and how to diagnose them:

## Common Causes

**1. Outdated Statistics / Missing Indexes**
- Replicas don't auto-analyze by default since they're read-only
- Query planner makes poor decisions without current statistics

**2. Configuration Differences**
- Different `work_mem`, `shared_buffers`, `effective_cache_size`
- Different planner settings (e.g., `random_page_cost`)

**3. Hardware/Resource Differences**
- Less RAM, slower disks, or fewer CPU cores on replica
- Shared hardware with other workloads

**4. Hot Standby Feedback Issues**
- Queries blocked by WAL replay conflicts
- Snapshot conflicts causing query cancellations/retries

## Diagnostic Steps

**1. Compare Query Plans**

```sql
-- On both primary and replica
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
```

Look for different plans (seq scan vs index scan), different row estimates, or buffer hit differences.

**2. Check Statistics Freshness**

```sql
-- On replica
SELECT schemaname, relname, last_analyze, last_autoanalyze,
       n_live_tup, n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'your_schema'
ORDER BY n_mod_since_analyze DESC;
```

**Fix:** Manually analyze on replica (safe even in read-only mode):

```sql
ANALYZE;
-- Or
ANALYZE specific_table;
```

**3. Compare Configurations**

```sql
-- Run on both servers
SELECT name, setting, source
FROM pg_settings
WHERE source != 'default'
  AND name IN (
    'shared_buffers', 'work_mem', 'maintenance_work_mem',
    'effective_cache_size', 'random_page_cost', 'seq_page_cost'
  )
ORDER BY name;
```

**4. Check for Standby Conflicts**

```sql
-- On replica
SELECT * FROM pg_stat_database_conflicts WHERE datname = 'your_db';
```

**5. Monitor Resource Usage**

```bash
# Check CPU, memory, disk I/O on both servers
htop
iostat -x 1
```

**6. Compare Cache Hit Ratios**

```sql
SELECT sum(heap_blks_read) AS heap_read,
       sum(heap_blks_hit) AS heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;
```

## Quick Fixes to Try

1. **Run ANALYZE on replica** (most common fix)
2. **Set `hot_standby_feedback = on`** on replica (in postgresql.conf) to reduce conflicts
3. **Match work_mem settings** between primary and replica
4. **Check if indexes exist on replica** (they should, but verify)

What type of queries are slow - analytical queries, OLTP, or specific problematic ones? That would help narrow down the diagnosis.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When a read replica shows no replication lag but queries run significantly slower than on the primary, start with configuration mismatches — this is the most common cause. Next, check buffer cache performance since cold caches or insufficient memory will degrade query speed dramatically. Finally, compare query plans and check for hot standby conflicts that might be forcing different execution strategies.

1. Compare PostgreSQL configuration between primary and replica
First thing I check is whether the replica has the same `shared_buffers`, `work_mem`, `effective_cache_size`, and `random_page_cost` settings as the primary. A replica with 2GB `shared_buffers` when the primary has 8GB will perform dramatically worse even with identical hardware. Use `SHOW ALL` on both instances and diff the output, focusing on memory and planner settings. This is the most common cause of replica performance gaps.
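A minimal sketch of this comparison; the setting list below is an illustrative subset, not exhaustive:

```sql
-- Run on both primary and replica, then diff the two result sets.
-- Focus on memory and planner cost settings.
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'effective_cache_size',
               'random_page_cost', 'maintenance_work_mem')
ORDER BY name;
```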
2. Check buffer cache hit ratios on both systems
Compare `buffer_hit` ratio between primary and replica — it should be >99% on both. If the replica's cache hit ratio is significantly lower, you've found your problem. Check `buffercache_summary_buffer_used` to see how much of shared_buffers is actually utilized. A cold cache on the replica (common after restarts or with different query patterns) will cause sequential scans to take 75+ seconds instead of sub-second on the primary.
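One way to compute the hit ratio described above, aggregated across user tables (the `nullif` guards against division by zero on a freshly started server):

```sql
-- Fraction of heap block reads served from shared_buffers.
-- Compare the result on primary and replica; both should be > 99%.
SELECT round(100.0 * sum(heap_blks_hit)
             / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2)
       AS cache_hit_pct
FROM pg_statio_user_tables;
```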
3. Run EXPLAIN ANALYZE for the same queries on both instances
Execute `EXPLAIN (ANALYZE, BUFFERS)` for your slow queries on both primary and replica and compare the plans side-by-side. Look for different join strategies, seq scans vs index scans, or dramatically different row estimates. Check the 'Buffers: shared read' values — high values indicate cache misses forcing disk I/O. Different plans often stem from stale statistics or different planner cost parameters.
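A sketch of the side-by-side check, using a hypothetical `orders` table as a stand-in for your actual slow query:

```sql
-- Run identically on primary and replica; compare plan nodes,
-- row estimates, and the 'Buffers: shared hit/read' counters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;  -- placeholder query
```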
4. Verify statistics are current on the replica
Check `stats_reset` time on both systems — if the replica's stats were recently reset or autovacuum isn't running properly, the planner lacks accurate data to choose optimal query plans. Run `SELECT schemaname, relname, last_autoanalyze, last_analyze FROM pg_stat_user_tables` to see when tables were last analyzed. If critical tables haven't been analyzed recently, run `ANALYZE` manually and retest query performance.
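The staleness check above can be written so the tables with the most un-analyzed churn surface first:

```sql
-- Tables with the most modifications since their last analyze come first.
SELECT schemaname, relname, last_analyze, last_autoanalyze, n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
```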
5. Check for hot standby query conflicts
Query `conflicts` metric to see if the replica is canceling queries due to WAL replay conflicts. High conflict counts mean queries are being interrupted for cleanup operations, forcing reruns and inflating total query time. If conflicts are frequent, consider increasing `max_standby_streaming_delay` on the replica or reducing `vacuum_defer_cleanup_age` on the primary to give queries more breathing room.
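The conflict counters live in `pg_stat_database_conflicts`, which is only populated on a standby; all-zero counters rule this cause out:

```sql
-- Run on the replica. Nonzero counters mean queries were canceled
-- to let WAL replay proceed.
SELECT datname, confl_snapshot, confl_lock, confl_tablespace,
       confl_bufferpin, confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();
```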
6. Compare hardware capacity and utilization
If configuration and query plans match but performance still lags, the replica may simply be undersized. Compare CPU count, RAM, and disk I/O capacity (IOPS/throughput) between primary and replica. Monitor CPU utilization and disk I/O wait during query execution — if the replica is at 100% CPU or showing high I/O wait while the primary isn't, you need to scale up the replica to match the primary's resources.

Technologies

Related Insights

hot standby conflicts canceling queries
Stub insight created from scenario discovery hint: hot standby conflicts canceling queries
statistics out of date
Stub insight created from scenario discovery hint: statistics out of date
replica query performance degraded
diagnostic_step
Stub insight created from scenario discovery hint: replica query performance degraded
replica undersized compared to primary
prerequisite
Stub insight created from scenario discovery hint: replica undersized compared to primary
configuration mismatch between primary and replica
prerequisite
Stub insight created from scenario discovery hint: configuration mismatch between primary and replica
different query plans on replica
prerequisite
Stub insight created from scenario discovery hint: different query plans on replica
Cache hit ratio below 99% indicates insufficient shared_buffers memory
supporting_evidence (warning)
Cold buffer cache causes dramatic sequential scan performance degradation
prerequisite (warning)

Relevant Metrics

Monitoring Interfaces

PostgreSQL Datadog
PostgreSQL CloudWatch Metrics
PostgreSQL Prometheus
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)