Standby Query Conflict Resolution

Incident Response

Resolving conflicts on PostgreSQL read replicas when recovery processes conflict with long-running queries, causing query cancellations.

Prompt: “Our PostgreSQL read replica is canceling long-running queries with errors about conflicting recovery processes. I see bufferpin and snapshot conflicts in pg_stat_database_conflicts. How do I balance replication consistency with allowing analytical queries to complete?”

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 1,221 vs 812

Latency: 26.3s vs 18.5s

Turns: 2 vs 2

Tool Calls: 0 vs 0

Schema Calls: 0 vs 0

Response: 3,702 chars vs 2,417 chars

Evaluation

Without Schema: A-

With Schema: B+

The response without Schema provides significantly more comprehensive coverage, including the critical old_snapshot_threshold parameter to prevent bloat from hot_standby_feedback, discussion of deprecated parameters, and multiple alternative strategies including logical replication. The response with Schema is more concise and action-oriented, with a clearer 'quick fix' section, but omits important safeguards like bloat prevention mechanisms. Both correctly identify the core solutions (hot_standby_feedback and max_standby_streaming_delay), but the depth and completeness of the response without Schema would give operators a fuller picture for making informed, sustainable decisions. The brevity of the response with Schema could help operators act faster, though at the cost of missing important long-term considerations.

Without Schema

I'll help you address PostgreSQL read replica query cancellations due to recovery conflicts. This is a common challenge when running analytical queries on hot standby replicas.

## Understanding the Problem

Recovery conflicts occur when WAL replay on the standby needs to perform actions that conflict with running queries:
- **Snapshot conflicts**: Standby needs to remove rows that a query's snapshot can still see
- **Bufferpin conflicts**: Query holds a buffer pin on a page the standby needs to modify

## Key Parameters to Tune

### On the Replica and Primary:

**1. `hot_standby_feedback` (on standby)**
```sql
-- On the replica
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();
```
- Standby tells primary about oldest xmin, preventing premature cleanup
- **Trade-off**: Can cause bloat on primary if long queries run on standby

**2. `max_standby_streaming_delay` (on standby)**
```sql
-- On the replica
ALTER SYSTEM SET max_standby_streaming_delay = '5min';  -- or higher
SELECT pg_reload_conf();
```
- How long standby waits before canceling conflicting queries
- Default is 30s; increase for analytical workloads
- Set to `-1` for unlimited (risky - can stall replication)

**3. `vacuum_defer_cleanup_age` (on primary)**
```sql
-- On the primary (parameter removed in PostgreSQL 16)
ALTER SYSTEM SET vacuum_defer_cleanup_age = 10000;
```
- Delays cleanup of dead tuples by N transactions
- **Better alternative**: Use replication slots with `hot_standby_feedback`

### On the Primary (Best Approach):

**4. `old_snapshot_threshold`**
```sql
-- On the primary (takes effect only after a restart; parameter removed in PostgreSQL 17)
ALTER SYSTEM SET old_snapshot_threshold = '6h';
```
- Limits how long snapshots can exist
- Helps prevent unbounded bloat from `hot_standby_feedback`

## Recommended Strategy

**Best practice approach:**

```sql
-- On the PRIMARY:
ALTER SYSTEM SET old_snapshot_threshold = '4h';

-- On the REPLICA:
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET max_standby_streaming_delay = '10min';

-- Reload both (old_snapshot_threshold on the primary needs a restart instead)
SELECT pg_reload_conf();
```

## Alternative Strategies

**1. Dedicated Analytics Replica**
- Use `max_standby_streaming_delay = -1` on analytics replica
- Use separate replica with normal settings for failover
- Monitor replication lag carefully

**2. Query Timeout on Replica**
```sql
-- Set on replica for analytical queries
ALTER DATABASE analytics SET statement_timeout = '30min';
```

**3. Logical Replication (PG 10+)**
- Replicate only tables needed for analytics
- No recovery conflicts (but different consistency model)

## Monitoring

Check conflict statistics:
```sql
-- On replica
SELECT datname, 
       confl_snapshot, 
       confl_bufferpin, 
       confl_lock, 
       confl_deadlock
FROM pg_stat_database_conflicts
WHERE datname = current_database();

-- Check replication lag
SELECT 
    now() - pg_last_xact_replay_timestamp() AS replication_lag,
    pg_is_in_recovery();
```

## Trade-offs Summary

| Approach | Query Stability | Primary Bloat | Replication Lag Risk |
|----------|----------------|---------------|---------------------|
| `hot_standby_feedback = on` | High | Medium | Low |
| High `max_standby_streaming_delay` | Medium | Low | Medium |
| Both combined | Highest | Medium | Low-Medium |
| Dedicated analytics replica | Highest | Medium | High (on that replica) |

## Recommended Starting Point

For most analytical workloads, start with:
- `hot_standby_feedback = on` on replica
- `max_standby_streaming_delay = 5min` on replica  
- `old_snapshot_threshold = 4h` on primary (to limit bloat)

Then adjust based on your specific query patterns and replication lag tolerance.

Would you like help implementing any of these changes or setting up monitoring for your specific environment?

With Schema

This is a classic PostgreSQL read replica challenge. The conflicts occur when WAL replay needs to modify data that your long-running queries are still reading. Here are the key strategies to address this:

## Primary Configuration Settings

**1. Increase `max_standby_streaming_delay` on the replica:**
```ini
# In postgresql.conf on the read replica
max_standby_streaming_delay = 300000  # 5 minutes (in milliseconds)
# Or set to -1 to never cancel queries (risks replica lag)
```

This delays WAL replay when it would conflict with queries. Balance this against acceptable replication lag.

**2. Enable `hot_standby_feedback` on the replica:**
```ini
# In postgresql.conf on the read replica
hot_standby_feedback = on
```

This tells the primary to retain rows that the replica's queries still need, preventing vacuum from removing them. **Caveat**: Can cause bloat on the primary if analytical queries run very long.

## Additional Strategies

**3. Configure `max_standby_archive_delay`** (if using archive recovery):
```sql
max_standby_archive_delay = 300000
```

**4. Monitor conflicts:**
```sql
-- Check conflict statistics
SELECT * FROM pg_stat_database_conflicts 
WHERE datname = 'your_database';

-- Check current replication lag
SELECT 
    client_addr,
    state,
    sync_state,
    replay_lag,
    write_lag,
    flush_lag
FROM pg_stat_replication;  -- Run on primary
```

## Best Practices

**Recommended approach:**
1. **Enable `hot_standby_feedback = on`** - This is the most effective solution for most cases
2. **Set `max_standby_streaming_delay = 60000`** (1 minute) as a safety valve
3. **Monitor primary bloat** - Watch table/index sizes on primary
4. **Consider query timeouts** - Set `statement_timeout` for analytical queries to prevent indefinite delays

**Alternative: Dedicated analytics replica**
```ini
# Create a separate replica specifically for analytics
# Configure it with more aggressive delay settings
max_standby_streaming_delay = -1  # Never cancel
hot_standby_feedback = on
```

This isolates analytical workload from operational replicas that need fresher data.

**Quick fix to apply now:**
```sql
-- On the replica (requires reload)
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET max_standby_streaming_delay = 300000;
SELECT pg_reload_conf();
```

Would you like me to help you check your current settings or create monitoring queries for these conflicts?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing standby query conflicts on PostgreSQL read replicas, start by quantifying which conflict types are occurring most frequently, then identify the affected queries and evaluate your hot_standby settings. The key is finding the right balance between max_standby_streaming_delay (how long replay will wait for queries) and hot_standby_feedback (preventing conflicts but potentially causing primary bloat).

1. Quantify the conflict types and frequency

Check `conflict-bufferpin` and `conflict-snapshot` in pg_stat_database_conflicts to understand which conflict type dominates. Buffer pin conflicts mean queries are holding buffers that recovery needs to modify, while snapshot conflicts mean recovery needs to remove row versions that a query's snapshot can still see. If you're seeing hundreds of conflicts per hour, this is severely impacting your analytical workload. The conflict type determines whether you need to adjust the max_standby_streaming_delay and max_standby_archive_delay settings or address long-running transactions on the primary.

hot standby conflicts canceling queries · confl_bufferpin · confl_snapshot · conflicts
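One rough way to turn the raw counters from step 1 into a rate is to divide them by the time since the statistics were last reset. The sketch below assumes it runs on the replica, that `stats_reset` in `pg_stat_database` is populated, and that an average over the whole window is good enough for a first look. The alias `snapshot_bufferpin_per_hour` is illustrative, not a built-in metric.

```sql
-- On the replica: conflict counters plus a rough hourly average since the last stats reset.
SELECT c.datname,
       c.confl_snapshot,
       c.confl_bufferpin,
       c.confl_lock,
       c.confl_tablespace,
       c.confl_deadlock,
       d.stats_reset,
       round(
         ((c.confl_snapshot + c.confl_bufferpin)
          / greatest(extract(epoch FROM now() - d.stats_reset) / 3600.0, 1.0))::numeric,
         2
       ) AS snapshot_bufferpin_per_hour   -- illustrative alias
FROM pg_stat_database_conflicts AS c
JOIN pg_stat_database AS d USING (datid)
WHERE c.datname = current_database();
```

If `stats_reset` is NULL the rate comes back NULL. In that case, comparing two samples of the counters taken a known interval apart gives a more reliable figure.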

2. Identify the canceled queries and their duration

Look at `postgresql-transactions-open` on the replica to see which queries are running when conflicts occur. If your analytical queries consistently run 30+ minutes, they'll conflict with the default max_standby_streaming_delay of 30 seconds. Check pg_stat_activity for query patterns — if these are scheduled ETL or reporting jobs, you have predictable conflict windows. Understanding query duration helps you decide whether to increase standby delays or optimize the queries themselves.

long-running transactions blocking replay · transactions_open
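The pg_stat_activity check in step 2 could look something like the following. This is a sketch that assumes PostgreSQL 10 or later (for the `backend_type` column) and is meant to run on the replica while the analytical jobs are active. The 80-character preview and the 20-row limit are arbitrary choices.

```sql
-- On the replica: longest-running open transactions and their current queries.
SELECT pid,
       usename,
       application_name,
       state,
       now() - xact_start  AS xact_runtime,
       now() - query_start AS query_runtime,
       backend_xmin,                       -- oldest snapshot this backend still needs
       left(query, 80)     AS query_preview
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid()              -- skip this monitoring session
  AND xact_start IS NOT NULL
  AND state <> 'idle'
ORDER BY xact_start
LIMIT 20;
```

Comparing `xact_runtime` here with the configured standby delay shows which jobs are likely to be canceled.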

3. Review current hot_standby configuration parameters

Check the max_standby_streaming_delay and max_standby_archive_delay settings; the default of 30 seconds is usually too low for analytical workloads. For read replicas serving long-running queries, consider increasing to 300 seconds (5 minutes) or even higher based on your RPO requirements. Also check if hot_standby_feedback is enabled: it avoids most snapshot conflicts, though not buffer pin or lock conflicts, and has important tradeoffs we'll evaluate next. These settings directly control how long recovery waits before canceling conflicting queries.

hot standby conflicts canceling queries · conflicts
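The settings named in step 3 can be read in one pass from `pg_settings`. This is a minimal sketch, assuming it runs on the replica. The `source` column shows whether a value comes from the default, postgresql.conf, or ALTER SYSTEM.

```sql
-- On the replica: current values of the hot standby settings discussed above.
SELECT name,
       setting,
       unit,             -- the delay settings are reported in milliseconds
       source,
       pending_restart
FROM pg_settings
WHERE name IN ('hot_standby',
               'hot_standby_feedback',
               'max_standby_streaming_delay',
               'max_standby_archive_delay')
ORDER BY name;
```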

4. Evaluate hot_standby_feedback tradeoffs

Enabling hot_standby_feedback prevents query cancellations but can cause serious problems on the primary. The `hot-standby-feedback-cpu-spikes` insight shows that 30+ minute queries on the replica will block VACUUM on the primary, causing bloat and CPU spikes. The `hot-standby-feedback-causing-bloat` issue means dead tuples accumulate on frequently updated tables. Only enable this if your replica queries are relatively short (under 10 minutes) or if you can add indexes to speed them up significantly. Monitor replication lag with `stream` after enabling.

Long-running queries on read replica with hot_standby_feedback cause periodic CPU spikes on primary · hot standby feedback causing bloat · stream_count
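To see whether feedback from the replica is actually holding back cleanup, one option is to look at the xmin the primary has recorded for each standby, alongside the tables with the most dead tuples. The sketch below assumes it runs on the primary on PostgreSQL 10 or later. What counts as a worrying `xmin_age` or dead-tuple count depends on the workload and is not defined by the source.

```sql
-- On the primary: xmin reported by each standby via hot_standby_feedback.
SELECT application_name,
       client_addr,
       state,
       backend_xmin,
       age(backend_xmin) AS xmin_age,   -- transactions elapsed since that xmin
       replay_lag
FROM pg_stat_replication;

-- On the primary: xmin pinned by replication slots, if slots are in use.
SELECT slot_name,
       active,
       xmin,
       age(xmin) AS xmin_age
FROM pg_replication_slots;

-- On the primary: tables accumulating the most dead tuples.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A steadily growing `xmin_age` together with rising `n_dead_tup` on frequently updated tables is consistent with the bloat pattern described in step 4.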

5. Check for long-running transactions on the primary

Snapshot conflicts often originate from long-running transactions on the primary that haven't committed. Check `postgresql-transactions-open` on the primary database — idle transactions or forgotten BEGIN statements will cause the replica's snapshot to become invalid during recovery. Even if the primary transaction isn't actively running queries, an open transaction prevents VACUUM and causes snapshot conflicts on replicas. Find and terminate these transactions on the primary before adjusting replica settings.

long-running transactions blocking replay · transactions_open
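A query along these lines can surface the open transactions described in step 5. It is a sketch to run on the primary; the 5-minute cutoff is an assumed threshold, not a value from the source, and should be tuned to your workload.

```sql
-- On the primary: transactions open longer than an assumed 5-minute cutoff.
SELECT pid,
       usename,
       application_name,
       state,                                  -- look for 'idle in transaction'
       now() - xact_start   AS xact_age,
       now() - state_change AS time_in_state,
       left(query, 80)      AS last_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid <> pg_backend_pid()
  AND now() - xact_start > interval '5 minutes'
ORDER BY xact_start;

-- After confirming a session is safe to end, terminate it by pid:
-- SELECT pg_terminate_backend(<pid>);
```

The terminate call is left commented out on purpose, since ending a session rolls back its open transaction.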

6. Implement a balanced approach based on workload

For most analytical workloads, start by increasing max_standby_streaming_delay to 300-600 seconds rather than enabling hot_standby_feedback. This allows queries to complete without causing primary bloat. If conflicts persist and you must use hot_standby_feedback, add indexes to replica-queried tables (per the `hot-standby-feedback-cpu-spikes` insight) to reduce query duration below 10 minutes. Monitor `conflict` metrics after changes — you should see them drop to near zero. If you're running without replicas due to these issues, the `no-replica-eliminates-fast-recovery` insight reminds us that losing your replica eliminates failover capability, so fixing this is critical.

Long-running queries on read replica with hot_standby_feedback cause periodic CPU spikes on primary · Operating without replica eliminates fast recovery path after failure · conflicts · stream_count
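The delay-first approach from step 6 might be applied and checked as follows. This sketch assumes superuser access on the replica and PostgreSQL 10 or later; managed services often expose these settings through their own parameter tooling instead of ALTER SYSTEM. The value `300s` is the low end of the 300-600 second range suggested above.

```sql
-- On the replica: raise the replay wait without enabling hot_standby_feedback.
ALTER SYSTEM SET max_standby_streaming_delay = '300s';
SELECT pg_reload_conf();

-- Confirm the new value (the reload may take a moment to reach this session).
SELECT current_setting('max_standby_streaming_delay') AS streaming_delay;

-- Replay backlog in bytes: WAL received from the primary but not yet replayed.
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS replay_backlog_bytes;
```

A backlog that keeps growing while long queries run is the cost of the longer delay, so it is worth watching alongside the conflict counters.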