Query Plan Regression Detection

Incident Response

Previously fast queries have become slow because the query planner is choosing suboptimal execution plans.

Prompt: A query that used to run in 100ms is now taking 30 seconds, but the data volume hasn't changed much. How do I check if the query plan changed and why the optimizer might be making a bad choice now?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 969 vs 1,078
Latency: 19.3s vs 21.3s
Turns: 2 vs 2
Tool Calls: 0 vs 0
Schema Calls: 0 vs 0
Response: 2,717 chars vs 2,853 chars
Evaluation
Without Schema: B+ vs With Schema: B+
Both responses provide comprehensive coverage of query plan regression diagnosis across PostgreSQL, MySQL, and SQL Server, with concrete SQL commands for capturing plans, checking statistics, and comparing execution history. The response without Schema offers more detailed explanations of plan differences to look for (sequential vs index scans, estimated vs actual row counts), while the response with Schema presents a more organized structure with numbered sections and a clearer diagnostic checklist. Neither response demonstrates unique insights that would indicate access to curated monitoring intelligence, and both defer to requesting more specifics (database system, actual query, EXPLAIN output) rather than providing a complete diagnostic methodology. The responses are substantively equivalent in technical accuracy and actionability.
Without Schema
Here's how to diagnose and fix query plan regression:

## Check Current vs Historical Query Plan

**PostgreSQL:**
```sql
-- Get current plan
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Check plan history (if pg_stat_statements enabled)
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE query LIKE '%your_query%';
```

**MySQL:**
```sql
-- Get current plan
EXPLAIN FORMAT=JSON SELECT ...;

-- Check execution stats
SELECT * FROM performance_schema.events_statements_history
WHERE sql_text LIKE '%your_query%';
```

**SQL Server:**
```sql
-- Get current plan with stats
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- Your query here

-- Check plan cache
SELECT * FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle);
```

## Common Causes of Plan Regression

1. **Stale statistics** - Optimizer using outdated data distribution info
```sql
-- PostgreSQL
ANALYZE table_name;
-- MySQL
ANALYZE TABLE table_name;
-- SQL Server
UPDATE STATISTICS table_name WITH FULLSCAN;
```
2. **Missing/unused index** - Index was dropped, disabled, or optimizer ignores it
   - Check index usage stats
   - Look for "Seq Scan" instead of "Index Scan" in plan
3. **Data distribution skew** - New data patterns make cardinality estimates wrong
   - Check histograms
   - Consider filtered statistics for skewed columns
4. **Parameter sniffing** (SQL Server/prepared statements)
   - First execution cached a bad plan for atypical parameters
   - Fix: `OPTION (RECOMPILE)` or `OPTIMIZE FOR UNKNOWN`
5. **Configuration changes**
   - Cost parameters, work_mem, or optimizer settings changed

## Quick Diagnosis Steps

1. **Compare plans side by side** - Look for:
   - Seq Scan → Index Scan (or vice versa)
   - Hash Join → Nested Loop
   - Estimated rows vs actual rows (major mismatch = bad statistics)
2. **Check index health:**
```sql
-- PostgreSQL: bloat & missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'your_table';
```
3. **Force good plan temporarily:**
   - PostgreSQL: Use hints or `SET enable_seqscan=off` for testing
   - MySQL: `USE INDEX (index_name)`
   - SQL Server: Query hints like `WITH (INDEX(idx_name))`

## Fixes

- **Update statistics** (most common fix)
- **Create missing indexes** based on WHERE/JOIN clauses
- **Increase statistics target** for columns with high cardinality
- **Pin a good plan** (last resort) using plan guides/hints
- **Rewrite query** to help optimizer (e.g., break into CTEs, remove functions on indexed columns)

What database system are you using? I can give more specific commands if you share the actual query.
With Schema
I'll help you diagnose this query plan regression. Here's how to investigate:

## 1. **Capture and Compare Query Plans**

### PostgreSQL:
```sql
-- Get current plan
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

-- Check plan history (if pg_stat_statements enabled)
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE query LIKE '%your_table%'
ORDER BY mean_exec_time DESC;
```

### MySQL:
```sql
-- Get current plan
EXPLAIN FORMAT=JSON SELECT ...;

-- Check slow query log for historical plans
-- Or use Performance Schema:
SELECT * FROM performance_schema.events_statements_history
WHERE sql_text LIKE '%your_query%';
```

### SQL Server:
```sql
-- Get current plan
SET STATISTICS IO, TIME ON;
SELECT ...;

-- Check plan cache
SELECT * FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp
WHERE st.text LIKE '%your_query%';
```

## 2. **Common Causes of Plan Regression**

Look for these issues:

**Stale Statistics:**
```sql
-- PostgreSQL
ANALYZE your_table;
SELECT schemaname, tablename, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE tablename = 'your_table';

-- MySQL
ANALYZE TABLE your_table;

-- SQL Server
UPDATE STATISTICS your_table WITH FULLSCAN;
```

**Missing/Dropped Indexes:**
```sql
-- Check indexes still exist
-- PostgreSQL: \d+ table_name
-- MySQL:
SHOW INDEXES FROM table_name;
```

**Parameter Sniffing** (SQL Server):
```sql
-- Force recompile
OPTION (RECOMPILE)
-- Or clear plan cache
DBCC FREEPROCCACHE;
```

**Data Distribution Changes:**
- Histogram skew
- Cardinality estimate is now way off

## 3. **Quick Diagnostic Checklist**

Run these checks:

1. **Statistics age** - when were they last updated?
2. **Index availability** - are all expected indexes present?
3. **Plan differences** - compare old vs new (look for scan vs seek, nested loop vs hash join)
4. **Execution metrics** - logical reads, actual vs estimated rows
5. **Locks/blocking** - could contention be the issue?
6. **Parameter changes** - did query parameters change?

## 4. **Immediate Fixes**

Try these in order:

```sql
-- 1. Update statistics (safe, fast)
ANALYZE table_name;            -- PostgreSQL
ANALYZE TABLE table_name;      -- MySQL
UPDATE STATISTICS table_name;  -- SQL Server

-- 2. Force index hint (temporary workaround)
-- PostgreSQL
SELECT /*+ IndexScan(table_name idx_name) */ ...
-- MySQL
SELECT ... FROM table_name FORCE INDEX (idx_name) ...
-- SQL Server
SELECT ... FROM table_name WITH (INDEX(idx_name)) ...

-- 3. Disable specific optimization (PostgreSQL example)
SET enable_seqscan = off;  -- Force index usage
```

Would you like me to help investigate a specific query? If you share:
- The database system you're using
- The query that's slow
- The EXPLAIN plan output

I can provide more targeted diagnostics.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When a query suddenly slows down without data volume changes, the first thing to confirm is that the query plan actually changed by comparing EXPLAIN ANALYZE output. Most plan regressions are caused by stale statistics, so check mod_since_analyze and last_analyze timestamps next. After ruling out statistics issues, investigate whether indexes are being ignored, data distribution has shifted, or planner cost parameters need tuning.

1. Capture and compare the current query plan
Run EXPLAIN (ANALYZE, BUFFERS) on the slow query to see the actual execution plan, estimated vs. actual row counts, and I/O patterns. If you have the old plan (from pg_stat_statements or logs), compare them side by side. Look for changed join orders, a switch from index scan to sequential scan, or drastically wrong row estimates: these are smoking guns. The `postgresql.queries.shared_blks_read` metric will spike if the planner switched to sequential scans.
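As a sketch of this step, assuming PostgreSQL with the pg_stat_statements extension enabled (the `orders` table and `customer_id` filter are placeholder names, not from the scenario):

```sql
-- Capture the actual plan with row counts and buffer I/O.
-- 'orders' and the WHERE clause are illustrative placeholders.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- Pull historical timing for the same query text to confirm
-- the regression (requires pg_stat_statements).
SELECT query, calls, mean_exec_time, stddev_exec_time
FROM pg_stat_statements
WHERE query LIKE '%orders%'
ORDER BY mean_exec_time DESC;
```

A large gap between `rows=` estimates and `actual rows` in the EXPLAIN output, or a high `stddev_exec_time` relative to the mean, both point at an unstable or changed plan.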
2. Check for stale statistics on the involved tables
Query pg_stat_all_tables for the tables in your slow query and check `n_mod_since_analyze` and `last_analyze`. If you see tens of thousands of modifications since the last ANALYZE, or `last_analyze` is days or weeks old, the planner is working with outdated distribution data. This is the most common cause of plan regressions: the planner thinks a table has 1,000 rows when it now has 1M, or vice versa. Run ANALYZE manually on those tables and re-test the query.
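A minimal sketch of this check, assuming PostgreSQL (the 10,000-row threshold and the `orders` table name are illustrative assumptions):

```sql
-- Find tables whose statistics are likely stale:
-- many modifications since the last ANALYZE, or an old last_analyze.
SELECT schemaname, relname,
       n_mod_since_analyze,   -- rows inserted/updated/deleted since last ANALYZE
       last_analyze, last_autoanalyze
FROM pg_stat_all_tables
WHERE n_mod_since_analyze > 10000
ORDER BY n_mod_since_analyze DESC;

-- Refresh statistics on a suspect table, then re-test the query.
ANALYZE orders;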
3. Verify indexes are being used
Compare `postgresql.sequential_scans` vs `postgresql.index_rel_scans` for the tables in the query. If you see high sequential scan counts on tables with millions of rows, the planner is choosing table scans over indexes. Check whether the query's WHERE clause can actually use your indexes: functional expressions, type mismatches, or OR conditions can prevent index usage. Also check whether the index exists but statistics are stale (back to step 2).
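The same comparison can be made directly from PostgreSQL's statistics views when the monitoring metrics above are unavailable; this is a sketch, and the LIMIT is an arbitrary choice:

```sql
-- Sequential vs index scan counts per table since the last stats reset.
-- High seq_scan with low idx_scan on large tables suggests ignored indexes.
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 20;
```

Note that a predicate like `WHERE lower(email) = '...'` cannot use a plain index on `email`; it needs an expression index on `lower(email)`.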
4. Look for data distribution changes
Even with fresh statistics, if your data distribution has fundamentally changed (e.g., most rows now fall into one value after a bulk load, or you deleted 90% of a table), the planner's cost model might now prefer different plans. Check the actual row counts in EXPLAIN ANALYZE against the estimates; if estimates are off by 10x or more, the planner can't make good choices. Look at `postgresql.rows` trends over time to see if data volume spiked or dropped suddenly.
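One way to inspect what the planner believes about a skewed column, as a sketch for PostgreSQL (the `orders` table, `status` column, and the statistics target of 500 are hypothetical):

```sql
-- Inspect the planner's view of the column's value distribution.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- If one value now dominates and estimates are still off,
-- collect a larger sample for that column (default target is 100).
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```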
5. Review planner cost parameters
If statistics are fresh and indexes exist but the planner still chooses sequential scans, check random_page_cost, effective_cache_size, and seq_page_cost. On SSDs, random_page_cost should often be lowered to 1.1-1.5 (the default is 4.0). If effective_cache_size is too low, the planner underestimates how much data fits in memory and favors sequential scans. Use EXPLAIN to see estimated costs, then tune parameters and re-run EXPLAIN to see whether the plan improves.
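A sketch of a safe, session-scoped tuning experiment in PostgreSQL (the value 1.1 assumes SSD storage; the example query is a placeholder):

```sql
-- Check current cost settings.
SHOW random_page_cost;
SHOW effective_cache_size;

-- Try an SSD-appropriate value for this session only, then re-plan.
SET random_page_cost = 1.1;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- If the plan flips to the index and costs drop, persist the change
-- via ALTER SYSTEM or postgresql.conf; otherwise RESET random_page_cost;
```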
6. Monitor query I/O patterns over time
Track `postgresql.queries.shared_blks_read` for this specific query over time; a sudden jump indicates the planner switched from an index (reading few blocks) to a sequential scan (reading the entire table). Correlate this with the `hits` vs `reads` ratio in pg_stat_database to see if the regression coincides with buffer pool pressure or a stats reset. If `stats_reset` is recent, you may have lost historical statistics that helped the planner make better choices.
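The underlying data for that metric can be pulled directly, as a sketch assuming PostgreSQL 13+ (where the pg_stat_statements columns are named `shared_blks_read`/`shared_blks_hit`):

```sql
-- Per-query I/O profile: heavy readers with low hit ratios are the
-- queries most likely to have switched to sequential scans.
SELECT query, calls, shared_blks_read, shared_blks_hit,
       round(100.0 * shared_blks_hit /
             nullif(shared_blks_hit + shared_blks_read, 0), 1) AS hit_pct
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;

-- Database-wide cache hit ratio and the last stats reset time.
SELECT datname, blks_hit, blks_read, stats_reset
FROM pg_stat_database
WHERE datname = current_database();
```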


Related Insights

- statistics out of date (stub insight created from scenario discovery hint)
- index not being used (stub insight created from scenario discovery hint)
- cost parameter tuning needed (stub insight created from scenario discovery hint)
- query plan regression [diagnostic_step] (stub insight created from scenario discovery hint)
- Stale statistics cause inefficient query plans with exponentially worse performance [prerequisite, warning]
- data distribution change affecting estimates [prerequisite] (stub insight created from scenario discovery hint)
- planner choosing suboptimal plan [outcome] (stub insight created from scenario discovery hint)
- need for ANALYZE [supporting_evidence] (stub insight created from scenario discovery hint)
- join order changed [supporting_evidence] (stub insight created from scenario discovery hint)


Monitoring Interfaces

PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration