Storage Performance Bottleneck

Incident Response

Disk I/O latency or throughput limitations are impacting database performance.

Prompt: My PostgreSQL queries are slow and I see high disk I/O wait times. How do I determine if I need faster storage, more IOPS, or if there's a configuration issue I can fix?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

| Metric | Without Schema | With Schema |
|---|---|---|
| Tokens | 1,561 | 1,431 |
| Latency | 28.7s | 29.5s |
| Turns | 2 | 2 |
| Tool Calls | 0 | 0 |
| Schema Calls | 0 | 0 |
| Response | 3,690 chars | 3,907 chars |
Evaluation

Without Schema: B+ vs With Schema: A-

Both responses provide comprehensive, actionable guidance with specific SQL queries and diagnostic commands. The response with Schema includes two critical elements for I/O wait diagnosis that the baseline misses: enabling `track_io_timing` to measure actual I/O time in queries, and analyzing wait events from `pg_stat_activity` to distinguish I/O waits from other bottlenecks. The baseline response provides more detailed checkpoint tuning guidance and more specific configuration examples (e.g., 'shared_buffers = 4GB' vs percentages). For the specific problem of diagnosing I/O wait times, the Schema-enhanced response is more directly targeted to the issue at hand.
Without Schema
I'll help you diagnose whether this is a hardware or configuration issue. Here's a systematic approach:

## 1. **Measure Current I/O Performance**

First, understand what's actually happening:

```bash
# Check I/O wait on Linux
iostat -x 1 10

# Look for:
# - %iowait (high = processes waiting for I/O)
# - await (average wait time in ms)
# - %util (disk utilization - sustained >80% is concerning)
```

## 2. **PostgreSQL Statistics to Check**

```sql
-- Check cache hit ratio (should be >99%)
SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit) as heap_hit,
  sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) as cache_hit_ratio
FROM pg_statio_user_tables;

-- Find slowest queries (requires pg_stat_statements extension)
SELECT query, calls, total_exec_time, mean_exec_time, max_exec_time,
       shared_blks_read, shared_blks_hit
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Check for missing indexes (sequential scans on large tables)
SELECT schemaname, tablename, seq_scan, seq_tup_read,
       idx_scan, idx_tup_fetch,
       seq_tup_read / seq_scan as avg_seq_tuples
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
```

## 3. **Configuration Issues to Check First**

Before buying hardware, verify these settings in `postgresql.conf`:

```sql
-- Check current settings
SHOW shared_buffers;            -- Should be 25% of RAM (start here)
SHOW effective_cache_size;      -- Should be 50-75% of RAM
SHOW work_mem;                  -- Per-operation memory (2-4MB default, tune per query)
SHOW maintenance_work_mem;      -- For VACUUM, CREATE INDEX (256MB+)
SHOW random_page_cost;          -- Set to 1.1 for SSD, 4.0 for HDD (default)
SHOW effective_io_concurrency;  -- Set to 200 for SSD, 2 for HDD

-- Checkpoint settings (aggressive checkpoints cause I/O spikes)
SHOW checkpoint_timeout;
SHOW checkpoint_completion_target;  -- Should be 0.9
SHOW max_wal_size;                  -- Increase to reduce checkpoint frequency
```

**Common fixes:**

```ini
# For systems with SSD and 16GB RAM:
shared_buffers = 4GB
effective_cache_size = 12GB
work_mem = 16MB                # Adjust based on connection count
maintenance_work_mem = 1GB
random_page_cost = 1.1
effective_io_concurrency = 200

# Reduce checkpoint I/O spikes:
checkpoint_completion_target = 0.9
max_wal_size = 4GB
```

## 4. **Check for Database Health Issues**

```sql
-- Table bloat (requires VACUUM)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size,
       n_dead_tup, n_live_tup,
       round(n_dead_tup * 100.0 / NULLIF(n_live_tup + n_dead_tup, 0), 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Long-running queries blocking I/O
SELECT pid, now() - query_start as duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```

## 5. **When You Actually Need Hardware**

You likely need more IOPS/faster storage if:

✅ Cache hit ratio is >95% but queries still slow
✅ Indexes are optimal but disk utilization is constantly >80%
✅ `await` times consistently >10ms even with tuned config
✅ Working set exceeds available RAM significantly
✅ You've already tuned all the above settings

You can probably fix with configuration if:

❌ Cache hit ratio <90%
❌ Missing indexes on large tables
❌ Aggressive checkpoints (check logs for "checkpoint complete" every few minutes)
❌ `shared_buffers` or `work_mem` too small
❌ Tables haven't been vacuumed

## Quick Diagnostic Script

Want me to create a comprehensive diagnostic script that checks all these areas and generates a report? It would help identify whether this is config or hardware.
With Schema
I'll help you diagnose your PostgreSQL I/O issues systematically. Here's how to determine if it's a hardware or configuration problem:

## 1. Check Current I/O Patterns

**Identify slow queries with I/O stats:**

```sql
-- Enable tracking if not already on
ALTER SYSTEM SET track_io_timing = ON;
SELECT pg_reload_conf();

-- Find queries doing heavy I/O
SELECT query, calls, total_exec_time,
       blk_read_time, blk_write_time,
       shared_blks_read, shared_blks_hit,
       temp_blks_read, temp_blks_written
FROM pg_stat_statements
ORDER BY blk_read_time + blk_write_time DESC
LIMIT 20;
```

**Check cache hit ratio:**

```sql
SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit) as heap_hit,
  round(sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)), 3) as cache_hit_ratio
FROM pg_statio_user_tables;
```

- **Below 0.99 (99%)** suggests insufficient memory or poor query patterns
- **Above 0.99** but still slow suggests the physical reads that do occur are expensive

## 2. Identify Configuration Issues

**Check key memory settings:**

```sql
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;
SHOW maintenance_work_mem;
```

Common configuration problems:

- **shared_buffers too low**: Should be 25% of RAM (but not >40%)
- **work_mem too low**: Causes disk-based sorts/hashes (check temp_blks_written above)
- **effective_cache_size**: Should be 50-75% of RAM (doesn't allocate, just tells planner)

**Check for missing indexes:**

```sql
-- Sequential scans on large tables
SELECT schemaname, tablename, seq_scan, seq_tup_read,
       idx_scan, seq_tup_read / seq_scan as avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;
```

## 3. Monitor Actual I/O Wait

**At OS level:**

```bash
# Linux - check iowait %
iostat -x 2 5

# Look for:
# - High %iowait (>10% consistently is concerning)
# - High avgqu-sz (queue depth)
# - High await times (>10ms for SSD, >20ms for HDD is slow)
```

**Within PostgreSQL:**

```sql
-- Check for lock waits vs I/O waits
SELECT wait_event_type, wait_event, count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count DESC;
```

- **DataFileRead** = reading data from disk
- **WALWrite** = writing to WAL (write-heavy workload)
- **DataFileExtend** = growing files (might benefit from preallocation)

## 4. Differentiate Hardware vs Configuration

**It's likely a CONFIGURATION issue if:**

- Cache hit ratio < 99%
- Many temp_blks_written (work_mem too small)
- Large sequential scans on indexed columns
- High maintenance operations (VACUUM/ANALYZE) I/O
- WAL on same disk as data

**It's likely a HARDWARE issue if:**

- Cache hit ratio > 99% but queries still slow
- iostat shows sustained high await times even during simple queries
- Random read IOPS near storage limits
- Adequate indexes and configuration, but small physical reads take >10ms

## 5. Quick Fixes to Try First

```sql
-- Increase work_mem for sorting/hashing
SET work_mem = '64MB';  -- test per-session first

-- Analyze for better query plans
ANALYZE;

-- Check for bloat requiring VACUUM
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_stat_user_tables
WHERE n_dead_tup > n_live_tup * 0.1
ORDER BY n_dead_tup DESC;
```

## Key Decision Points

**Move to faster storage if:**

- Storage IOPS consistently at limit (check cloud metrics or `iostat`)
- Await times >10ms for SSD-backed storage
- You've optimized queries, indexes, and config but still I/O bound
- Random read workload exceeds storage capabilities

**Tune configuration if:**

- Memory settings are default/small
- temp file writes are common
- Missing indexes or bad query plans
- VACUUM/ANALYZE haven't run recently

What specific values are you seeing for cache hit ratio and `iostat` await times? That would help narrow down the issue.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing storage performance issues in PostgreSQL, start by verifying you're actually bottlenecked on disk rather than configuration. Check buffer cache hit ratio and temp file usage first — these often reveal tunable config problems that masquerade as storage issues. Only after ruling out memory and configuration problems should you measure actual I/O latency and determine if you need faster storage.

1. Check buffer cache hit ratio to rule out memory issues
The first thing I check is `postgresql-buffer-hit` — if you're below 99%, you're doing excessive disk reads that could be avoided by increasing `shared_buffers`, not by buying faster storage. A 95% hit ratio means 5% of reads hit disk, which will feel like an I/O bottleneck even on fast NVMe. This is the most common false positive for "slow storage" complaints. If your hit ratio is low and you have available RAM, bump `shared_buffers` before investigating storage.
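A minimal check using PostgreSQL's built-in statistics (the `postgresql-buffer-hit` name above is Schema's metric identifier; the standard `pg_stat_database` view exposes the same counters):

```sql
-- Cluster-wide buffer cache hit ratio per database.
-- Below ~0.99 on a steady-state workload suggests shared_buffers
-- (or total RAM) is too small -- tune that before blaming storage.
SELECT datname,
       round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS cache_hit_ratio
FROM pg_stat_database
WHERE blks_hit + blks_read > 0
ORDER BY cache_hit_ratio;
```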
2. Look for temporary file usage indicating memory spills
Check `temp-size` to see if queries are spilling to disk because `work_mem` is too low. If you're writing gigabytes of temp files per hour, that's entirely avoidable I/O that has nothing to do with storage speed. Sorts, hash joins, and aggregations that don't fit in work_mem create massive disk churn. Increasing work_mem (carefully — it's per-operation) often eliminates what looks like a storage bottleneck.
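One way to see those spills directly, using the cumulative counters in `pg_stat_database` (`temp-size` above is Schema's metric identifier):

```sql
-- Cumulative temp file volume since the last stats reset.
-- Large temp_bytes means sorts/hashes are overflowing work_mem;
-- that disk I/O disappears with tuning, not faster storage.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
ORDER BY temp_bytes DESC;
```

Setting `log_temp_files = 0` also logs every temp file as it is created, which helps attribute spills to specific queries.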
3. Measure actual I/O latency to confirm storage is the problem
Now measure real I/O latency by dividing `postgresql-io-read-time` and `postgresql-io-write-time` by the operation counts from `read`. If average read latency consistently exceeds 10ms or write latency exceeds 20ms, you likely have a genuine storage bottleneck. Compare these numbers to your storage tier's SLA — cloud volumes often have published IOPS and latency limits. This tells you whether you've hit a hard limit or whether something else is wrong.
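A sketch of that per-operation latency calculation via `pg_stat_statements`; it requires `track_io_timing = on`, and note that PostgreSQL 17 renamed `blk_read_time` to `shared_blk_read_time`:

```sql
-- Average milliseconds per block actually read from disk, per query.
-- Sustained values above ~10ms point at the storage tier itself.
SELECT query,
       shared_blks_read,
       round((blk_read_time / NULLIF(shared_blks_read, 0))::numeric, 2) AS ms_per_block_read
FROM pg_stat_statements
WHERE shared_blks_read > 0
ORDER BY blk_read_time DESC
LIMIT 10;
```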
4. Analyze checkpoint write times for IOPS saturation
Look at `checkpoint-write-time` — if checkpoints are taking multiple seconds to write dirty buffers, you're likely hitting your IOPS limit. Cross-reference with checkpoint frequency (typically every 5-15 minutes by default). If checkpoints cause user-visible latency spikes, you can tune `checkpoint_completion_target` to spread writes more smoothly, but if that doesn't help, you're genuinely saturating storage and need more IOPS.
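A rough way to read those counters from `pg_stat_bgwriter` (PostgreSQL 17 moved the checkpoint columns to `pg_stat_checkpointer` under slightly different names):

```sql
-- Average write/sync time per checkpoint since the last stats reset.
-- Multi-second averages on modest dirty-buffer volumes suggest the
-- disks cannot absorb the flush, i.e. an IOPS ceiling.
SELECT checkpoints_timed,
       checkpoints_req,
       round((checkpoint_write_time / NULLIF(checkpoints_timed + checkpoints_req, 0))::numeric, 1) AS avg_write_ms,
       round((checkpoint_sync_time  / NULLIF(checkpoints_timed + checkpoints_req, 0))::numeric, 1) AS avg_sync_ms
FROM pg_stat_bgwriter;
```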
5. Identify hot tables and indexes with per-object I/O stats
Drill into `blk-read-time`, `blk-write-time`, and `postgresql-disk-read` at the table/index level using pg_statio_user_tables and pg_statio_user_indexes. Often 80% of your I/O comes from 20% of your objects. If a few tables dominate, check for missing indexes (sequential scans on large tables), table bloat (vacuum issues), or hot spots that could be partitioned or cached differently. This reveals whether you need faster storage everywhere or just optimization for specific objects.
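For example, ranking tables by physical block reads via `pg_statio_user_tables`:

```sql
-- Which tables drive the most disk reads, and how well each is cached.
-- A handful of poorly cached tables dominating this list means you
-- need targeted fixes (indexes, partitioning), not faster disks everywhere.
SELECT schemaname,
       relname,
       heap_blks_read,
       idx_blks_read,
       round(heap_blks_hit * 100.0 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS heap_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read + coalesce(idx_blks_read, 0) DESC
LIMIT 20;
```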
6. Match storage tier to workload characteristics
Finally, analyze your I/O pattern: look at the ratio of `read-time` to `write-time` and the randomness of access. OLTP workloads need high random read/write IOPS (look for NVMe or provisioned IOPS SSD), while analytical or reporting workloads need high sequential throughput (throughput-optimized HDD may suffice). If you're on general-purpose SSD doing 50K random IOPS, you need a faster tier. If you're on HDD doing random I/O, you need a fundamentally different storage type.
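On PostgreSQL 16 and later, `pg_stat_io` gives this read/write split directly (the timing columns need `track_io_timing = on`):

```sql
-- Cluster-wide I/O broken down by backend type and context.
-- A read_time-dominated profile wants low-latency random IOPS;
-- a write-heavy one (checkpointer, WAL) wants sustained write throughput.
SELECT backend_type, object, context,
       reads, read_time, writes, write_time
FROM pg_stat_io
WHERE reads > 0 OR writes > 0
ORDER BY read_time + write_time DESC
LIMIT 15;
```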

Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog
PostgreSQL Prometheus
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration