Operational readiness review

Proactive Health

Reviewing whether a long-running PostgreSQL deployment remains well configured and properly provisioned as its workload has evolved over time.

Prompt: I set up Postgres a while ago. The workload has evolved a lot since then. How do I make sure it's well configured and provisioned?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

| Metric | Without Schema | With Schema |
|---|---|---|
| Tokens | 1,651 | 2,964 |
| Latency | 33.2s | 1.4m |
| Turns | 11 | 32 |
| Tool Calls | 6 | 16 |
| Schema Calls | 0 | 11 |
| Response | 3,346 chars | 2,887 chars |
Evaluation
Without Schema: B+ vs With Schema: A-
Both responses provide technically sound guidance with specific SQL queries and configuration recommendations. The response without Schema is more encyclopedic, covering performance metrics, configuration, resources, tools, and provisioning in comprehensive detail. The response with Schema is more focused and prioritized, emphasizing memory configuration and autovacuum tuning as the most critical areas for evolved workloads, and includes valuable context like the work_mem multiplication warning and specific thresholds (>99% cache hit ratio vs >95%). The Schema response better addresses the "evolved workload" scenario by explaining *why* metrics matter and what they indicate, making it more likely to get the operator on the right track quickly, though it provides less detail on provisioning aspects like CPU, storage, and replication that the baseline response covers more thoroughly.
Without Schema
6 tool calls
I'll help you assess your Postgres configuration and provisioning. Let me first check what Postgres-related setup you have in this directory. I don't see any Postgres configuration files in this directory, so I'll provide you with a comprehensive approach to assess and optimize your Postgres setup:

## 1. **Assess Current Performance & Metrics**

First, connect to your database and gather key metrics:

```sql
-- Check database size and growth
SELECT pg_database.datname,
       pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Check table sizes and bloat
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
       pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- Check cache hit ratio (should be >95%)
SELECT sum(heap_blks_read) AS heap_read,
       sum(heap_blks_hit) AS heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;

-- Check for slow queries
SELECT query, calls, total_exec_time, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```

## 2. **Review Configuration Parameters**

Key settings to review based on your workload:

```sql
-- Check current settings
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW maintenance_work_mem;
SHOW work_mem;
SHOW max_connections;
SHOW random_page_cost;
```

**Common adjustments:**
- `shared_buffers`: 25% of total RAM (for dedicated DB server)
- `effective_cache_size`: 50-75% of total RAM
- `work_mem`: Total RAM / (max_connections * 2-3)
- `maintenance_work_mem`: 256MB - 2GB depending on RAM
- `random_page_cost`: 1.1 for SSD, 4.0 for HDD
- `checkpoint_completion_target`: 0.9
- `wal_buffers`: 16MB
- `default_statistics_target`: 100-500 for complex queries

## 3. **Check Resource Utilization**

Monitor your system:
- **CPU**: Check if queries are CPU-bound
- **Memory**: Ensure no swapping occurs
- **Disk I/O**: Check IOPS and latency
- **Connections**: Monitor connection pool usage

## 4. **Use Analysis Tools**

- **PgTune**: Generate config recommendations based on hardware
- **pg_stat_statements**: Analyze query performance
- **EXPLAIN ANALYZE**: Profile slow queries
- **pg_stat_bgwriter**: Check checkpoint performance
- **Monitoring**: Set up Prometheus + postgres_exporter, or tools like Datadog, New Relic

## 5. **Provisioning Checklist**

- [ ] Storage: Enough IOPS for your workload? Consider provisioned IOPS if on cloud
- [ ] CPU: Are you seeing sustained high CPU? May need more cores
- [ ] Memory: Is cache hit ratio >95%? If not, add more RAM
- [ ] Connections: Review connection pooling (PgBouncer, connection limits)
- [ ] Replication: Do you need read replicas for read-heavy workloads?

Would you like help with any specific aspect? For example, I can help you:
- Write monitoring queries for your specific workload
- Generate optimized postgresql.conf based on your server specs
- Analyze slow query logs
- Set up automated performance monitoring
With Schema
16 tool calls (11 Schema)
I'll help you review your PostgreSQL configuration and provisioning. Let me get expert guidance on this. Based on expert guidance for PostgreSQL workload evolution, here's how to ensure your setup is well-configured:

## **Key Areas to Review**

### **1. Memory Configuration**

Your core memory parameters likely need adjustment:

- **`shared_buffers`** - Set to 25-40% of total RAM for dedicated servers (defaults are often too conservative)
- **`work_mem`** - Controls memory for sorts/joins per operation. Too low causes disk spills (temp files)
- **`maintenance_work_mem`** - For VACUUM/index operations. Increase to 256MB-2GB depending on table sizes
- **`effective_cache_size`** - Set to 50-75% of total RAM (hints to query planner about available OS cache)

### **2. Critical Metrics to Check**

**Buffer cache hit ratio:**

```sql
SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
FROM pg_statio_user_tables;
```

- **Target: >99%** (95%+ acceptable, <95% indicates undersized `shared_buffers`)

**Temp files (indicates memory pressure):**

```sql
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database;
```

- High temp file counts mean `work_mem` is too low, forcing sorts/joins to disk

### **3. Autovacuum Tuning**

Critical for evolved workloads:

- **`autovacuum_work_mem`** - If tables are large, increase to avoid multiple index vacuum phases (default 64MB often too low)
- **`autovacuum_max_workers`** - Default of 3 is usually sufficient (more workers share the cost limit, making each slower)
- **`autovacuum_vacuum_cost_limit`** - Increase to make vacuum faster (balances against production load)

### **4. Configuration Audit Checklist**

1. **Check current settings:**

```sql
SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW effective_cache_size;
```

2. **Monitor for memory pressure signs:**
   - Temp file creation (indicates low `work_mem`)
   - Low cache hit ratio (indicates low `shared_buffers`)
   - Slow VACUUM with multiple index phases (low `autovacuum_work_mem`)
3. **Review per-table autovacuum settings** for high-churn tables

### **Warning: Multiplication Effect**

`work_mem` is allocated **per operation per connection**. With many connections running complex queries, total memory = `work_mem × operations × max_connections`. Start conservative and increase gradually.

**Sources:**

- [PostgreSQL Performance Tuning Guide](https://dev.to/pedrohgoncalves/diagnosing-and-fixing-critical-postgresql-performance-issues-a-deep-dive-3jj)
- [Debugging Autovacuum Problems](https://www.citusdata.com/blog/2022/07/28/debugging-postgres-autovacuum-problems-13-tips)
- [pganalyze VACUUM Advisor](https://pganalyze.com/blog/introducing-vacuum-advisor-postgres)

Want me to help you analyze specific metrics or generate configuration recommendations based on your instance size?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

For an operational readiness review of an aging PostgreSQL deployment, start by establishing baseline health and resource utilization trends, then systematically check the areas most likely to degrade as workloads evolve: memory/cache efficiency, connection capacity, vacuum/maintenance health, and statistics freshness. Finish by assessing operational risks such as transaction ID wraparound and replication lag, and by reviewing checkpoint behavior.

1. Establish baseline health and resource utilization trends
Start by reviewing `postgresql-uptime` to check for unexpected restarts, then plot `postgresql-backends` against `postgresql-connection-max` over the past 3-6 months to understand connection growth. Look at `postgresql-database-size` growth rate to project future capacity needs. This baseline tells you whether the system is stable and how demand has changed since initial deployment — if backends have grown from 20 to 200, your original configuration assumptions are likely outdated.
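As a rough sketch of the same baseline checks run directly against the system catalogs (assuming psql access rather than the `postgresql-uptime` and `postgresql-backends` monitoring metrics named above):

```sql
-- Uptime since the last postmaster start (unexpected restarts show up here).
SELECT pg_postmaster_start_time() AS started_at,
       now() - pg_postmaster_start_time() AS uptime;

-- Current backends against the configured connection ceiling.
SELECT count(*) AS backends,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;

-- Per-database size, the input for projecting growth over time.
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```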
2. Assess memory configuration and cache efficiency
Calculate your cache hit ratio by comparing `postgresql-blocks-hit` to `postgresql-blocks-read` — the ratio should be above 99%. If it's lower, you're hitting disk too often and shared_buffers is undersized for your current working set. This is the most common configuration drift as workloads grow: what was adequate memory for 10GB of hot data won't work for 50GB. Start with setting shared_buffers to 25% of total RAM if you're below that threshold.
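For instance, a minimal query for the cache hit ratio, assuming direct access to `pg_stat_database` rather than the exported `postgresql-blocks-hit` and `postgresql-blocks-read` counters:

```sql
-- Instance-wide buffer cache hit ratio; values below ~0.99 suggest
-- shared_buffers is undersized for the current working set.
SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
```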
3. Review connection capacity and pooling adequacy
Check if `postgresql-backends` is consistently above 80% of `postgresql-connection-max` — this indicates you're running close to connection exhaustion. Use `postgresql-connections-by-process` to identify which applications are consuming connections and whether they're properly pooling. In my experience, workload evolution often means more microservices or workers connecting, and initial max_connections settings (often defaulted to 100) become dangerously inadequate. Implement PgBouncer or similar connection pooling if you're not already using it.
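An illustrative sketch of the per-application breakdown (the rough equivalent of `postgresql-connections-by-process`, assuming you can query `pg_stat_activity` directly):

```sql
-- Connections grouped by user and application; a high idle share usually
-- means clients are holding connections instead of pooling them.
SELECT usename,
       application_name,
       count(*) AS connections,
       count(*) FILTER (WHERE state = 'idle') AS idle
FROM pg_stat_activity
GROUP BY usename, application_name
ORDER BY connections DESC;
```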
4. Evaluate vacuum and maintenance health
Review `dead-tuple` (n_dead_tup) counts and calculate dead tuple ratio — if it's above 10% on high-activity tables, autovacuum is falling behind. Check `postgresql-last-vacuum` and `postgresql-autovacuumed` timestamps to see if tables are going days without maintenance, and review `postgresql-table-bloat` percentage. As workloads grow from occasional updates to high-frequency writes, default autovacuum settings (vacuum_scale_factor=0.2) often can't keep up, leading to bloat that degrades query performance and wastes disk space.
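A hedged sketch of the dead-tuple and vacuum-recency check against `pg_stat_user_tables` (standing in for the `dead-tuple` and `postgresql-last-vacuum` metrics above):

```sql
-- Dead tuple ratio and last (auto)vacuum per table; ratios above ~10%
-- on busy tables suggest autovacuum is falling behind.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```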
5. Check for statistics staleness and query planning issues
If `postgresql-queries-duration-max` has increased over time without obvious application changes, check whether ANALYZE is running regularly — stale statistics cause the planner to choose inefficient execution plans. Also review `postgresql-temp-bytes`: high temporary file usage (>1GB/hour) often indicates queries spilling to disk due to poor plans or undersized work_mem. The statistics that worked for your initial small dataset won't represent current data distributions after months of growth.
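For example, a rough way to check both signals directly, assuming catalog access rather than the `postgresql-temp-bytes` metric:

```sql
-- Tables that have not been analyzed recently (stale planner statistics).
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY greatest(last_analyze, last_autoanalyze) NULLS FIRST
LIMIT 20;

-- Temp file volume per database; sustained growth points at undersized work_mem.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;
```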
6. Assess critical operational risks and safety margins
Check `postgresql-before-xid-wraparound` — if you're below 10 million transactions remaining, you have a critical wraparound risk that requires immediate vacuum intervention. Review `replication-lag` if you have replicas (lag >60s indicates capacity issues), and check for long-running transactions via `postgresql-transactions-open` and excessive lock contention in `postgresql-locks`. These operational risks often emerge gradually as workloads evolve but can cause sudden, catastrophic failures if ignored.
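A minimal sketch of the wraparound-age and long-transaction checks, assuming direct catalog access rather than the `postgresql-before-xid-wraparound` metric:

```sql
-- Transaction ID age per database; this must stay well below ~2 billion.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- Oldest open transactions, which block vacuum from reclaiming dead tuples.
SELECT pid, state, now() - xact_start AS xact_runtime, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```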
7. Review checkpoint behavior and I/O patterns
Examine `checkpoint-timed` frequency — if checkpoints are happening too frequently (every 1-2 minutes), you may need to increase checkpoint_timeout or max_wal_size. Frequent checkpoints cause I/O spikes that hurt performance. Correlate checkpoint timing with `postgresql-blocks-read` spikes to understand I/O patterns. As transaction volume grows, checkpoint settings that worked initially may now cause performance degradation during checkpoint storms.
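A hedged sketch of the checkpoint check, assuming PostgreSQL 16 or earlier where these counters live in `pg_stat_bgwriter` (newer releases move them to `pg_stat_checkpointer`):

```sql
-- Timed vs. requested checkpoints; a large requested share means checkpoints
-- are being forced by WAL volume and max_wal_size is likely too small.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;
```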

Technologies

Related Insights

Cache hit ratio below 99% indicates undersized shared_buffers or high disk I/O (warning)
Connection count approaching the max_connections limit (critical)
Autovacuum falling behind on tables causes bloat accumulation (warning)
High dead tuple ratio causes table bloat and slow queries (warning)
Stale statistics cause poor query plans (warning)

Relevant Metrics

Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration
PostgreSQL CloudWatch Metrics