Operational readiness review

Proactive Health

Reviewing whether a long-running PostgreSQL deployment remains well configured and properly provisioned as its workload has evolved over time.

Prompt: I set up Postgres a while ago. The workload has evolved a lot since then. How do I make sure it's well configured and provisioned?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

| Metric | Without Schema | With Schema |
|---|---|---|
| Tokens | 1,651 | 2,964 |
| Latency | 33.2s | 1.4m |
| Turns | 11 | 32 |
| Tool Calls | 6 | 16 |
| Schema Calls | 0 | 11 |
| Response | 3,346 chars | 2,887 chars |
Evaluation
Without Schema: B+ vs With Schema: A-
Both responses provide technically sound guidance with specific SQL queries and configuration recommendations. The response without Schema is more encyclopedic, covering performance metrics, configuration, resources, tools, and provisioning in comprehensive detail. The response with Schema is more focused and prioritized, emphasizing memory configuration and autovacuum tuning as the most critical areas for evolved workloads, and includes valuable context like the work_mem multiplication warning and specific thresholds (>99% cache hit ratio vs >95%). The Schema response better addresses the "evolved workload" scenario by explaining *why* metrics matter and what they indicate, making it more likely to get the operator on the right track quickly, though it provides less detail on provisioning aspects like CPU, storage, and replication that the baseline response covers more thoroughly.
Without Schema
6 tool calls
I'll help you assess your Postgres configuration and provisioning. Let me first check what Postgres-related setup you have in this directory. I don't see any Postgres configuration files in this directory, so I'll provide you with a comprehensive approach to assess and optimize your Postgres setup:

## 1. **Assess Current Performance & Metrics**

First, connect to your database and gather key metrics:

```sql
-- Check database size and growth
SELECT pg_database.datname,
       pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

-- Check table sizes and bloat
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
       pg_size_pretty(pg_relation_size(schemaname||'.'||tablename)) AS table_size,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename) - pg_relation_size(schemaname||'.'||tablename)) AS index_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 20;

-- Check cache hit ratio (should be >95%)
SELECT sum(heap_blks_read) AS heap_read,
       sum(heap_blks_hit) AS heap_hit,
       sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS ratio
FROM pg_statio_user_tables;

-- Check for slow queries
SELECT query, calls, total_exec_time, mean_exec_time, max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```

## 2. **Review Configuration Parameters**

Key settings to review based on your workload:

```sql
-- Check current settings
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW maintenance_work_mem;
SHOW work_mem;
SHOW max_connections;
SHOW random_page_cost;
```

**Common adjustments:**
- `shared_buffers`: 25% of total RAM (for dedicated DB server)
- `effective_cache_size`: 50-75% of total RAM
- `work_mem`: Total RAM / (max_connections * 2-3)
- `maintenance_work_mem`: 256MB - 2GB depending on RAM
- `random_page_cost`: 1.1 for SSD, 4.0 for HDD
- `checkpoint_completion_target`: 0.9
- `wal_buffers`: 16MB
- `default_statistics_target`: 100-500 for complex queries

## 3. **Check Resource Utilization**

Monitor your system:
- **CPU**: Check if queries are CPU-bound
- **Memory**: Ensure no swapping occurs
- **Disk I/O**: Check IOPS and latency
- **Connections**: Monitor connection pool usage

## 4. **Use Analysis Tools**

- **PgTune**: Generate config recommendations based on hardware
- **pg_stat_statements**: Analyze query performance
- **EXPLAIN ANALYZE**: Profile slow queries
- **pg_stat_bgwriter**: Check checkpoint performance
- **Monitoring**: Set up Prometheus + postgres_exporter, or tools like Datadog, New Relic

## 5. **Provisioning Checklist**

- [ ] Storage: Enough IOPS for your workload? Consider provisioned IOPS if on cloud
- [ ] CPU: Are you seeing sustained high CPU? May need more cores
- [ ] Memory: Is cache hit ratio >95%? If not, add more RAM
- [ ] Connections: Review connection pooling (PgBouncer, connection limits)
- [ ] Replication: Do you need read replicas for read-heavy workloads?

Would you like help with any specific aspect? For example, I can help you:
- Write monitoring queries for your specific workload
- Generate optimized postgresql.conf based on your server specs
- Analyze slow query logs
- Set up automated performance monitoring
With Schema
16 tool calls (11 Schema)
I'll help you review your PostgreSQL configuration and provisioning. Let me get expert guidance on this. Based on expert guidance for PostgreSQL workload evolution, here's how to ensure your setup is well-configured:

## **Key Areas to Review**

### **1. Memory Configuration**

Your core memory parameters likely need adjustment:

- **`shared_buffers`** - Set to 25-40% of total RAM for dedicated servers (defaults are often too conservative)
- **`work_mem`** - Controls memory for sorts/joins per operation. Too low causes disk spills (temp files)
- **`maintenance_work_mem`** - For VACUUM/index operations. Increase to 256MB-2GB depending on table sizes
- **`effective_cache_size`** - Set to 50-75% of total RAM (hints to query planner about available OS cache)

### **2. Critical Metrics to Check**

**Buffer cache hit ratio:**

```sql
SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) AS cache_hit_ratio
FROM pg_statio_user_tables;
```

- **Target: >99%** (95%+ acceptable, <95% indicates undersized `shared_buffers`)

**Temp files (indicates memory pressure):**

```sql
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database;
```

- High temp file counts mean `work_mem` is too low, forcing sorts/joins to disk

### **3. Autovacuum Tuning**

Critical for evolved workloads:

- **`autovacuum_work_mem`** - If tables are large, increase to avoid multiple index vacuum phases (default 64MB often too low)
- **`autovacuum_max_workers`** - Default of 3 is usually sufficient (more workers share the cost limit, making each slower)
- **`autovacuum_vacuum_cost_limit`** - Increase to make vacuum faster (balances against production load)

### **4. Configuration Audit Checklist**

1. **Check current settings:**

```sql
SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;
SHOW effective_cache_size;
```

2. **Monitor for memory pressure signs:**
   - Temp file creation (indicates low `work_mem`)
   - Low cache hit ratio (indicates low `shared_buffers`)
   - Slow VACUUM with multiple index phases (low `autovacuum_work_mem`)
3. **Review per-table autovacuum settings** for high-churn tables

### **Warning: Multiplication Effect**

`work_mem` is allocated **per operation per connection**. With many connections running complex queries, total memory = `work_mem × operations × max_connections`. Start conservative and increase gradually.

**Sources:**

- [PostgreSQL Performance Tuning Guide](https://dev.to/pedrohgoncalves/diagnosing-and-fixing-critical-postgresql-performance-issues-a-deep-dive-3jj)
- [Debugging Autovacuum Problems](https://www.citusdata.com/blog/2022/07/28/debugging-postgres-autovacuum-problems-13-tips)
- [pganalyze VACUUM Advisor](https://pganalyze.com/blog/introducing-vacuum-advisor-postgres)

Want me to help you analyze specific metrics or generate configuration recommendations based on your instance size?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

For an operational readiness review of an aging PostgreSQL deployment, start by establishing baseline health and resource utilization trends, then systematically check the areas most likely to degrade as workloads evolve: memory/cache efficiency, connection capacity, vacuum/maintenance health, and statistics freshness. Finish by assessing operational risks such as transaction ID wraparound and replication lag, and by reviewing checkpoint behavior.

1. Establish baseline health and resource utilization trends
Start by reviewing `postgresql-uptime` to check for unexpected restarts, then plot `postgresql-backends` against `postgresql-connection-max` over the past 3-6 months to understand connection growth. Look at `postgresql-database-size` growth rate to project future capacity needs. This baseline tells you whether the system is stable and how demand has changed since initial deployment — if backends have grown from 20 to 200, your original configuration assumptions are likely outdated.
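As a rough sketch of the same baseline checks run directly against the system catalogs (assuming psql access rather than the `postgresql-uptime` and `postgresql-backends` monitoring metrics named above):

```sql
-- Uptime since the last postmaster start (unexpected restarts show up here).
SELECT pg_postmaster_start_time() AS started_at,
       now() - pg_postmaster_start_time() AS uptime;

-- Current backends against the configured connection ceiling.
SELECT count(*) AS backends,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;

-- Per-database size, the input for projecting growth over time.
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
```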
2. Assess memory configuration and cache efficiency
Calculate your cache hit ratio by comparing `postgresql-blocks-hit` to `postgresql-blocks-read` — the ratio should be above 99%. If it's lower, you're hitting disk too often and shared_buffers is undersized for your current working set. This is the most common configuration drift as workloads grow: what was adequate memory for 10GB of hot data won't work for 50GB. Start with setting shared_buffers to 25% of total RAM if you're below that threshold.
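For instance, a minimal query for the cache hit ratio, assuming direct access to `pg_stat_database` rather than the exported `postgresql-blocks-hit` and `postgresql-blocks-read` counters:

```sql
-- Instance-wide buffer cache hit ratio; values below ~0.99 suggest
-- shared_buffers is undersized for the current working set.
SELECT sum(blks_hit)::float / NULLIF(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database;
```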
3. Review connection capacity and pooling adequacy
Check if `postgresql-backends` is consistently above 80% of `postgresql-connection-max` — this indicates you're running close to connection exhaustion. Use `postgresql-connections-by-process` to identify which applications are consuming connections and whether they're properly pooling. In my experience, workload evolution often means more microservices or workers connecting, and initial max_connections settings (often defaulted to 100) become dangerously inadequate. Implement PgBouncer or similar connection pooling if you're not already using it.
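An illustrative sketch of the per-application breakdown (the rough equivalent of `postgresql-connections-by-process`, assuming you can query `pg_stat_activity` directly):

```sql
-- Connections grouped by user and application; a high idle share usually
-- means clients are holding connections instead of pooling them.
SELECT usename,
       application_name,
       count(*) AS connections,
       count(*) FILTER (WHERE state = 'idle') AS idle
FROM pg_stat_activity
GROUP BY usename, application_name
ORDER BY connections DESC;
```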
4. Evaluate vacuum and maintenance health
Review `dead-tuple` (n_dead_tup) counts and calculate dead tuple ratio — if it's above 10% on high-activity tables, autovacuum is falling behind. Check `postgresql-last-vacuum` and `postgresql-autovacuumed` timestamps to see if tables are going days without maintenance, and review `postgresql-table-bloat` percentage. As workloads grow from occasional updates to high-frequency writes, default autovacuum settings (vacuum_scale_factor=0.2) often can't keep up, leading to bloat that degrades query performance and wastes disk space.
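A hedged sketch of the dead-tuple and vacuum-recency check against `pg_stat_user_tables` (standing in for the `dead-tuple` and `postgresql-last-vacuum` metrics above):

```sql
-- Dead tuple ratio and last (auto)vacuum per table; ratios above ~10%
-- on busy tables suggest autovacuum is falling behind.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```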
5. Check for statistics staleness and query planning issues
If `postgresql-queries-duration-max` has increased over time without obvious application changes, check whether ANALYZE is running regularly — stale statistics cause the planner to choose inefficient execution plans. Also review `postgresql-temp-bytes`: high temporary file usage (>1GB/hour) often indicates queries spilling to disk due to poor plans or undersized work_mem. The statistics that worked for your initial small dataset won't represent current data distributions after months of growth.
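For example, a rough way to check both signals directly, assuming catalog access rather than the `postgresql-temp-bytes` metric:

```sql
-- Tables that have not been analyzed recently (stale planner statistics).
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY greatest(last_analyze, last_autoanalyze) NULLS FIRST
LIMIT 20;

-- Temp file volume per database; sustained growth points at undersized work_mem.
SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
ORDER BY temp_bytes DESC;
```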
6. Assess critical operational risks and safety margins
Check `postgresql-before-xid-wraparound` — if you're below 10 million transactions remaining, you have a critical wraparound risk that requires immediate vacuum intervention. Review `replication-lag` if you have replicas (lag >60s indicates capacity issues), and check for long-running transactions via `postgresql-transactions-open` and excessive lock contention in `postgresql-locks`. These operational risks often emerge gradually as workloads evolve but can cause sudden, catastrophic failures if ignored.
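A minimal sketch of the wraparound-age and long-transaction checks, assuming direct catalog access rather than the `postgresql-before-xid-wraparound` metric:

```sql
-- Transaction ID age per database; this must stay well below ~2 billion.
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

-- Oldest open transactions, which block vacuum from reclaiming dead tuples.
SELECT pid, state, now() - xact_start AS xact_runtime, left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```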
7. Review checkpoint behavior and I/O patterns
Examine `checkpoint-timed` frequency — if checkpoints are happening too frequently (every 1-2 minutes), you may need to increase checkpoint_timeout or max_wal_size. Frequent checkpoints cause I/O spikes that hurt performance. Correlate checkpoint timing with `postgresql-blocks-read` spikes to understand I/O patterns. As transaction volume grows, checkpoint settings that worked initially may now cause performance degradation during checkpoint storms.
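A hedged sketch of the checkpoint check, assuming PostgreSQL 16 or earlier where these counters live in `pg_stat_bgwriter` (newer releases move them to `pg_stat_checkpointer`):

```sql
-- Timed vs. requested checkpoints; a large requested share means checkpoints
-- are being forced by WAL volume and max_wal_size is likely too small.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;
```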

Technologies

Related Insights

Cache hit ratio below 99% indicates undersized shared_buffers or high disk I/O (warning)
Connection count approaching the max_connections limit (critical)
Autovacuum falling behind on tables causes bloat accumulation (warning)
High dead tuple ratio causes table bloat and slow queries (warning)
Stale statistics cause poor query plans (warning)

Relevant Metrics

Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration
PostgreSQL CloudWatch Metrics