Right-sizing PostgreSQL cloud instances

Capacity Planning

Determining the optimal CPU, memory, and storage configuration based on actual workload patterns to avoid over- or under-provisioning.

Prompt: Help me determine whether my Postgres deployment on Google Cloud SQL should be provisioned up or down based on my current workload. I'm seeing 45% CPU utilization on average but occasional spikes to 85%.

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 1,162 vs 813
Latency: 25.1s vs 18.8s
Turns: 6 vs 2
Tool Calls: 2 vs 0
Schema Calls: 0 vs 0
Response: 128 chars vs 1,648 chars
Evaluation
Without Schema: D+ vs With Schema: B+
The response without Schema provides minimal guidance, acknowledging the question but offering no specific recommendations or structured questions to move forward. The response with Schema delivers substantially more value with a clear decision framework including specific thresholds (keep current if spikes <5min and infrequent; scale up if approaching 85%+ regularly; scale down only if average <30% AND spikes <70%), targeted questions about spike patterns and performance impact, and suggestions for additional metrics to monitor. While comprehensive, the with-Schema response could be more immediately actionable with specific Cloud SQL commands or monitoring queries. The with-Schema response would reliably help an operator make an informed decision, while the without-Schema response leaves them without clear direction.
Without Schema
2 tool calls
I'll help you evaluate your Cloud SQL Postgres sizing. To make a good recommendation, I need to understand your workload better.
With Schema
I can help you analyze this. To give you the best recommendation, I need a bit more context about your workload patterns:

**Key questions to consider:**

1. **Spike frequency & duration**: How often do the 85% spikes occur, and how long do they last? (daily/hourly, seconds/minutes)
2. **Performance impact**: During those spikes, are you seeing:
   - Query latency increases?
   - Connection pool saturation?
   - Timeouts or slow responses?
3. **Spike patterns**: Are the spikes:
   - Predictable (e.g., daily batch jobs, peak business hours)?
   - Random/unpredictable?
   - Growing in frequency over time?
4. **Growth trajectory**: Is your workload growing, stable, or declining?

**Initial assessment based on what you've shared:**

- **45% average CPU** → Generally healthy, not over-provisioned
- **85% spikes** → Concerning if frequent or prolonged

**General recommendations:**

- **Keep current size** if spikes are brief (<5 min), infrequent (few times/day), and cause no user-facing issues
- **Scale up** if:
  - Spikes cause performance degradation
  - Spikes are increasing in frequency
  - You're approaching 85%+ regularly (less than 15% headroom is risky)
  - You have critical workloads requiring consistent low latency
- **Scale down** only if average drops below 30% AND spikes stay below 70%

**Better metrics to check:**

Could you share more details about your workload patterns? Also, it would help to look at:

- Memory utilization
- Connection count
- Disk I/O metrics
- Query performance during spike periods

Would you like me to help you write a script to analyze Cloud SQL monitoring data or set up better alerting?
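The decision framework in the with-Schema response can be sketched as a small helper. The thresholds (30%, 70%, 85%, <5 min) come from the text above; the function name and the spikes-per-day cutoff are illustrative assumptions, not part of any Cloud SQL API.

```python
def sizing_recommendation(avg_cpu_pct, spike_cpu_pct, spike_minutes,
                          spikes_per_day, user_impact):
    """Apply the rules of thumb from the framework above (sketch)."""
    # Scale up when spikes degrade user-facing performance
    if user_impact:
        return "scale up"
    # Regular excursions to 85%+ leave less than 15% headroom
    if spike_cpu_pct >= 85 and spikes_per_day > 3:
        return "scale up"
    # Scale down only if average <30% AND spikes stay <70%
    if avg_cpu_pct < 30 and spike_cpu_pct < 70:
        return "scale down"
    # Brief, infrequent, harmless spikes: current size is fine
    return "keep current size"
```

For the workload in the prompt (45% average, occasional 85% spikes, no reported impact), this lands on "keep current size".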

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When right-sizing a PostgreSQL instance, start by determining whether CPU spikes correlate with actual performance degradation rather than just resource utilization percentages. Then investigate memory efficiency through cache hit ratios and disk spills — many instances appear under-provisioned but are actually misconfigured. Finally, check I/O patterns and query efficiency to distinguish between needing more resources versus needing better tuning.

1. Correlate CPU spikes with query performance degradation
Look at `postgresql-active-queries` and `postgresql-backends` during those 85% CPU spike windows — if query counts aren't spiking and there's no queuing, your current provisioning is fine and those bursts are normal. CPU utilization alone doesn't tell you if you're under-provisioned; what matters is whether queries are slowing down or backing up during spikes. In Cloud SQL, brief spikes to 85% with good query performance indicate healthy burst behavior, not a need to scale up.
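A minimal sketch of this correlation check, assuming you have sampled CPU alongside active-query and waiting-backend counts (e.g. from your monitoring system); the thresholds and function name are illustrative:

```python
def spikes_are_benign(samples):
    """samples: list of (cpu_pct, active_queries, waiting_backends).

    If high-CPU windows show no queuing and no query pile-up relative
    to the baseline, treat the bursts as healthy (heuristic sketch).
    """
    baseline = [s for s in samples if s[0] < 60]
    spikes = [s for s in samples if s[0] >= 80]
    if not spikes or not baseline:
        return True
    base_queries = sum(s[1] for s in baseline) / len(baseline)
    for _, queries, waiting in spikes:
        # Queuing, or a >2x jump in active queries, signals real pressure
        if waiting > 0 or queries > 2 * base_queries:
            return False
    return True
```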
2. Check buffer cache hit ratio for memory efficiency
Examine `buffer-hit` ratio — it should be above 99% for typical OLTP workloads and above 95% for mixed workloads. A low hit ratio suggests either insufficient shared_buffers (see `insufficient-shared-buffers-cache`) or a working set larger than available memory. However, check your shared_buffers configuration first: Cloud SQL defaults are often conservative, and increasing shared_buffers to 25% of RAM can dramatically improve cache efficiency without scaling up the instance.
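The hit ratio is computed from the `blks_hit` and `blks_read` counters that `pg_stat_database` exposes. A sketch, with the 99%/95% targets taken from the step above:

```python
def buffer_hit_ratio(blks_hit, blks_read):
    """Cache hit ratio as derived from pg_stat_database counters:
    blks_hit / (blks_hit + blks_read)."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 1.0

def cache_healthy(blks_hit, blks_read, workload="oltp"):
    # Targets from the step above: 99% for OLTP, 95% for mixed workloads
    target = 0.99 if workload == "oltp" else 0.95
    return buffer_hit_ratio(blks_hit, blks_read) >= target
```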
3. Look for queries spilling to disk due to insufficient work_mem
Check `postgresql-temp-files` and `postgresql-temp-bytes` — any non-zero values mean queries with sorts, aggregations, or hash joins are spilling to disk because work_mem is too low (typically defaults to just 4MB). This causes massive slowdowns and makes your instance appear CPU/IO-bound when the real issue is memory configuration. As noted in `work-mem-disk-spill` and `insufficient-work-mem`, increasing work_mem to 16-64MB for OLTP or 128MB+ for analytics can eliminate this bottleneck without resizing the instance.
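Because every sort or hash operation in every backend can use up to work_mem, raising it safely means budgeting against total RAM. This is an illustrative heuristic (the 20% OS headroom and operations-per-query figures are assumptions, not an official PostgreSQL formula):

```python
def work_mem_budget_mb(total_ram_mb, shared_buffers_mb,
                       max_connections, ops_per_query=2):
    """Rough per-operation work_mem ceiling (sketch).

    Budget (RAM - shared_buffers - OS headroom) across the worst case
    of max_connections backends each running ops_per_query sorts/hashes.
    """
    os_headroom_mb = total_ram_mb * 0.2  # leave ~20% for OS/page cache
    available = total_ram_mb - shared_buffers_mb - os_headroom_mb
    # Never recommend below PostgreSQL's 4MB default
    return max(4, int(available / (max_connections * ops_per_query)))
```

For a 16GB instance with 4GB shared_buffers and 100 connections, this suggests roughly 45MB, comfortably inside the 16-64MB OLTP range cited above.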
4. Examine I/O wait times during CPU spikes
Review `postgresql-blk-read-time` and `postgresql-blk-write-time` during your CPU spike windows — high block read times (>5-10ms average) indicate I/O bottleneck rather than CPU constraint. If I/O wait is low during CPU spikes, you're truly CPU-bound and might benefit from more vCPUs. If I/O wait is high, the problem is disk throughput or inefficient queries causing excessive reads, not insufficient CPU, and scaling up CPU cores won't help much.
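The per-block read latency comes from dividing the cumulative `blk_read_time` counter (milliseconds; requires `track_io_timing = on`) by `blks_read`. A sketch using the >5ms threshold from the step above:

```python
def avg_block_read_ms(blk_read_time_ms, blks_read):
    """Average time per block read, from pg_stat_database counters."""
    return blk_read_time_ms / blks_read if blks_read else 0.0

def io_bound(blk_read_time_ms, blks_read, threshold_ms=5.0):
    # >5-10 ms per block read suggests an I/O bottleneck, not CPU;
    # adding vCPUs won't help if this returns True
    return avg_block_read_ms(blk_read_time_ms, blks_read) > threshold_ms
```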
5. Identify inefficient sequential scans on large tables
Look at `postgresql-sequential-scans` on tables over 1M rows (correlate with `postgresql-database-size` to identify large tables) — frequent sequential scans generate unnecessary CPU and I/O load that makes your instance appear under-provisioned when the real issue is missing indexes. High sequential scan counts during your CPU spikes suggest that adding indexes could eliminate the need to scale up. This is especially common after schema changes or with poorly optimized application queries.
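A sketch of this filter over the `seq_scan`, `idx_scan`, and `n_live_tup` columns that `pg_stat_user_tables` exposes; the 50% scan-ratio cutoff is an assumed heuristic, and the 1M-row floor comes from the step above:

```python
def tables_needing_indexes(stats, min_rows=1_000_000):
    """stats: list of dicts with relname, n_live_tup, seq_scan, idx_scan.

    Flags large tables whose reads are mostly sequential scans --
    candidates for new indexes rather than a bigger instance.
    """
    flagged = []
    for t in stats:
        total_scans = t["seq_scan"] + t["idx_scan"]
        if (t["n_live_tup"] >= min_rows and total_scans
                and t["seq_scan"] / total_scans > 0.5):
            flagged.append(t["relname"])
    return flagged
```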
6. Verify shared_buffers isn't oversized causing double buffering
If shared_buffers is set above 40% of total RAM, you're likely experiencing double buffering (see `double-buffering-shared-buffers-oversized`) where the same data pages exist in both PostgreSQL's cache and the OS page cache, wasting memory while other hot pages get evicted. This makes the instance appear memory-constrained when it's actually misconfigured. Check your Cloud SQL shared_buffers parameter — if it's over 40%, reduce it to 25% and monitor whether cache performance improves without scaling up.
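The 25% target and 40% ceiling above can be expressed as a simple check; the 15% lower bound for flagging overly conservative defaults is an assumption layered on top:

```python
def shared_buffers_advice(shared_buffers_mb, total_ram_mb):
    """Apply the 25%/40% rules of thumb from the step above (sketch)."""
    pct = shared_buffers_mb / total_ram_mb
    if pct > 0.40:
        # Likely double buffering: pages cached in both PG and the OS
        return ("reduce", int(total_ram_mb * 0.25))
    if pct < 0.15:
        # Conservative default; raising toward 25% often helps
        return ("increase", int(total_ram_mb * 0.25))
    return ("keep", shared_buffers_mb)
```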


Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog Integration
PostgreSQL Prometheus Exporter (pg_exporter)