Connection Pool Exhaustion

Incident Response · critical

The application is getting connection-refused errors as PostgreSQL approaches or hits its max_connections limit.

Prompt: We're getting 'FATAL: sorry, too many clients already' errors in our app logs. Our Cloud SQL PostgreSQL instance is at 95% of max_connections. What do I do right now and how do I prevent this?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

| Metric | Without Schema | With Schema |
| --- | --- | --- |
| Tokens | 1,206 | 2,431 |
| Latency | 24.0 s | 1.0 m |
| Turns | 2 | 23 |
| Tool calls | 0 | 11 |
| Schema calls | 0 | 7 |
| Response length | 3,394 chars | 2,001 chars |
Evaluation

Without Schema: B+ · With Schema: A-
Both responses provide specific, actionable guidance with SQL queries and configuration examples. The response without Schema is more comprehensive, covering multiple pooling options, investigation queries, and long-term strategies like read replicas. The response with Schema is more focused and includes critical context: it explains the memory overhead per connection (5-10MB) and why pooling solves the problem, making the recommended solution easier to understand. It also cites the official PostgreSQL documentation and mentions the important operational detail of superuser_reserved_connections. While the response without Schema offers more breadth, the response with Schema's grounding in authoritative sources and clearer explanation of the underlying issue give it a slight edge for getting an operator on the right track quickly.
Without Schema
I'll help you address this connection exhaustion issue. Here's what to do:

## **Immediate Actions (Right Now)**

1. **Increase max_connections temporarily** (Cloud SQL Console or CLI):

```bash
gcloud sql instances patch YOUR_INSTANCE_NAME \
  --database-flags=max_connections=200
```

Note: This requires a restart. Default is usually 100; consider 150-200 depending on your instance size.

2. **Kill idle/stuck connections** (if you can access the DB):

```sql
-- Find idle connections
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes';

-- Kill them (be careful!)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';
```

3. **Restart your application pods/instances** (if they're leaking connections):

```bash
kubectl rollout restart deployment/your-app  # or equivalent
```

## **Investigate the Root Cause**

```sql
-- See current connection usage by application/user
SELECT application_name, usename, count(*)
FROM pg_stat_activity
GROUP BY application_name, usename
ORDER BY count(*) DESC;

-- Check for long-running queries
SELECT pid, now() - query_start as duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```

Common culprits:

- **Connection leaks**: App not closing connections
- **No connection pooling**: Each request opens a new connection
- **Pool misconfiguration**: Pool size × app instances > max_connections
- **Connection storms**: Retries/restarts all connecting at once

## **Short-term Fixes**

1. **Add/fix connection pooling** in your app:
   - **PgBouncer** (recommended for Cloud SQL - can be a sidecar)
   - Transaction pooling mode typically allows 10-100x more clients

```bash
# Example PgBouncer config
[databases]
mydb = host=CLOUD_SQL_IP port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20  # Actual DB connections
```

2. **Fix application connection management**:
   - Ensure connections are returned to pool in `finally` blocks
   - Set reasonable connection timeouts
   - Configure pool size properly: `instances × pool_size < max_connections × 0.8`

## **Long-term Prevention**

1. **Right-size your connection pool**:

```
Total pool size = (core_count × 2) + effective_spindle_count
# For Cloud SQL, typically 10-20 connections per app instance
```

2. **Use Cloud SQL Proxy or Auth Proxy** (connection pooling built-in)

3. **Implement connection pooling strategy**:
   - App-level pooling (HikariCP for Java, psycopg2 pool for Python, pgx pool for Go)
   - External pooler (PgBouncer, PgPool-II)
   - Both (app pool → PgBouncer → Cloud SQL)

4. **Add monitoring/alerts**:

```sql
-- Create a view to monitor connection usage
CREATE VIEW connection_usage AS
SELECT count(*) as connections,
       count(*) * 100.0 / (SELECT setting::int FROM pg_settings WHERE name='max_connections') as pct
FROM pg_stat_activity;
```

Alert when connections > 80% of max.

5. **Consider Cloud SQL read replicas** for read-heavy workloads

Do you have access to your application code and Cloud SQL console? I can help you implement specific fixes based on your stack.
With Schema
11 tool calls · 7 Schema calls
I'll get you expert guidance on handling this PostgreSQL connection limit issue on Cloud SQL.

## Immediate Actions

**Right now, do this:**

1. **Identify connection hogs:**

```sql
SELECT pid, usename, application_name, state,
       now() - state_change AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY state_change;
```

2. **Kill stuck/long-running connections if needed:**

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '10 minutes';
```

3. **Check your Cloud SQL max_connections setting:**

```bash
gcloud sql instances describe YOUR_INSTANCE_NAME --format="value(settings.databaseFlags)"
```

## Why This Happens

Each PostgreSQL connection consumes **5-10MB RAM** via the fork model. At your scale:

- 200 connections = 1-2GB RAM overhead *before queries even run*
- 500+ connections = excessive context-switching kills performance
- At 95% utilization, you're one small spike away from total failure

## Long-term Fix: Deploy Connection Pooling

**Use PgBouncer** between your app and Cloud SQL:

```
# PgBouncer config
[databases]
yourdb = host=YOUR_CLOUD_SQL_IP port=5432 dbname=yourdb

[pgbouncer]
pool_mode = transaction
default_pool_size = 20   # actual DB connections
max_client_conn = 1000   # app connections allowed
query_timeout = 30       # prevent connection starvation
```

This lets 1,000 app connections share just 20 real database connections.

## Configuration Changes

Set these on your Cloud SQL instance:

- `max_connections`: increase if currently too low (but pooling is better)
- `superuser_reserved_connections = 3`: reserves admin slots so you can always connect during exhaustion

## Monitoring

Alert when `postgresql.backends` reaches **80% of max_connections** (you're already past this at 95%).

Source: [PostgreSQL Runtime Configuration](https://www.postgresql.org/docs/15/runtime.html)

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When you're getting 'too many clients' errors, first understand what's consuming your connections — idle-in-transaction sessions are the most common culprit. Check connection state distribution, hunt for leaks by application, then assess whether to increase max_connections (with memory considerations) or deploy PgBouncer for long-term connection pooling. The immediate fix is killing idle-in-transaction sessions; the sustainable fix is PgBouncer.

1. Check connection state distribution and identify what's consuming slots
Start by querying `pg_stat_activity` to see how your connections are distributed — active, idle, idle-in-transaction. Compare `postgresql.backends` against `postgresql.connection.max` and check `postgresql.percent_usage_connections` to confirm you're actually at 95%+ utilization. If you have dozens or hundreds of connections in 'idle' or 'idle in transaction' states, that's your smoking gun — connections aren't being released properly.
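A minimal sketch of that first check, using only standard `pg_stat_activity` columns (nothing instance-specific assumed):

```sql
-- How connections are distributed across states (active, idle, idle in transaction, ...)
SELECT state, count(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;

-- Current usage against the configured limit
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(count(*) * 100.0 / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```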
2. Hunt for idle-in-transaction connections holding resources
Check `idle_in_transaction_time` to find sessions stuck in open transactions — these are the #1 cause of connection exhaustion because they hold locks and prevent connection reuse. Query `pg_stat_activity WHERE state = 'idle in transaction'` and look at `state_change` timestamp. If you see connections idle-in-transaction for hours, that's a connection leak. Immediately set `idle_in_transaction_session_timeout` to 1-5 minutes to auto-terminate these in the future.
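A sketch of the queries this step implies; the 5-minute value and the `yourdb` name are placeholders, and on Cloud SQL the timeout is typically applied as a database flag rather than via `ALTER DATABASE`:

```sql
-- Sessions stuck in an open transaction, oldest first
SELECT pid, usename, application_name,
       now() - state_change AS idle_for,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change;

-- Auto-terminate future offenders after 5 minutes (placeholder database name)
ALTER DATABASE yourdb SET idle_in_transaction_session_timeout = '5min';
```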
3. Identify connection leaks by application or user
Use `postgresql.connections_by_process` to break down connections by client application and user. If one app or service has 80+ connections when it should have 20, you've found your leak. Check if Django apps have `CONN_MAX_AGE` set too high (>300 seconds) or if connection pools aren't properly bounded. Also look for N+1 query patterns — under load, routes with eager loading can spawn thousands of connections.
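The metric name above comes from the monitoring integration; the raw SQL equivalent is just an aggregate over `pg_stat_activity`:

```sql
-- Connection count per application, user, and client host: the leaker floats to the top
SELECT application_name, usename, client_addr, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr
ORDER BY connections DESC;
```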
4. Assess an immediate max_connections increase with memory constraints
You can increase `max_connections` as a quick fix, but understand the memory cost: each connection consumes 5-10MB for the forked backend process. If you bump from 100 to 500, that's ~2-4GB just for connection overhead. Check available memory and set `max_connections` to 200-300 if you have headroom, but don't go beyond that without deploying a connection pooler. At 500+ connections, context-switching overhead dominates and the database becomes unresponsive.
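A rough headroom check before touching the limit; the 8 MB figure below is just the midpoint of the 5-10MB estimate above, not a measured value:

```sql
-- Compare usage to the limit and estimate worst-case backend memory overhead (~8 MB per slot assumed)
SELECT current_setting('max_connections')::bigint AS max_connections,
       (SELECT count(*) FROM pg_stat_activity)    AS current_connections,
       pg_size_pretty(current_setting('max_connections')::bigint * 8 * 1024 * 1024) AS approx_backend_overhead;
-- On Cloud SQL the limit is changed via the max_connections database flag (restart required).
```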
5. Deploy PgBouncer for long-term connection pooling
PgBouncer is the sustainable fix for connection exhaustion — it sits between your app and PostgreSQL, multiplexing hundreds or thousands of app connections onto a small pool of actual database connections (typically 20-50). Configure it in transaction mode with `default_pool_size=20-50` and `max_client_conn=1000`. Set your app's `CONN_MAX_AGE=0` so PgBouncer manages connections, not the app. This is especially critical for multi-process app servers (Gunicorn, Prefect workers) that multiply connections by worker count.
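Once PgBouncer is in place, you can sanity-check the multiplexing from its admin console (connect with `psql` to the built-in `pgbouncer` admin database; 6432 is PgBouncer's default listen port, and the admin user is whatever you listed in `admin_users`):

```sql
-- psql -h PGBOUNCER_HOST -p 6432 -U <admin_user> pgbouncer
SHOW POOLS;    -- cl_active/cl_waiting (app side) vs sv_active/sv_idle (database side)
SHOW CLIENTS;  -- every client connection currently held by the pooler
SHOW STATS;    -- per-database traffic, average query and transaction times
```

If `cl_active` sits in the hundreds while `sv_active` stays around `default_pool_size`, the pooler is doing its job.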
6. Monitor connection patterns to prevent recurrence
Set up alerts when `postgresql.percent_usage_connections` crosses 80% — this gives you runway before you hit the wall. Track `postgresql.waiting_queries` and `postgresql.active_queries` to correlate connection exhaustion with query backlog. After deploying PgBouncer, monitor that the backend connection count stays low (20-50) while client connections can scale to hundreds. This confirms the pooler is working correctly.
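A single-row query that could back such an alert; the threshold mirrors the 80% guidance above and the boolean column name is illustrative:

```sql
-- Returns one row an external check can evaluate; over_threshold flips true at 80% usage
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections,
       count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
       count(*) * 100.0 / current_setting('max_connections')::int >= 80 AS over_threshold
FROM pg_stat_activity;
```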


Related Insights

- **Connection exhaustion prevents new client connections** (diagnostic_step, critical): Each PostgreSQL connection consumes 5-10MB RAM via the fork model. At 200 connections: 1-2GB RAM overhead before queries run. At 500+ connections: excessive context-switching dominates query execution time. The database becomes unresponsive under concurrent load.
- **Connection failure when requests exceed default 100 connection limit** (critical)
- **Connection count approaching max_connections limit** (critical)
- **Connection count approaching max_connections causes connection failures** (critical)
- **Connection exhaustion blocks new client connections** (outcome, critical)
- **Slow Query Backlog Masks Redis Connection Pool Exhaustion** (warning): Redis slowlog entries accumulating (redis.slowlog.length rising) can indicate operations blocking on network or disk I/O, exhausting connection pools and causing cascading failures in dependent services even when Redis CPU appears healthy.
- **Connection Rejection Cascade From Maxclient Saturation** (critical): When Redis reaches maximum client connections (redis.connections.rejected increasing), new connection attempts fail silently while existing connections continue working normally, creating intermittent failures that are difficult to diagnose from application metrics alone.
- **Redis Connection Pool Starvation from Blocking Patterns** (warning): When async endpoints make synchronous Redis calls, they hold connections longer than necessary while blocking the event loop, causing artificial connection pool exhaustion even when Redis server capacity is available.
- **Redis Connection Saturation Stalls Async Event Loop** (critical): When the Redis connection pool exhausts under high concurrency, blocking Redis operations (even from async endpoints) stall the FastAPI event loop, causing serial-like request processing and tail latency spikes despite low CPU utilization.
- **PostgreSQL connection count exceeds limits causing memory exhaustion** (warning)
- **Database connection pool exhaustion causes Connection refused errors** (critical)
- **idle in transaction connections holding resources**: stub insight created from a scenario discovery hint
- **max_connections misconfigured for instance size**: stub insight created from a scenario discovery hint
- **application connection management issues**: stub insight created from a scenario discovery hint
- **memory constraints from too many connections**: stub insight created from a scenario discovery hint
- **Connection limit approaching maximum causes application timeouts** (threshold_indicator, critical)
- **connection pooler needed** (prerequisite): stub insight created from a scenario discovery hint
- **connection leaks** (prerequisite): stub insight created from a scenario discovery hint
- **Idle transactions hold locks and block operations** (prerequisite, critical)
- **Excessive max_connections consumes memory and requires connection pooling** (supporting_evidence, warning)
- **Insufficient shared_buffers causes excessive disk reads and slow queries** (supporting_evidence, warning)
- **N+1 database queries with eager loading cause connection exhaustion** (prerequisite, critical)


Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL CloudWatch Metrics
Redis Datadog
Redis Prometheus
Redis Native Metrics
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration