Connection Pool Exhaustion

Incident Response · critical

The application is getting connection-refused errors as PostgreSQL approaches or hits its max_connections limit.

Prompt: We're getting 'FATAL: sorry, too many clients already' errors in our app logs. Our Cloud SQL PostgreSQL instance is at 95% of max_connections. What do I do right now and how do I prevent this?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

| Metric | Without Schema | With Schema |
| --- | --- | --- |
| Tokens | 1,206 | 2,431 |
| Latency | 24.0 s | 1.0 m |
| Turns | 2 | 23 |
| Tool calls | 0 | 11 |
| Schema calls | 0 | 7 |
| Response length | 3,394 chars | 2,001 chars |
Evaluation

Without Schema: B+ · With Schema: A-
Both responses provide specific, actionable guidance with SQL queries and configuration examples. The response without Schema is more comprehensive, covering multiple pooling options, investigation queries, and long-term strategies like read replicas. The response with Schema is more focused and includes critical context: it explains the memory overhead per connection (5-10MB) and why pooling solves the problem, making the recommended solution easier to understand. It also cites the official PostgreSQL documentation and mentions the important operational detail of superuser_reserved_connections. While the response without Schema offers more breadth, the response with Schema's grounding in authoritative sources and clearer explanation of the underlying issue give it a slight edge for getting an operator on the right track quickly.
Without Schema
I'll help you address this connection exhaustion issue. Here's what to do:

## **Immediate Actions (Right Now)**

1. **Increase max_connections temporarily** (Cloud SQL Console or CLI):

```bash
gcloud sql instances patch YOUR_INSTANCE_NAME \
  --database-flags=max_connections=200
```

Note: This requires a restart. Default is usually 100; consider 150-200 depending on your instance size.

2. **Kill idle/stuck connections** (if you can access the DB):

```sql
-- Find idle connections
SELECT pid, usename, application_name, state, state_change
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '5 minutes';

-- Kill them (be careful!)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes';
```

3. **Restart your application pods/instances** (if they're leaking connections):

```bash
kubectl rollout restart deployment/your-app  # or equivalent
```

## **Investigate the Root Cause**

```sql
-- See current connection usage by application/user
SELECT application_name, usename, count(*)
FROM pg_stat_activity
GROUP BY application_name, usename
ORDER BY count(*) DESC;

-- Check for long-running queries
SELECT pid, now() - query_start as duration, state, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
```

Common culprits:

- **Connection leaks**: App not closing connections
- **No connection pooling**: Each request opens a new connection
- **Pool misconfiguration**: Pool size × app instances > max_connections
- **Connection storms**: Retries/restarts all connecting at once

## **Short-term Fixes**

1. **Add/fix connection pooling** in your app:
   - **PgBouncer** (recommended for Cloud SQL - can be a sidecar)
   - Transaction pooling mode typically allows 10-100x more clients

```bash
# Example PgBouncer config
[databases]
mydb = host=CLOUD_SQL_IP port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20  # Actual DB connections
```

2. **Fix application connection management**:
   - Ensure connections are returned to pool in `finally` blocks
   - Set reasonable connection timeouts
   - Configure pool size properly: `instances × pool_size < max_connections × 0.8`

## **Long-term Prevention**

1. **Right-size your connection pool**:

```
Total pool size = (core_count × 2) + effective_spindle_count
# For Cloud SQL, typically 10-20 connections per app instance
```

2. **Use Cloud SQL Proxy or Auth Proxy** (connection pooling built-in)

3. **Implement connection pooling strategy**:
   - App-level pooling (HikariCP for Java, psycopg2 pool for Python, pgx pool for Go)
   - External pooler (PgBouncer, PgPool-II)
   - Both (app pool → PgBouncer → Cloud SQL)

4. **Add monitoring/alerts**:

```sql
-- Create a view to monitor connection usage
CREATE VIEW connection_usage AS
SELECT count(*) as connections,
       count(*) * 100.0 / (SELECT setting::int FROM pg_settings WHERE name='max_connections') as pct
FROM pg_stat_activity;
```

Alert when connections > 80% of max.

5. **Consider Cloud SQL read replicas** for read-heavy workloads

Do you have access to your application code and Cloud SQL console? I can help you implement specific fixes based on your stack.
With Schema
11 tool calls · 7 Schema calls
I'll get you expert guidance on handling this PostgreSQL connection limit issue on Cloud SQL.

## Immediate Actions

**Right now, do this:**

1. **Identify connection hogs:**

```sql
SELECT pid, usename, application_name, state,
       now() - state_change AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY state_change;
```

2. **Kill stuck/long-running connections if needed:**

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > interval '10 minutes';
```

3. **Check your Cloud SQL max_connections setting:**

```bash
gcloud sql instances describe YOUR_INSTANCE_NAME --format="value(settings.databaseFlags)"
```

## Why This Happens

Each PostgreSQL connection consumes **5-10MB RAM** via the fork model. At your scale:

- 200 connections = 1-2GB RAM overhead *before queries even run*
- 500+ connections = excessive context-switching kills performance
- At 95% utilization, you're one small spike away from total failure

## Long-term Fix: Deploy Connection Pooling

**Use PgBouncer** between your app and Cloud SQL:

```
# PgBouncer config
[databases]
yourdb = host=YOUR_CLOUD_SQL_IP port=5432 dbname=yourdb

[pgbouncer]
pool_mode = transaction
default_pool_size = 20   # actual DB connections
max_client_conn = 1000   # app connections allowed
query_timeout = 30       # prevent connection starvation
```

This lets 1,000 app connections share just 20 real database connections.

## Configuration Changes

Set these on your Cloud SQL instance:

- `max_connections`: increase if currently too low (but pooling is better)
- `superuser_reserved_connections = 3`: reserves admin slots so you can always connect during exhaustion

## Monitoring

Alert when `postgresql.backends` reaches **80% of max_connections** (you're already past this at 95%).

Source: [PostgreSQL Runtime Configuration](https://www.postgresql.org/docs/15/runtime.html)

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When you're getting 'too many clients' errors, first understand what's consuming your connections — idle-in-transaction sessions are the most common culprit. Check connection state distribution, hunt for leaks by application, then assess whether to increase max_connections (with memory considerations) or deploy PgBouncer for long-term connection pooling. The immediate fix is killing idle-in-transaction sessions; the sustainable fix is PgBouncer.

1. Check connection state distribution and identify what's consuming slots
Start by querying `pg_stat_activity` to see how your connections are distributed — active, idle, idle-in-transaction. Compare `postgresql.backends` against `postgresql.connection.max` and check `postgresql.percent_usage_connections` to confirm you're actually at 95%+ utilization. If you have dozens or hundreds of connections in 'idle' or 'idle in transaction' states, that's your smoking gun — connections aren't being released properly.
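A minimal sketch of that first check, using only standard `pg_stat_activity` columns (nothing instance-specific assumed):

```sql
-- How connections are distributed across states (active, idle, idle in transaction, ...)
SELECT state, count(*) AS connections
FROM pg_stat_activity
GROUP BY state
ORDER BY connections DESC;

-- Current usage against the configured limit
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(count(*) * 100.0 / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```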
2. Hunt for idle-in-transaction connections holding resources
Check `idle_in_transaction_time` to find sessions stuck in open transactions — these are the #1 cause of connection exhaustion because they hold locks and prevent connection reuse. Query `pg_stat_activity WHERE state = 'idle in transaction'` and look at `state_change` timestamp. If you see connections idle-in-transaction for hours, that's a connection leak. Immediately set `idle_in_transaction_session_timeout` to 1-5 minutes to auto-terminate these in the future.
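A sketch of the queries this step implies; the 5-minute value and the `yourdb` name are placeholders, and on Cloud SQL the timeout is typically applied as a database flag rather than via `ALTER DATABASE`:

```sql
-- Sessions stuck in an open transaction, oldest first
SELECT pid, usename, application_name,
       now() - state_change AS idle_for,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change;

-- Auto-terminate future offenders after 5 minutes (placeholder database name)
ALTER DATABASE yourdb SET idle_in_transaction_session_timeout = '5min';
```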
3. Identify connection leaks by application or user
Use `postgresql.connections_by_process` to break down connections by client application and user. If one app or service has 80+ connections when it should have 20, you've found your leak. Check if Django apps have `CONN_MAX_AGE` set too high (>300 seconds) or if connection pools aren't properly bounded. Also look for N+1 query patterns — under load, routes with eager loading can spawn thousands of connections.
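The metric name above comes from the monitoring integration; the raw SQL equivalent is just an aggregate over `pg_stat_activity`:

```sql
-- Connection count per application, user, and client host: the leaker floats to the top
SELECT application_name, usename, client_addr, count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr
ORDER BY connections DESC;
```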
4. Assess an immediate max_connections increase with memory constraints
You can increase `max_connections` as a quick fix, but understand the memory cost: each connection consumes 5-10MB for the forked backend process. If you bump from 100 to 500, that's ~2-4GB just for connection overhead. Check available memory and set `max_connections` to 200-300 if you have headroom, but don't go beyond that without deploying a connection pooler. At 500+ connections, context-switching overhead dominates and the database becomes unresponsive.
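A rough headroom check before touching the limit; the 8 MB figure below is just the midpoint of the 5-10MB estimate above, not a measured value:

```sql
-- Compare usage to the limit and estimate worst-case backend memory overhead (~8 MB per slot assumed)
SELECT current_setting('max_connections')::bigint AS max_connections,
       (SELECT count(*) FROM pg_stat_activity)    AS current_connections,
       pg_size_pretty(current_setting('max_connections')::bigint * 8 * 1024 * 1024) AS approx_backend_overhead;
-- On Cloud SQL the limit is changed via the max_connections database flag (restart required).
```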
5. Deploy PgBouncer for long-term connection pooling
PgBouncer is the sustainable fix for connection exhaustion — it sits between your app and PostgreSQL, multiplexing hundreds or thousands of app connections onto a small pool of actual database connections (typically 20-50). Configure it in transaction mode with `default_pool_size=20-50` and `max_client_conn=1000`. Set your app's `CONN_MAX_AGE=0` so PgBouncer manages connections, not the app. This is especially critical for multi-process app servers (Gunicorn, Prefect workers) that multiply connections by worker count.
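Once PgBouncer is in place, you can sanity-check the multiplexing from its admin console (connect with `psql` to the built-in `pgbouncer` admin database; 6432 is PgBouncer's default listen port, and the admin user is whatever you listed in `admin_users`):

```sql
-- psql -h PGBOUNCER_HOST -p 6432 -U <admin_user> pgbouncer
SHOW POOLS;    -- cl_active/cl_waiting (app side) vs sv_active/sv_idle (database side)
SHOW CLIENTS;  -- every client connection currently held by the pooler
SHOW STATS;    -- per-database traffic, average query and transaction times
```

If `cl_active` sits in the hundreds while `sv_active` stays around `default_pool_size`, the pooler is doing its job.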
6. Monitor connection patterns to prevent recurrence
Set up alerts when `postgresql.percent_usage_connections` crosses 80% — this gives you runway before you hit the wall. Track `postgresql.waiting_queries` and `postgresql.active_queries` to correlate connection exhaustion with query backlog. After deploying PgBouncer, monitor that the backend connection count stays low (20-50) while client connections can scale to hundreds. This confirms the pooler is working correctly.
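A single-row query that could back such an alert; the threshold mirrors the 80% guidance above and the boolean column name is illustrative:

```sql
-- Returns one row an external check can evaluate; over_threshold flips true at 80% usage
SELECT count(*) AS connections,
       current_setting('max_connections')::int AS max_connections,
       count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_tx,
       count(*) * 100.0 / current_setting('max_connections')::int >= 80 AS over_threshold
FROM pg_stat_activity;
```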


Related Insights

- **Connection exhaustion prevents new client connections** (diagnostic_step, critical): Each PostgreSQL connection consumes 5-10MB RAM via the fork model. At 200 connections: 1-2GB RAM overhead before queries run. At 500+ connections: excessive context-switching dominates query execution time. The database becomes unresponsive under concurrent load.
- **Connection failure when requests exceed default 100 connection limit** (critical)
- **Connection count approaching max_connections limit** (critical)
- **Connection count approaching max_connections causes connection failures** (critical)
- **Connection exhaustion blocks new client connections** (outcome, critical)
- **Slow Query Backlog Masks Redis Connection Pool Exhaustion** (warning): Redis slowlog entries accumulating (redis.slowlog.length rising) can indicate operations blocking on network or disk I/O, exhausting connection pools and causing cascading failures in dependent services even when Redis CPU appears healthy.
- **Connection Rejection Cascade From Maxclient Saturation** (critical): When Redis reaches maximum client connections (redis.connections.rejected increasing), new connection attempts fail silently while existing connections continue working normally, creating intermittent failures that are difficult to diagnose from application metrics alone.
- **Redis Connection Pool Starvation from Blocking Patterns** (warning): When async endpoints make synchronous Redis calls, they hold connections longer than necessary while blocking the event loop, causing artificial connection pool exhaustion even when Redis server capacity is available.
- **Redis Connection Saturation Stalls Async Event Loop** (critical): When the Redis connection pool exhausts under high concurrency, blocking Redis operations (even from async endpoints) stall the FastAPI event loop, causing serial-like request processing and tail latency spikes despite low CPU utilization.
- **PostgreSQL connection count exceeds limits causing memory exhaustion** (warning)
- **Database connection pool exhaustion causes Connection refused errors** (critical)
- **idle in transaction connections holding resources**: stub insight created from a scenario discovery hint
- **max_connections misconfigured for instance size**: stub insight created from a scenario discovery hint
- **application connection management issues**: stub insight created from a scenario discovery hint
- **memory constraints from too many connections**: stub insight created from a scenario discovery hint
- **Connection limit approaching maximum causes application timeouts** (threshold_indicator, critical)
- **connection pooler needed** (prerequisite): stub insight created from a scenario discovery hint
- **connection leaks** (prerequisite): stub insight created from a scenario discovery hint
- **Idle transactions hold locks and block operations** (prerequisite, critical)
- **Excessive max_connections consumes memory and requires connection pooling** (supporting_evidence, warning)
- **Insufficient shared_buffers causes excessive disk reads and slow queries** (supporting_evidence, warning)
- **N+1 database queries with eager loading cause connection exhaustion** (prerequisite, critical)


Monitoring Interfaces

PostgreSQL Prometheus
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL CloudWatch Metrics
Redis Datadog
Redis Prometheus
Redis Native Metrics
PostgreSQL Prometheus Exporter (pg_exporter)
PostgreSQL Datadog Integration