Connection Timeout and Network Issues

Critical · Incident Response

Application experiencing connection timeouts or network-related PostgreSQL errors.

Prompt: Our app is getting intermittent 'could not connect to server' and connection timeout errors from PostgreSQL. How do I diagnose whether this is a network issue, database overload, or configuration problem?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens: 1,336 vs 892
Latency: 30.5s vs 18.8s
Turns: 6 vs 2
Tool Calls: 2 vs 0
Schema Calls: 0 vs 0
Response: 3,451 chars vs 2,560 chars
Evaluation
Without Schema: B+ vs With Schema: A-
The response without Schema provides comprehensive diagnostic coverage with detailed sections on network issues, database overload, and configuration problems, including extensive command examples and a helpful troubleshooting table. However, it begins with caveats about missing application code context and ends with four clarifying questions, making it less immediately actionable. The response with Schema takes a more direct approach, opening with specific diagnostic queries that can be executed immediately (checking pg_stat_activity states, connection usage versus limits, and pgBouncer pool status). Both responses cover similar diagnostic territory, but the version with Schema is more action-oriented and concise while maintaining good technical depth, whereas the baseline version offers slightly more detail in network diagnostics but in a more tutorial-like format that delays hands-on troubleshooting.
Without Schema
2 tool calls
I'll help you diagnose those PostgreSQL connection issues. Let me first check your current setup to provide more targeted advice. Since I don't see application code here, I'll give you a comprehensive diagnostic approach for PostgreSQL connection issues. Here's how to systematically identify the root cause:

## **1. Quick Initial Checks**

**Check PostgreSQL is running:**
```bash
# On the database server
systemctl status postgresql
# or
pg_isready -h <host> -p 5432
```

**Test basic connectivity:**
```bash
# From your app server
telnet <db-host> 5432
# or
nc -zv <db-host> 5432
```

## **2. Diagnose Network Issues**

**Check for packet loss/latency:**
```bash
# Run this during the issue
ping -c 100 <db-host>
mtr <db-host>  # Better than ping, shows route issues
```

**Monitor connection establishment:**
```bash
# On app server - watch for SYN retransmits
tcpdump -i any port 5432 -nn
```

**Check firewall/security groups:**
- Verify iptables rules
- AWS security groups / cloud firewall rules
- Connection limits on load balancers

## **3. Diagnose Database Overload**

**Check active connections:**
```sql
-- Connect to PostgreSQL
SELECT count(*) FROM pg_stat_activity;
SELECT max_connections FROM pg_settings WHERE name = 'max_connections';

-- See connection states
SELECT state, count(*) FROM pg_stat_activity GROUP BY state;
```

**Monitor connection pool exhaustion:**
```sql
-- Check for connection spikes
SELECT datname, count(*) FROM pg_stat_activity GROUP BY datname;
```

**Check server load:**
```bash
# On database server
top
iostat -x 2  # Disk I/O
vmstat 2     # Memory/CPU
```

**Query PostgreSQL logs:**
```bash
# Look for these patterns
tail -f /var/log/postgresql/postgresql-*.log | grep -E "connection|timeout|max_connections"
```

## **4. Diagnose Configuration Issues**

**Check connection limits:**
```sql
SHOW max_connections;
SHOW superuser_reserved_connections;
```

**Review timeout settings:**
```sql
-- PostgreSQL side
SHOW statement_timeout;
SHOW idle_in_transaction_session_timeout;
SHOW tcp_keepalives_idle;
SHOW tcp_keepalives_interval;
```

**Application connection pool settings** (check your code):
- Pool size too large? (> max_connections / num_app_instances)
- Connection timeout too short?
- Idle connection timeout?
- Missing connection retry logic?

## **5. Common Culprits & Solutions**

| **Symptom** | **Likely Cause** | **Solution** |
|-------------|------------------|--------------|
| Errors during traffic spikes | Connection pool exhausted | Increase `max_connections`, tune pool size |
| Timeouts after idle periods | Firewall dropping idle conns | Enable TCP keepalives |
| Random DNS-related errors | DNS resolution failing | Use IP address or fix DNS |
| "Sorry, too many clients" | Hit max_connections | Increase limit or use pgBouncer |
| Slow then timeout | Long-running queries blocking | Add `statement_timeout`, optimize queries |

## **6. Monitoring Script**

Want me to create a diagnostic script that runs these checks automatically? It could:
- Test connectivity every N seconds
- Log connection pool stats
- Query PostgreSQL metrics
- Alert on anomalies

**What would help most?**
1. Do you have access to PostgreSQL logs?
2. Are you using a connection pooler (PgBouncer, pgpool)?
3. What language/framework is your app using?
4. Is this a cloud database (RDS, Cloud SQL) or self-hosted?

This info will help me give you more specific diagnostic commands for your setup.
With Schema
I'll help you systematically diagnose these PostgreSQL connection issues. Here's a practical approach to identify the root cause:

## Immediate Checks

**1. Check current connection status:**
```bash
# See active connections and their states
psql -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state;"

# Check max connections vs current usage
psql -c "SELECT max_conn, used, res_for_super FROM (SELECT count(*) used FROM pg_stat_activity) t1, (SELECT setting::int res_for_super FROM pg_settings WHERE name='superuser_reserved_connections') t2, (SELECT setting::int max_conn FROM pg_settings WHERE name='max_connections') t3;"
```

**2. Check for connection pool exhaustion:**
```bash
# If using pgBouncer or similar
psql -p 6432 -U pgbouncer pgbouncer -c "SHOW POOLS;"
psql -p 6432 -U pgbouncer pgbouncer -c "SHOW CLIENTS;"
```

**3. Review recent logs for patterns:**
```bash
# Look for connection errors in PostgreSQL logs
tail -f /var/log/postgresql/postgresql-*.log | grep -E "connection|timeout|refused"
# Check for "too many connections", "remaining connection slots", authentication failures
```

## Diagnostic Tests

**Network Issues:**
- Test connectivity: `psql -h <db_host> -U <user> -d <db> -c "SELECT 1;"`
- Check DNS resolution time: `time nslookup <db_host>`
- Test from multiple app servers to see if it's isolated
- Check firewall/security group rules for connection limits

**Database Overload:**
- Check slow queries: `SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;`
- Monitor locks: `SELECT * FROM pg_locks WHERE NOT granted;`
- Check replication lag (if applicable): `SELECT * FROM pg_stat_replication;`
- CPU/Memory on DB server: `top` or cloud metrics

**Configuration Issues:**
```sql
-- Check key connection settings
SHOW max_connections;
SHOW shared_buffers;
SHOW work_mem;
SHOW statement_timeout;
SHOW idle_in_transaction_session_timeout;
```

## Common Culprits

1. **Connection pool misconfiguration** - App connection pool size > DB max_connections
2. **Idle connections not being closed** - Check for long-running idle transactions
3. **Network MTU/packet issues** - Especially in cloud/containerized environments
4. **TCP keepalive settings** - Connections timing out silently

**Would you like me to:**
- Help analyze your specific logs/metrics?
- Review your app's database connection configuration?
- Set up monitoring queries to track this over time?

Let me know what information you have available (logs, config files, monitoring tools) and I can provide more targeted diagnostics.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When troubleshooting PostgreSQL connection timeouts, start by confirming the server is running and accessible, then check for connection pool exhaustion—the most common culprit. Only after ruling out pool issues should you investigate network connectivity, database-side resource constraints, and timeout configurations. The key is distinguishing between client-side pool exhaustion, network problems, and server overload.

1. Verify PostgreSQL server is running and accessible
First things first: check that PostgreSQL is actually running with `sudo systemctl status postgresql` or equivalent. Look at the `uptime` metric—if it's recently reset or shows zero, the server crashed. Verify you're connecting to the correct port (default 5432) by checking postgresql.conf. If the server is down or the port is mismatched, you'll get immediate connection refusals rather than intermittent timeouts, but it's the quickest thing to rule out.
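This first check can be wrapped in a small helper. The exit-code meanings below are the documented `pg_isready` codes (0 = accepting connections, 1 = rejecting, 2 = no response, 3 = no attempt was made); the host and port in the comment are placeholders for your environment, and the final call simulates a code-2 result for illustration.

```shell
# Map pg_isready exit codes to failure modes so the distinction between
# "refused" and "timed out" is explicit from the first check.
interpret_isready() {
  case "$1" in
    0) echo "server up and accepting connections" ;;
    1) echo "server rejecting connections (starting up, shutting down, or in recovery)" ;;
    2) echo "no response: wrong host/port, firewall drop, or server down" ;;
    *) echo "check failed before connecting (bad parameters)" ;;
  esac
}

# In practice: pg_isready -h "$DB_HOST" -p 5432 -t 3; interpret_isready $?
interpret_isready 2
```

A code-1 result (server up but rejecting) is worth distinguishing: it points at startup/recovery churn rather than the network.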
2. Check for connection pool exhaustion
In my experience, this is the #1 cause of intermittent connection errors. Check `postgresql.backends` against max_connections—if you're consistently at 95%+ capacity, new connection attempts will timeout or fail. Use `connections_by_process` to identify which applications are hogging connections. Run `SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction'` to find stuck connections that aren't being released. If using Django, verify CONN_MAX_AGE is set appropriately (0 for dev, 60-300 seconds for production)—values above 600 almost always cause pool exhaustion.
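The oversubscription arithmetic behind this step can be sketched as a quick budget check. All numbers here are hypothetical; in practice substitute the outputs of `SHOW max_connections;` and `SHOW superuser_reserved_connections;` and your real instance count and pool size.

```shell
# Back-of-envelope pool budget: the sum of every app instance's pool must
# stay below max_connections minus superuser_reserved_connections, or
# traffic spikes will exhaust the server and new connects will time out.
APP_INSTANCES=8
POOL_SIZE_PER_INSTANCE=15
MAX_CONNECTIONS=100   # from: SHOW max_connections;
RESERVED=3            # from: SHOW superuser_reserved_connections;

total=$(( APP_INSTANCES * POOL_SIZE_PER_INSTANCE ))
budget=$(( MAX_CONNECTIONS - RESERVED ))
echo "total pool: ${total}, server budget: ${budget}"
if [ "$total" -gt "$budget" ]; then
  echo "OVERSUBSCRIBED by $(( total - budget )): shrink pools or add pgBouncer"
fi
```

With these sample numbers the pools sum to 120 against a budget of 97, which is exactly the kind of silent oversubscription that only shows up under load.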
3. Test basic network connectivity and DNS resolution
From your application server, test raw connectivity with `telnet <db-host> 5432` or `nc -zv <db-host> 5432`. If using a hostname, verify DNS with `nslookup` or `dig`—DNS caching issues or round-robin DNS can cause intermittent failures when one IP in the pool is bad. Check firewall rules and security groups to confirm port 5432 is open bidirectionally. Intermittent failures that resolve after 30-60 seconds often point to security group rule propagation delays or misconfigured health checks.
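One way to spot the round-robin-DNS case described above is to count A records behind the database hostname. The record list below is a fabricated sample standing in for real `dig +short <db-host>` output, so the check stays illustrative rather than environment-specific.

```shell
# If the DB hostname resolves to several IPs, one bad address in the pool
# produces exactly the intermittent-failure pattern described above.
answers='10.0.1.12
10.0.1.13
10.0.1.14'
count=$(printf '%s\n' "$answers" | wc -l)
echo "A records: ${count}"
if [ "$count" -gt 1 ]; then
  echo "round-robin DNS: probe each IP separately with nc -zv <ip> 5432"
fi
```

When more than one record comes back, test each IP individually; a single unhealthy backend explains failures that "go away on retry".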
4. Assess database-side resource constraints
Check if PostgreSQL is overwhelmed and slow to accept new connections. Look at `waiting_queries`—if consistently above 10, the server is backed up. Examine `postgresql.locks` for blocking operations; high lock counts can cause connection establishment to timeout as the server struggles to allocate resources. If `postgresql.backends` is near max_connections and `checkpoint_timed` shows frequent checkpoints with high I/O, the server may be rejecting connections or taking 5+ seconds to accept them, triggering client timeouts.
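The lock check in this step can be scripted as a one-liner over `psql --csv` output. The sample rows are fabricated for illustration; in practice pipe in `psql --csv -c "SELECT granted FROM pg_locks;"` instead.

```shell
# Count ungranted locks; a sustained high count means the server is backed
# up and may be slow to accept new connections.
sample='granted
t
f
t
f
f'
waiting=$(printf '%s\n' "$sample" | grep -c '^f$')
echo "ungranted locks: ${waiting}"
if [ "$waiting" -gt 10 ]; then
  echo "server backed up: identify blockers with pg_blocking_pids()"
fi
```

`pg_blocking_pids(pid)` (PostgreSQL 9.6+) then tells you which sessions are holding up each waiter.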
5. Investigate network latency and packet loss
Run `mtr` tests from app to database server for at least 100 packets to measure latency distribution and identify packet loss. Even 1-2% packet loss can cause TCP connection establishment to retry and timeout. Check monitoring for network-related log flooding—sudden spikes in network error logs often indicate hardware failures, saturated links, or routing issues. If latency percentiles show p95 > 100ms on what should be a local network, investigate network path and switch configurations.
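Extracting the loss figure from a ping run can be automated with a small parse. The summary line below is an embedded sample; in practice feed in the last lines of `ping -c 100 <db-host>`.

```shell
# Pull the loss percentage out of ping's summary line and flag anything
# at or above 1%, since even that forces SYN retransmits that surface as
# intermittent connect timeouts.
summary='100 packets transmitted, 98 received, 2% packet loss, time 9912ms'
loss=$(printf '%s' "$summary" | grep -oE '[0-9]+(\.[0-9]+)?% packet loss' | cut -d% -f1)
echo "packet loss: ${loss}%"
awk -v l="$loss" 'BEGIN { exit (l >= 1) ? 0 : 1 }' && echo "investigate network path"
```

Pair this with `mtr` to see *where* on the path the loss occurs rather than just that it exists.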
6. Review connection timeout configurations
Verify your application's connection timeout isn't too aggressive for your network conditions. If you're seeing timeouts on a 50ms latency network but your timeout is 1-2 seconds, you have almost no margin for TCP handshake retries. Check PostgreSQL's `tcp_keepalives_idle`, `tcp_keepalives_interval`, and `tcp_keepalives_count` settings in postgresql.conf, and verify client-side TCP keepalive is enabled. A timeout that's too short manifests as intermittent failures during normal load, while a genuine network issue shows 100% failure during problem windows.
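The "no margin for retries" point can be made concrete with rough arithmetic. The RTT and timeout values here are hypothetical; the retransmit intervals reflect Linux's default initial SYN retransmission timer of ~1s with exponential backoff (~1s, then ~2s).

```shell
# A connect timeout must cover at least one or two lost-SYN retransmit
# rounds, or ordinary packet loss shows up as intermittent timeouts.
RTT_MS=50
CONNECT_TIMEOUT_MS=2000
min_for_one_retry=$(( 1000 + RTT_MS ))          # survive one lost SYN
min_for_two_retries=$(( 1000 + 2000 + RTT_MS )) # survive two lost SYNs
echo "timeout ${CONNECT_TIMEOUT_MS}ms; one retry needs ~${min_for_one_retry}ms, two need ~${min_for_two_retries}ms"
if [ "$CONNECT_TIMEOUT_MS" -lt "$min_for_two_retries" ]; then
  echo "little margin: raise the connect timeout or fix the underlying loss"
fi
```

With these numbers a 2s timeout survives one lost handshake packet but not two, which matches the "intermittent failures during normal load" symptom described above.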

Technologies

Related Insights

database availability problem
Stub insight created from scenario discovery hint: database availability problem
Database connection pool exhaustion causes Connection refused errors
critical
timeout settings too aggressive
Stub insight created from scenario discovery hint: timeout settings too aggressive
Network issues cause storage overflow with logs
diagnostic_step, warning
connectivity issues impacting application
diagnostic_step
Stub insight created from scenario discovery hint: connectivity issues impacting application
DNS resolution issues
prerequisite
Stub insight created from scenario discovery hint: DNS resolution issues
firewall or security group misconfiguration
prerequisite
Stub insight created from scenario discovery hint: firewall or security group misconfiguration
network latency or packet loss
prerequisite
Stub insight created from scenario discovery hint: network latency or packet loss
PostgreSQL server process not running causes connection refusal
prerequisite, critical

Relevant Metrics

Monitoring Interfaces

PostgreSQL CloudWatch Metrics
PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)