Memory Exhaustion and OOM Killer Prevention

criticalIncident Response

Preventing PostgreSQL crashes from Linux OOM killer when work_mem settings multiply across concurrent sessions causing system memory exhaustion.

Prompt: “My PostgreSQL server keeps crashing and the Linux OOM killer is terminating the postgres process. I have work_mem set to 256MB but we have hundreds of concurrent connections. How do I calculate safe memory limits and prevent these crashes?”

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

Tokens

1,527vs1,085

Latency

31.2smsvs24.0sms

Turns

2vs2

Tool Calls

0vs0

Schema Calls

0vs0

Response

3,626 charsvs2,649 chars

Evaluation

Without Schema

A-

With Schema

B+

Both responses provide technically accurate, actionable guidance with concrete calculations and immediate fixes for the OOM crash problem. The response without Schema is more comprehensive, including additional safeguards (OOMScoreAdjust systemd configuration), extensive monitoring queries, and detailed PgBouncer setup examples. The response with Schema covers similar ground more concisely and includes logging configuration (log_temp_files, statement_timeout) not present in the baseline. Both correctly identify connection pooling as the essential long-term solution and provide similar mathematical approaches to calculating safe memory limits. The primary difference lies in thoroughness and presentation style rather than fundamental accuracy or problem-solving approach.

Without Schema

I can help you calculate safe PostgreSQL memory limits and prevent OOM killer crashes. The issue is that with `work_mem=256MB` and hundreds of connections, your worst-case memory usage can easily exceed physical RAM.

## Understanding PostgreSQL Memory Usage

**The dangerous math:**
- Each connection can use `work_mem` **multiple times** per query (for sorts, hash joins, CTEs, etc.)
- With 300 connections × 256MB × 2-3 operations = **150-230GB** potential usage
- This doesn't even include `shared_buffers`, maintenance operations, or the ~10MB per-connection overhead

## Safe Memory Calculation

Here's a formula to calculate safe limits:

```
Total RAM Available for Postgres = Total RAM - (OS + other services)
                                   ≈ Total RAM × 0.75

shared_buffers = Total RAM × 0.25 (typical: 25% of RAM, max 40%)

work_mem = (Total RAM × 0.25) / (max_connections × 3)
           ↑                    ↑                 ↑
           RAM for work_mem     connections       operations per query
```

**Example:** For a 32GB RAM server with 300 connections:
```
shared_buffers = 8GB
work_mem = (32GB × 0.25) / (300 × 3) = 8GB / 900 ≈ 9MB
```

Yes, **9MB** not 256MB! For hundreds of connections, you need connection pooling.

## Immediate Fix

**1. Lower work_mem immediately:**
```sql
-- Check current settings
SHOW work_mem;
SHOW max_connections;
SHOW shared_buffers;

-- Set safer value (requires restart or reload)
ALTER SYSTEM SET work_mem = '8MB';
SELECT pg_reload_conf();
```

**2. Set per-session work_mem for heavy queries:**
```sql
-- For specific heavy queries, increase temporarily
SET work_mem = '256MB';
SELECT ... -- your heavy query
RESET work_mem;
```

## Long-term Solution: Connection Pooling

Use **PgBouncer** or **pgpool-II** to reduce actual Postgres connections:

```ini
# PgBouncer config example
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000     # Accept 1000 app connections
default_pool_size = 50     # But only use 50 Postgres connections
reserve_pool_size = 25
```

This lets you keep `work_mem=256MB` with only 50 actual connections, while supporting hundreds of client connections.

## Additional Safeguards

**1. Set memory overcommit protection:**
```bash
# Check current OOM killer settings
cat /proc/sys/vm/overcommit_memory

# Edit PostgreSQL systemd service
sudo systemctl edit postgresql

# Add:
[Service]
OOMScoreAdjust=-200
```

**2. Monitor memory usage:**
```sql
-- Check active queries and their memory
SELECT pid, usename, application_name, state, query
FROM pg_stat_activity 
WHERE state = 'active';

-- Check table sizes (large tables need more work_mem for sorts)
SELECT schemaname, tablename, 
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC
LIMIT 10;
```

**3. Set hard limits in postgresql.conf:**
```conf
max_connections = 300
shared_buffers = 8GB
effective_cache_size = 24GB  # Doesn't allocate, just informs planner
maintenance_work_mem = 1GB
work_mem = 8MB
max_worker_processes = 8
max_parallel_workers_per_gather = 2
```

## What would help most?

Would you like me to:
1. **Calculate specific values** for your server (tell me your total RAM)
2. **Set up PgBouncer** configuration
3. **Create monitoring queries** to identify memory-hungry queries
4. **Review your current postgresql.conf** if you share it

The quick fix is lowering `work_mem` to 8-16MB, but the real solution is connection pooling so you can keep higher `work_mem` with fewer actual connections.

With Schema

The OOM killer is targeting your PostgreSQL because you're likely exceeding available RAM. With 256MB `work_mem` and hundreds of connections, you're at high risk of memory exhaustion.

## Memory Calculation

**Critical formula:**
```
Total RAM needed = shared_buffers + 
                   (max_connections × (work_mem × max_operations_per_query)) +
                   maintenance_work_mem +
                   OS overhead (2-4GB)
```

**Your current risk:**
- 256MB work_mem × 4 operations (typical complex query) × 300 connections = **307GB** potential usage
- A single query can use work_mem multiple times for sorts, hash joins, CTEs, etc.

## Immediate Solutions

### 1. **Reduce work_mem dramatically**
```sql
-- In postgresql.conf
work_mem = 4MB  # Start conservative
```
Then selectively increase for specific queries:
```sql
SET work_mem = '256MB';
-- run expensive query
```

### 2. **Implement connection pooling** (most important)
Use PgBouncer or similar to reduce actual backend connections:
```ini
# PgBouncer example
max_client_conn = 1000
default_pool_size = 20     # Actual PostgreSQL connections
```
This changes your calculation from 300 connections to 20.

### 3. **Set realistic max_connections**
```sql
-- In postgresql.conf
max_connections = 100  # Not 1000
```

### 4. **Adjust shared_buffers**
```sql
-- Typically 25% of total RAM
shared_buffers = 8GB  # Example for 32GB system
```

## Calculate Safe Limits

For a system with **X GB total RAM**:
```
shared_buffers = X × 0.25
work_mem = (X × 0.5) / (max_connections × 4)
Reserve X × 0.25 for OS
```

**Example (32GB system, 20 pooled connections):**
- shared_buffers = 8GB
- work_mem = 16GB / (20 × 4) = 200MB
- OS overhead = 8GB

## Monitor Memory Usage

```sql
-- Check current memory settings
SHOW shared_buffers;
SHOW work_mem;
SHOW max_connections;

-- See active connections
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
```

Linux monitoring:
```bash
# Check PostgreSQL memory usage
ps aux | grep postgres | awk '{sum+=$6} END {print sum/1024 " MB"}'

# Monitor OOM killer logs
dmesg | grep -i "killed process"
journalctl -k | grep -i "out of memory"
```

## Prevent Future Crashes

1. **Enable logging** to catch memory-intensive queries:
   ```sql
   log_temp_files = 0  # Log all temp file usage
   log_min_duration_statement = 1000  # Log slow queries
   ```

2. **Set statement timeout**:
   ```sql
   statement_timeout = '300s'
   ```

3. **Consider upgrading** to a system with more RAM or use connection pooling

What's your current server RAM and typical `max_connections` setting? I can help calculate specific safe values for your setup.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

Memory exhaustion from OOM killer is almost always caused by work_mem multiplying across concurrent connections beyond available RAM. Start by calculating your total potential memory consumption (max_connections × work_mem × operations_per_query), then verify actual usage patterns, and finally tune Linux overcommit settings to prevent unpredictable kills.

1Calculate total potential work_mem consumption

With work_mem=256MB and hundreds of connections, you're at immediate risk. Work_mem is allocated per operation, not per connection—a single query with 3 hash joins can consume 768MB. Check `postgresql.connection.max` and calculate worst case: if max_connections=200, that's 200 × 256MB × 3 operations = 150GB potential consumption. The insight on work_mem concurrent exhaustion shows this is the primary cause of OOM kills. Start by reducing work_mem to 64MB globally and only increase for specific analytical workloads via session-level SET commands.

Excessive work_mem allocation causes OOM Killer to terminate PostgreSQL Excessive work_mem allocation can exhaust system memory under concurrent load Insufficient memory allocation when work_mem multiplies across concurrent sessions postgresql.connection.max

2Check actual connection usage and patterns

Look at `connections_by_process` to see how many connections are actually active versus idle. Many applications over-provision max_connections when connection pooling would be better. If you're seeing >300 connections with `postgresql.connection.max`, you're burning ~10MB per connection just for overhead before any work_mem allocation. The max-connections-memory-exhaustion insight confirms that keeping max_connections between 100-200 and using pgBouncer for connection pooling prevents this waste.

Excessive max_connections consumes memory and requires connection pooling postgresql.connection.max

3Review temp file usage to validate work_mem adequacy

Check `temp_bytes` to see if queries are spilling to disk because work_mem is too small. High temp file creation means work_mem is insufficient for actual workloads, but in your scenario with 256MB work_mem causing OOM kills, you likely have the opposite problem—work_mem is too generous for your connection count. If temp_bytes is low, your current work_mem is adequate; reduce it to prevent OOM. If temp_bytes is high after reducing work_mem, use session-level adjustments for heavy queries rather than increasing the global setting.

Excessive work_mem allocation can exhaust system memory under concurrent load temp_bytes

4Configure Linux memory overcommit to prevent unpredictable OOM kills

Set vm.overcommit_memory=2 and vm.overcommit_ratio=50-80 in /etc/sysctl.conf to make the kernel refuse allocations that would exceed available RAM rather than allowing overcommit and invoking OOM killer unpredictably. The linux-overcommit-oom-kills-postgres insight shows that default overcommit settings (vm.overcommit_memory=0) cause OOM killer to terminate PostgreSQL even when operating within configured limits. With strict overcommit, you'll get predictable malloc failures instead of random process terminations, making memory issues visible before they become crashes.

Linux memory overcommit can cause OOM killer to terminate PostgreSQL

5Calculate total memory allocation across all PostgreSQL settings

Review shared_buffers (check `buffercache_summary_buffer_used`), maintenance_work_mem, and autovacuum_work_mem in addition to work_mem. The oom-killer-invoked-postgresql insight confirms you need to calculate total requirements versus available RAM. Typical safe allocation: shared_buffers=25% of RAM, then allocate remaining for work_mem considering max_connections. On a 32GB server, that's ~8GB for shared_buffers, leaving 20GB (after OS overhead) for work_mem—with max_connections=200, that means 100MB max per connection, but remember operations multiply this.

Linux OOM killer terminates PostgreSQL processes buffercache_summary_buffer_used

6Monitor for memory pressure and swapping

Track system memory usage and swap activity continuously—any swapping means you're already in trouble. The low-memory-swapping-crash insight shows that swapping degrades performance before causing crashes. Use `vmstat 1` or similar to watch for swap activity and `free -m` to check available memory. If you see swapping with your current settings, you've confirmed memory exhaustion is imminent, and the OOM killer will strike when a memory spike occurs during concurrent query execution.

Low memory availability causes swapping and database crashes