MultiXact Member Space Exhaustion

Critical Incident Response

Emergency response when PostgreSQL approaches MultiXact member wraparound, which prevents all write operations and requires immediate vacuum intervention.

Prompt: My PostgreSQL database is throwing errors about MultiXact member space exhaustion and all writes are failing. I see messages about emergency autovacuum in the logs. What is MultiXact member space and how do I fix this before we have a complete outage?

With vs Without Schema

Same prompt, same model (claude-sonnet-4-5-20250929) — the only difference is whether Schema's knowledge graph was available.

All figures are without Schema vs with Schema:

Tokens: 1,450 vs 1,132
Latency: 27.8 s vs 24.5 s
Turns: 2 vs 2
Tool Calls: 0 vs 0
Schema Calls: 0 vs 0
Response: 4,216 chars vs 3,091 chars
Evaluation
Without Schema: B vs With Schema: A-
The response with Schema provides a more focused emergency response that includes a critical single-user mode option for completely locked databases, which the response without Schema omits entirely. While both responses offer good technical depth and diagnostic queries, the response without Schema asks clarifying questions at the end rather than providing immediate actionable guidance. The response with Schema also better communicates urgency with specific thresholds (95% usage, "hours at most") and organizes the information more effectively for emergency triage. The single-user mode fallback is a significant differentiator that could save critical time in a production outage scenario.
Without Schema
This is a critical PostgreSQL issue that needs immediate attention. Let me explain what's happening and how to fix it.

## What is MultiXact Member Space?

**MultiXact** (Multi-Transaction) is PostgreSQL's mechanism for tracking multiple transactions that hold row-level locks on the same tuple. This happens when:

- Multiple transactions use `SELECT ... FOR SHARE` or `SELECT ... FOR KEY SHARE`
- Foreign keys are being checked
- Multiple transactions are updating different columns of the same row

The "member space" stores the list of transaction IDs that share locks. It's a fixed-size circular buffer (similar to transaction ID space), and when it wraps around without being vacuumed, you get exhaustion.

## Immediate Diagnosis

Run these commands to check the severity:

```sql
-- Check current MultiXact age and limits
SELECT datname,
       age(datminmxid) as mxid_age,
       mxid_age(datminmxid) as multixact_age
FROM pg_database
ORDER BY mxid_age(datminmxid) DESC;

-- Check which tables need vacuuming most urgently
SELECT schemaname, tablename,
       n_tup_ins + n_tup_upd + n_tup_del as total_changes,
       last_autovacuum, last_vacuum,
       age(relfrozenxid) as xid_age,
       mxid_age(relminmxid) as mxid_age
FROM pg_stat_user_tables t
JOIN pg_class c ON c.relname = t.relname
WHERE c.relkind = 'r'
ORDER BY mxid_age DESC
LIMIT 20;
```

## Emergency Fixes (In Order)

### 1. Immediate: Aggressive Manual VACUUM

```sql
-- Run VACUUM on the most problematic tables immediately
-- Get the list from the query above and vacuum them
VACUUM (VERBOSE, ANALYZE) your_table_name;

-- If very urgent, run VACUUM FREEZE on critical tables
VACUUM (FREEZE, VERBOSE, ANALYZE) your_table_name;
```

### 2. Temporarily Boost Autovacuum Aggressiveness

```sql
-- Make autovacuum more aggressive (requires superuser)
ALTER SYSTEM SET autovacuum_max_workers = 6;           -- increase workers
ALTER SYSTEM SET autovacuum_naptime = '10s';           -- run more frequently
ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 2000;  -- allow more I/O
ALTER SYSTEM SET vacuum_multixact_failsafe_age = 400000000;  -- emergency vacuum threshold

-- Reload configuration
SELECT pg_reload_conf();
```

### 3. Emergency: Kill Long-Running Transactions

Long-running transactions prevent vacuum from cleaning up:

```sql
-- Find long-running transactions blocking cleanup
SELECT pid, datname, usename, state,
       now() - xact_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - xact_start > interval '5 minutes'
ORDER BY duration DESC;

-- Kill problematic ones (carefully!)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid = <problematic_pid>;
```

### 4. Per-Table Emergency Vacuum

For the worst offenders:

```sql
-- Set aggressive vacuum settings per table
ALTER TABLE problematic_table SET (
  autovacuum_vacuum_scale_factor = 0.01,  -- vacuum at 1% changes
  autovacuum_vacuum_cost_delay = 0,       -- no delay
  autovacuum_multixact_freeze_max_age = 400000000
);

-- Then immediately vacuum it
VACUUM (FREEZE, VERBOSE, ANALYZE) problematic_table;
```

## Long-Term Prevention

After the emergency is resolved:

1. **Monitor MultiXact age** - set up alerts when it approaches 1.5 billion
2. **Tune autovacuum** - ensure it runs frequently enough for your workload
3. **Avoid long transactions** - they prevent cleanup
4. **Review locking patterns** - excessive FOR SHARE locks can cause buildup
5. **Consider increasing `autovacuum_multixact_freeze_max_age`** from default (400M) to 600M if needed

## Check Progress

Monitor the situation:

```sql
-- Watch vacuum progress
SELECT * FROM pg_stat_progress_vacuum;

-- Monitor MultiXact age trending down
SELECT datname, mxid_age(datminmxid) as multixact_age
FROM pg_database
WHERE datname = current_database();
```

**Critical**: Start with the manual VACUUM on your highest mxid_age tables immediately. The emergency autovacuum should be kicking in, but manual intervention is faster. Once writes are restored, tune autovacuum to prevent recurrence.

What database version are you running? And can you share the output of the first diagnostic query? This will help me give more specific guidance.
With Schema
This is a **critical production issue** that needs immediate action. Here's what's happening and how to fix it:

## What is MultiXact Member Space?

MultiXact (Multiple Transaction) IDs track situations where multiple transactions need to lock the same row simultaneously - typically with `SELECT FOR SHARE` locks or foreign key checks. PostgreSQL has limited space (2 billion entries) for these IDs, and when exhausted, all writes fail.

## Immediate Diagnosis

Run these queries to check the severity:

```sql
-- Check how close you are to wraparound
SELECT setting::bigint AS max_members,
       ROUND(100.0 * members_used / max_members, 2) AS pct_used
FROM pg_control_checkpoint(),
     (SELECT setting::bigint AS max_members
      FROM pg_settings WHERE name = 'max_multixact_members') s,
     (SELECT SUM(age(mxid)) AS members_used FROM pg_database) u;

-- Check which tables need vacuuming most urgently
SELECT schemaname, tablename,
       age(datminmxid) as mxid_age,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as size
FROM pg_tables t
JOIN pg_database d ON d.datname = current_database()
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY age(datminmxid) DESC
LIMIT 20;
```

## Immediate Fix

**Option 1: Aggressive Manual VACUUM (preferred if database is still accepting some commands)**

```sql
-- In psql, run with multiple workers
SET maintenance_work_mem = '2GB';  -- adjust based on available RAM
VACUUM (VERBOSE, FREEZE) your_problem_tables;
```

**Option 2: Emergency single-user mode (if database is completely locked)**

```bash
# Stop PostgreSQL
sudo systemctl stop postgresql

# Start in single-user mode (replace 'dbname' with your database)
sudo -u postgres postgres --single -D /var/lib/postgresql/data dbname

# In single-user mode, run:
VACUUM FREEZE;
\q

# Restart normally
sudo systemctl start postgresql
```

## Configuration Changes (prevent recurrence)

Add these to `postgresql.conf`:

```conf
# Trigger autovacuum earlier
autovacuum_multixact_freeze_max_age = 200000000  # down from 400M
autovacuum_max_workers = 5       # increase if you have CPU available

# Make autovacuum more aggressive
autovacuum_vacuum_cost_limit = 1000  # increase from default 200
autovacuum_naptime = 10s             # check more frequently

# Monitor settings
log_autovacuum_min_duration = 0  # log all autovacuum activity
```

Then reload: `SELECT pg_reload_conf();`

## Monitor Progress

```sql
-- Watch autovacuum activity
SELECT pid, datname, usename, state, query,
       now() - query_start AS duration
FROM pg_stat_activity
WHERE query LIKE '%autovacuum%' OR query LIKE '%VACUUM%';
```

## Root Cause Investigation (after fixing)

Check for:

- **Long-running transactions** blocking vacuum: `SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start;`
- **Tables with foreign keys** (heavy MultiXact users)
- **Replication lag** preventing old MultiXact cleanup
- **Disabled autovacuum** on specific tables

**Time-sensitive**: If over 95% used, you have hours at most. Act immediately with Option 1 or 2 above.

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

MultiXact member space exhaustion is a critical emergency requiring immediate action. Start by checking how close you are to the ~4 billion member limit and identifying long-running transactions that block member space reclamation. Then verify emergency autovacuum is running, audit foreign key relationships causing quadratic member growth, and if necessary initiate manual data-only vacuums for faster recovery.

1. Assess MultiXact member space consumption and confirm write blockage
Check `database_wraparound_age_datminmxid` immediately to see how close you are to the ~4 billion member limit that triggers write lockout. The member space is separate from MultiXact ID and fills quadratically when multiple transactions lock the same row. If you're seeing 'members space is nearly exhausted' errors, you're already in emergency territory and writes requiring new MultiXacts are blocked. This metric isn't exposed by standard PostgreSQL monitoring, so you may need to query pg_controldata or check member file storage directly.
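When the metric isn't exposed, the usage percentage can be estimated from the number of segment files under `pg_multixact/members`. A minimal Python sketch; the per-page and per-segment constants are taken from PostgreSQL's `multixact.c` and should be verified against your server version:

```python
# Rough estimate of MultiXact member space usage from the count of
# segment files in $PGDATA/pg_multixact/members. Assumed constants:
# 1636 members per 8 KB SLRU page, 32 pages per segment file.
MEMBERS_PER_PAGE = 1636
PAGES_PER_SEGMENT = 32
MEMBERS_PER_SEGMENT = MEMBERS_PER_PAGE * PAGES_PER_SEGMENT  # 52,352
MEMBER_SPACE_LIMIT = 2**32  # ~4.29 billion members total

def member_space_pct(segment_file_count: int) -> float:
    """Approximate percentage of member space consumed."""
    used = segment_file_count * MEMBERS_PER_SEGMENT
    return round(100.0 * used / MEMBER_SPACE_LIMIT, 1)

# A healthy cluster has a handful of member files; tens of thousands
# means you are approaching the write lockout threshold.
print(member_space_pct(78000))
```

On a live system you would feed this the file count from `ls pg_multixact/members | wc -l` (or your monitoring agent's equivalent).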
2. Identify and terminate long-running transactions blocking member reclamation
Query for long-running transactions using `postgresql.relation.xmin` and pg_stat_activity, focusing on transactions holding old MultiXacts. PostgreSQL can only reclaim member space in contiguous segments, so a single long-running transaction prevents vacuum from reclaiming newer unused member spaces. Terminate any non-critical long transactions immediately—this is an emergency. Avoid mixing long-running analytics queries with high-frequency transactional workloads during this crisis.
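The triage logic for this step can be sketched against rows fetched from `pg_stat_activity`. The row shape, the five-minute cutoff, and the helper name `termination_candidates` are illustrative assumptions:

```python
from datetime import datetime, timedelta

def termination_candidates(rows, now, max_age=timedelta(minutes=5)):
    """Return pids of non-idle transactions older than max_age,
    oldest first -- candidates for pg_terminate_backend().
    Each row mimics a pg_stat_activity record (pid, state, xact_start)."""
    old = [r for r in rows
           if r["xact_start"] is not None
           and r["state"] != "idle"
           and now - r["xact_start"] > max_age]
    old.sort(key=lambda r: r["xact_start"])  # oldest transaction first
    return [r["pid"] for r in old]

# Hypothetical sample resembling pg_stat_activity output
now = datetime(2024, 1, 1, 12, 0)
rows = [
    {"pid": 101, "state": "active", "xact_start": now - timedelta(hours=3)},
    {"pid": 102, "state": "idle", "xact_start": None},
    {"pid": 103, "state": "idle in transaction",
     "xact_start": now - timedelta(minutes=45)},
]
print(termination_candidates(rows, now))  # [101, 103]
```

In practice, review each candidate's query text before terminating; the sketch only encodes the age and state filter.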
3. Verify emergency autovacuum is running and progressing
Check `autovacuumed` metric and pg_stat_progress_vacuum to confirm emergency autovacuum workers are active and not blocked. PostgreSQL should automatically trigger emergency vacuums when member space is exhausted, but these can be blocked by locks or insufficient worker capacity. If autovacuum isn't running or is making no progress, you'll need to manually intervene. On a 30TB+ database, these vacuums can take hours, so confirm they're actually working.
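A progress check can be sketched from the `heap_blks_scanned` and `heap_blks_total` columns of `pg_stat_progress_vacuum`; sampling the view twice and comparing is one illustrative way to spot a blocked worker:

```python
def vacuum_progress_pct(heap_blks_scanned: int, heap_blks_total: int) -> float:
    """Percent of the heap scanned so far (0.0 if the scan hasn't started)."""
    if heap_blks_total == 0:
        return 0.0
    return round(100.0 * heap_blks_scanned / heap_blks_total, 1)

def looks_stalled(prev_scanned: int, cur_scanned: int) -> bool:
    """True if a later sample of heap_blks_scanned shows no forward progress."""
    return cur_scanned <= prev_scanned

print(vacuum_progress_pct(250_000, 1_000_000))  # 25.0
```

If two samples a few minutes apart show no movement, check `pg_locks` for whatever is blocking the vacuum worker before escalating to manual intervention.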
4. Audit foreign key relationships for quadratic member growth patterns
Investigate foreign keys, especially on low-cardinality enum tables or frequently-referenced parent tables. When n backends concurrently insert rows referencing the same parent row, PostgreSQL creates O(n) MultiXacts with O(n²) total members—5 concurrent transactions create 14 members. Check recent write patterns and identify tables with foreign keys experiencing high concurrent insert volume. You may need to temporarily drop problematic foreign keys to stop the bleeding while vacuum catches up.
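The quadratic growth above is simple arithmetic: each new locker of an already-locked row creates a fresh MultiXact containing every current locker, so n concurrent lockers generate roughly 2 + 3 + … + n members in total.

```python
def total_multixact_members(n: int) -> int:
    """Approximate total members created as n transactions successively
    share a lock on the same row: the k-th locker creates a MultiXact
    with k members, for k = 2..n."""
    return sum(range(2, n + 1))

print(total_multixact_members(5))   # 14, matching the example above
print(total_multixact_members(50))  # 1274 -- growth is O(n^2)
```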
5. Check for bulk operations creating sustained MultiXact pressure
Look for ongoing partitioning migrations, backfill jobs, or other bulk write operations that create sustained MultiXact pressure. Table partitioning with dual writes and backfills can rapidly consume member space, especially when combined with foreign keys. If you're running monthly invoice generation or similar high-write scheduled jobs, pause them immediately. Lower backfill concurrency to near-zero until member space recovers.
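One way to operationalize "lower backfill concurrency to near-zero" is a throttle keyed to member-space usage. The thresholds here are illustrative assumptions, not PostgreSQL defaults:

```python
def backfill_concurrency(pct_member_space_used: float,
                         normal_workers: int = 8) -> int:
    """Illustrative throttle policy: halve backfill workers past 70%
    member-space usage, pause backfills entirely past 90%."""
    if pct_member_space_used >= 90:
        return 0
    if pct_member_space_used >= 70:
        return max(1, normal_workers // 2)
    return normal_workers

print(backfill_concurrency(95.0))  # 0 -- pause until member space recovers
```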
6. Initiate manual data-only vacuum on high dead-tuple tables
If emergency autovacuum is insufficient or blocked, manually run VACUUM (FREEZE, VERBOSE) on tables with highest `n_dead_tup` counts, but skip index vacuums for faster recovery. Use the data-only vacuum strategy—indexes can wait, you need to reclaim member space now. Focus on the oldest tables by MultiXact age first. Increase autovacuum workers temporarily if you have CPU headroom. Each table vacuum on large databases takes hours, so prioritize ruthlessly.
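"Prioritize ruthlessly" can be encoded as a sort over the catalog numbers this step mentions. A sketch, assuming per-table `mxid_age` (from `pg_class.relminmxid`) and `n_dead_tup` (from `pg_stat_user_tables`) have already been fetched:

```python
def vacuum_priority(tables: list[dict]) -> list[str]:
    """Order tables for manual VACUUM: oldest MultiXact age first,
    breaking ties by dead-tuple count."""
    ranked = sorted(tables, key=lambda t: (-t["mxid_age"], -t["n_dead_tup"]))
    return [t["name"] for t in ranked]

# Hypothetical catalog snapshot
tables = [
    {"name": "events",   "mxid_age": 390_000_000, "n_dead_tup": 1_200_000},
    {"name": "invoices", "mxid_age": 390_000_000, "n_dead_tup": 9_500_000},
    {"name": "users",    "mxid_age": 50_000_000,  "n_dead_tup": 10_000},
]
print(vacuum_priority(tables))  # ['invoices', 'events', 'users']
```

The ranking criterion is a judgment call; age must come first because it is what determines distance from wraparound, with dead tuples only a tiebreaker.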


Monitoring Interfaces

PostgreSQL Datadog
PostgreSQL Native
PostgreSQL Prometheus Exporter (pg_exporter)