Upstream Backend Timeout Causing 502 Errors

critical
Incident Response

NGINX returns 502 Bad Gateway when backend servers time out or fail to respond within configured limits.

Prompt: I'm getting 502 Bad Gateway errors from NGINX and my logs show 'upstream timed out'. How do I tell if this is a backend performance issue versus an NGINX timeout configuration problem?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When investigating 502 Bad Gateway errors from NGINX, start by confirming the error type and checking for upstream connection failures. Then examine timeout configurations for mismatches between NGINX and backend settings. Finally, investigate backend performance issues including response time patterns, resource exhaustion, and event loop blocking in async applications.

1. Confirm error type and distribution
Start by checking `nginx-server-zone-responses-code` to confirm you're seeing 502s specifically, not 503s or 504s. As noted in the multi-layer analysis insight, 502 indicates upstream connection failures (backend not reachable or refusing connections), while 503 suggests NGINX capacity issues and 504 indicates upstream timeouts. If you're seeing 502s, either the backend is crashing or unavailable, or NGINX cannot establish connections to it. Look at the rate and pattern: are the errors constant, or do they spike at certain times?
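If you only have raw access logs rather than per-code metrics, the same distribution check can be done directly on the log. A minimal sketch, assuming the default "combined" log format (the sample lines below are hypothetical; in practice you would read `/var/log/nginx/access.log`):

```python
import re
from collections import Counter

# Hypothetical access-log lines in the default "combined" format.
SAMPLE_LOG = """\
10.0.0.1 - - [12/May/2025:10:00:01 +0000] "GET /api/items HTTP/1.1" 502 157 "-" "curl/8.0"
10.0.0.2 - - [12/May/2025:10:00:02 +0000] "GET /api/items HTTP/1.1" 502 157 "-" "curl/8.0"
10.0.0.3 - - [12/May/2025:10:00:03 +0000] "GET /health HTTP/1.1" 200 2 "-" "kube-probe"
10.0.0.4 - - [12/May/2025:10:00:04 +0000] "POST /api/search HTTP/1.1" 504 160 "-" "curl/8.0"
"""

# In the combined format, the status code immediately follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

def error_distribution(log_text: str) -> Counter:
    """Count responses per status code so 502s can be separated from 503s/504s."""
    return Counter(m.group(1) for m in STATUS_RE.finditer(log_text))

dist = error_distribution(SAMPLE_LOG)
print(dist)  # mostly 502s -> connection failures; mostly 504s -> timeouts
```

Bucketing the matches by timestamp as well (hour or minute) shows whether the errors are constant or spiking.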
2. Check upstream connection failure metrics
Look at `nginx-upstream-peers-fails` and `nginx-upstream-peers-unavail` to see if backends are actively failing health checks or becoming unavailable. If `nginx-upstream-peers-unavail` is incrementing, backends are being marked down due to consecutive failures reaching max_fails threshold. Check `nginx-upstream-peers-downtime` to see how long backends have been unavailable. This tells you whether it's a true backend failure versus a timeout configuration issue.
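These metrics typically come from an upstream-status API such as the NGINX Plus `/api/<version>/http/upstreams` endpoint. The exact JSON shape depends on your API version, so the payload below is an assumption modeled on that endpoint; a small filter over it surfaces the peers worth investigating:

```python
import json

# Hypothetical payload in the shape of the NGINX Plus HTTP upstreams endpoint
# (field names assumed; check the schema of your API version).
PAYLOAD = json.loads("""
{
  "backend": {
    "peers": [
      {"server": "10.0.0.10:8000", "state": "up",      "fails": 0,  "unavail": 0, "downtime": 0},
      {"server": "10.0.0.11:8000", "state": "unavail", "fails": 12, "unavail": 3, "downtime": 45210}
    ]
  }
}
""")

def failing_peers(upstreams: dict) -> list[dict]:
    """Return peers that are accumulating failures or have been marked down."""
    flagged = []
    for name, upstream in upstreams.items():
        for peer in upstream["peers"]:
            if peer["fails"] > 0 or peer["unavail"] > 0:
                flagged.append({"upstream": name, **peer})
    return flagged

for peer in failing_peers(PAYLOAD):
    print(f'{peer["upstream"]} {peer["server"]}: fails={peer["fails"]} '
          f'unavail={peer["unavail"]} downtime={peer["downtime"]}ms')
```

A non-zero `unavail` count with growing `downtime` points at a real backend failure; zero `unavail` with rising response times points back at timeout configuration.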
3. Compare proxy and backend timeout settings
Check for timeout configuration mismatches. If your NGINX `proxy_read_timeout` is 60 seconds and your backend (Gunicorn, FastAPI, etc.) also has a 60-second timeout, the proxy can kill the connection just as the backend is about to respond. The proxy timeout should be significantly longer: if your backend timeout is 60s, set NGINX to 90+ seconds. Also verify that `proxy_connect_timeout` isn't too short for backend startup. This is one of the most common causes of 502s that looks like a backend issue but is actually configuration.
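A minimal sketch of the proxy side, assuming a backend whose own worker timeout is 60 seconds (e.g. `gunicorn --timeout 60`); the upstream name and location are placeholders:

```nginx
# Assumed: backend worker timeout is 60s. Give the proxy enough headroom
# that it never races the backend's own timeout.
location /api/ {
    proxy_pass            http://backend;
    proxy_connect_timeout 5s;    # fail fast if the backend is unreachable
    proxy_read_timeout    90s;   # > backend timeout (60s), with headroom
    proxy_send_timeout    90s;
}
```

`nginx -T` prints the full effective configuration, which makes it easy to grep every `proxy_*_timeout` and compare against the backend settings.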
4. Analyze backend response time patterns
Look at `nginx-upstream-peers-response-time` and `nginx-upstream-peers-response-time-histogram` to see if backends are consistently slow or hitting timeout thresholds. Check `nginx-upstream-peers-header-time` specifically — if this is high, backends are taking too long just to start responding. Compare p95/p99 values from the histogram to your timeout settings. If response times are creeping up to match your timeout values (e.g., 58-59 seconds with a 60s timeout), you're likely hitting genuine backend performance problems, not just configuration issues.
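The comparison of tail latency against the timeout can be sketched as below. The bucket layout is an assumption (cumulative counts keyed by upper bound, in the spirit of `nginx-upstream-peers-response-time-histogram`); the point is the flag at the end, which fires when the p99 bucket sits at or near the configured timeout:

```python
# Hypothetical cumulative histogram of upstream response times:
# (upper_bound_seconds, cumulative_request_count).
BUCKETS = [(0.1, 520), (0.5, 910), (1.0, 960), (5.0, 980), (30.0, 988), (60.0, 1000)]
PROXY_READ_TIMEOUT = 60.0  # seconds, from the NGINX config

def percentile_upper_bound(buckets, pct):
    """Smallest bucket upper bound whose cumulative count covers `pct` of requests."""
    total = buckets[-1][1]
    threshold = total * pct
    for upper, cumulative in buckets:
        if cumulative >= threshold:
            return upper
    return buckets[-1][0]

p99 = percentile_upper_bound(BUCKETS, 0.99)
# If the p99 bucket is at or near the timeout, the slowest requests are being
# cut off by the proxy rather than completing: a genuine performance problem.
print(f"p99 <= {p99}s, proxy_read_timeout = {PROXY_READ_TIMEOUT}s, "
      f"at risk: {p99 >= 0.9 * PROXY_READ_TIMEOUT}")
```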
5. Check backend health and resource exhaustion
Review `nginx-upstream-peers-health-checks-fails` and check if backends are failing health checks before user requests fail — this is an early warning signal. Then investigate backend resource usage: check for memory swapping (which can cause 16+ second response times leading to timeouts), CPU saturation, or connection pool exhaustion. If running multiple applications per server, heavy swap usage (>30% of RAM) is a common cause of backends becoming unresponsive and triggering 502s.
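The swap check above reduces to a single ratio. A minimal sketch, with the sample values below standing in for numbers you would pull from `/proc/meminfo` or your host-metrics agent:

```python
# Hypothetical snapshot (values in MiB) from a backend host.
MEM_TOTAL_MIB = 8192
SWAP_USED_MIB = 3100

def swap_pressure(swap_used_mib: float, mem_total_mib: float, limit: float = 0.30) -> bool:
    """Flag hosts where swap usage exceeds `limit` of physical RAM,
    a level at which backends commonly become slow enough to trigger 502s."""
    return swap_used_mib / mem_total_mib > limit

print(swap_pressure(SWAP_USED_MIB, MEM_TOTAL_MIB))  # 3100/8192, roughly 38% of RAM
```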
6. Investigate event loop blocking in async backends
If you're using async backends (FastAPI, Node.js) and seeing `nginx-upstream-peers-response-time` increase while `nginx-upstream-peers-active` stays low with moderate CPU (<70%), you likely have event loop blocking. This happens when async applications make blocking I/O calls (synchronous ORM queries, CPU-heavy JSON processing), causing requests to process serially despite async infrastructure. The backend appears to have capacity but requests queue up behind blocking operations, eventually timing out and causing 502s.
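The failure mode above is easy to reproduce in isolation. In this sketch, `time.sleep` stands in for any synchronous call (an ORM query, heavy JSON processing) inside an async handler; the handler names are illustrative. The blocking variant processes requests serially despite `asyncio.gather`, while offloading the same work to a thread keeps the loop free:

```python
import asyncio
import time

def slow_io():
    time.sleep(0.2)  # synchronous call: nothing else on the loop can run

async def blocking_handler():
    slow_io()  # stalls the entire event loop for 0.2s

async def nonblocking_handler():
    await asyncio.to_thread(slow_io)  # loop stays free while a thread sleeps

async def measure(handler) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(5)))
    return time.perf_counter() - start

blocked = asyncio.run(measure(blocking_handler))       # ~5 x 0.2s: serial
unblocked = asyncio.run(measure(nonblocking_handler))  # ~0.2s: concurrent
print(f"blocking: {blocked:.2f}s, non-blocking: {unblocked:.2f}s")
```

This is exactly the signature described above: the blocking variant shows low concurrency and moderate CPU while wall-clock latency climbs toward the proxy timeout.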

Related Insights

HTTP Error Rate Spikes Require Multi-Layer Analysis
critical
Increases in `nginx_server_zone_responses_4xx` or `nginx_server_zone_responses_5xx` require differentiation between client errors (4xx), upstream connection failures (502), NGINX capacity or configuration issues (503), and upstream timeouts (504, backend 5xx). The same metric can indicate completely different root causes depending on code distribution.
Reverse proxy timeout shorter than Gunicorn timeout prematurely kills connections
warning
Nginx proxy timeouts cause HTTP 408 when Meilisearch is slow to respond
warning
Event Loop Blocking Causes Serial Request Processing
critical
When NGINX proxies to async application servers (FastAPI, Node.js) but those backends make blocking I/O calls, the event loop stalls, causing serial-like request processing despite async infrastructure. Symptoms include flat throughput curves and rising tail latency even when CPU is moderate.
Health Check Failures Indicate Upstream Degradation
warning
Upstream backend health check failures (`nginx_stream_upstream_peers_health_checks_fails`, `nginx_stream_upstream_peers_health_checks_unhealthy`) provide early warning of backend degradation before user-facing errors occur. These often precede increases in `nginx_upstream_peers_fails` and `nginx_upstream_peers_downtime`.
Gunicorn workers enter infinite timeout-SIGKILL cycle on Google App Engine
critical
Memory swapping causes slow response times and Nginx 504 timeouts
critical

Monitoring Interfaces

NGINX Datadog
NGINX OpenTelemetry