Kubernetes Ingress Controller 503 Service Unavailable

Critical Incident Response

NGINX Ingress Controller returning 503 errors due to pod readiness, service configuration, or deployment timing issues.

Prompt: My NGINX Ingress Controller in Kubernetes is intermittently returning 503 errors during deployments and sometimes even under normal traffic. How do I diagnose whether this is a pod readiness issue, service misconfiguration, or resource problem?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing 503 errors in NGINX Ingress Controller, start by checking if your backend pods are actually healthy and passing readiness probes. Then differentiate whether you're hitting NGINX worker capacity limits or if backends are rejecting connections. During deployments, watch for timing issues where old pods terminate before new ones are fully ready.

1. Verify backend pod health and readiness probe status
Start with `nginx_upstream_peers_health_checks_fails` and `nginx_upstream_peers_health_checks_unhealthy` to see if your backend pods are actually passing health checks. In Kubernetes, a pod can be Running but not Ready, and NGINX will return 503 if no healthy upstreams exist. Check if `nginx_upstream_peers_health_checks_unhealthy` > 0 or if failures are increasing — this tells you the issue is with your application pods, not NGINX itself. Correlate with `nginx_upstream_peers_downtime` to see how long backends have been unavailable.
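As a quick sketch of this check, the following kubectl commands show whether pods are Ready and whether the Service still has endpoints (the namespace and names are placeholders, not from this document):

```shell
# List pods with their READY column: a pod showing 0/1 is Running but
# failing its readiness probe, so NGINX will not route traffic to it.
kubectl get pods -n my-app -o wide

# A Service whose Ready pods have all dropped out has an empty ENDPOINTS
# list, which is exactly the condition that makes the ingress return 503.
kubectl get endpoints my-service -n my-app

# Inspect the Events section for readiness probe failure messages.
kubectl describe pod <pod-name> -n my-app
```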
2. Confirm you're actually seeing 503s vs 502/504 errors
Use `nginx_server_zone_responses_code` to get the exact HTTP status code distribution. A 503 specifically means NGINX is overloaded or has no available upstreams, while 502 indicates upstream connection failures and 504 means timeouts. If you're seeing 502s, the problem is network connectivity or backends crashing; if 504s, your backends are too slow. Don't assume all 5xx errors have the same root cause — each code points to a different layer of the stack.
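If you need the exact code distribution without a metrics pipeline, you can tally status codes straight from the controller's access logs. A minimal sketch, using toy log lines; in a real cluster you would pipe `kubectl logs -n ingress-nginx ...` into the same awk command:

```shell
# NGINX combined log format puts the status code in awk field 9.
cat > sample_access.log <<'EOF'
10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /api HTTP/1.1" 503 0 "-" "curl"
10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /api HTTP/1.1" 502 0 "-" "curl"
10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /api HTTP/1.1" 503 0 "-" "curl"
10.0.0.4 - - [01/Jan/2024:00:00:04 +0000] "GET /api HTTP/1.1" 200 512 "-" "curl"
EOF

# Tally codes, most frequent first: here 503 dominates, pointing at
# "no available upstreams" rather than timeouts (504) or crashes (502).
awk '{codes[$9]++} END {for (c in codes) print codes[c], c}' sample_access.log | sort -rn
```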
3. Check for deployment timing issues and pod churn
During rolling updates, watch `nginx_upstream_peers_unavail` for spikes; this counts how many times upstreams became unavailable due to the max_fails threshold. If you see this jumping during deployments, your old pods are terminating before new ones pass readiness probes, leaving NGINX with zero healthy backends momentarily. Review your deployment strategy (maxSurge, maxUnavailable) and make sure new pods can pass their readiness probes (roughly periodSeconds × failureThreshold) well before old pods finish draining within terminationGracePeriodSeconds. Check `nginx_upstream_peers_downtime` to quantify how long you're operating with degraded backend capacity.
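A deployment shaped to avoid this gap might look like the following sketch (names, ports, and timings are illustrative, not taken from this document):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # bring one new pod up first...
      maxUnavailable: 0         # ...before any old pod is removed
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /healthz      # hypothetical health endpoint
            port: 8080
          periodSeconds: 5
          failureThreshold: 2
        lifecycle:
          preStop:
            exec:
              # Give the ingress time to stop routing to this pod
              # before the process receives SIGTERM.
              command: ["sleep", "10"]
```

With maxUnavailable: 0, Kubernetes will not remove an old pod until a new one reports Ready, which closes the window where NGINX has zero healthy upstreams.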
4. Look for NGINX worker saturation
Check if `nginx_server_zone_processing` is sustained at high levels while `nginx_upstream_peers_health_checks_unhealthy` is zero; this means NGINX itself is the bottleneck, not your backends. If you're hitting worker capacity, NGINX will return 503 even though backends are healthy and CPU might only be at 50-60%. This is often caused by slow backend responses holding connections open and exhausting the worker connection pool, so also look at backend response times. Increase worker_connections and worker_processes if you're hitting this limit.
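In raw NGINX these limits are raised in nginx.conf; a minimal sketch with illustrative sizes:

```nginx
worker_processes auto;          # one worker per CPU core
events {
    worker_connections 8192;    # each proxied request consumes two
                                # connections: client side + upstream side
}
```

With the Kubernetes ingress-nginx controller you would set these through its ConfigMap (`worker-processes` and `max-worker-connections`) rather than editing nginx.conf directly.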
5. Check for connection pool exhaustion to backends
Compare `nginx_upstream_peers_active` against your configured connection limits (if set). If active connections are maxed out across multiple upstreams, a single misbehaving service can saturate the connection pool and cause 503s for other services on shared infrastructure. This is especially common when backends have limited concurrency (like PHP-FPM pools). Set per-upstream connection limits that match your backend capacity and configure acquire timeouts to fail fast rather than queueing indefinitely.
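A sketch of such limits in NGINX configuration (addresses and sizes are hypothetical; match max_conns to your backend's real concurrency, e.g. a PHP-FPM pool's pm.max_children):

```nginx
upstream app_backend {
    # Cap concurrent connections per backend so one saturated service
    # cannot exhaust capacity shared with other services.
    server 10.0.1.10:8080 max_conns=50;
    server 10.0.1.11:8080 max_conns=50;
}

server {
    location / {
        proxy_pass http://app_backend;
        proxy_connect_timeout 2s;   # fail fast instead of queueing indefinitely
    }
}
```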
6. Investigate upstream connection failures and network issues
Rising `nginx_upstream_peers_fails` indicates NGINX cannot establish connections to backends even though they appear healthy. In Kubernetes this often means DNS resolution issues, network policy blocks, or pods being in CrashLoopBackOff with intermittent readiness. Check `nginx_server_zone_discarded` for requests dropped without responses — high discard rates combined with connection failures suggest network-level problems between the Ingress Controller and your service endpoints.
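These network-level causes can be narrowed down with a few commands (all names below are placeholders):

```shell
# Does the Service have endpoints, and does its name resolve in-cluster?
kubectl get endpoints my-service -n my-app
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-service.my-app.svc.cluster.local

# Surface restart loops (CrashLoopBackOff) that cause intermittent readiness.
kubectl get pods -n my-app --sort-by='.status.containerStatuses[0].restartCount'

# Check whether a NetworkPolicy could be blocking ingress-controller traffic.
kubectl get networkpolicy -n my-app
```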

Related Insights

Health Check Failures Indicate Upstream Degradation
warning
Upstream backend health check failures (nginx_stream_upstream_peers_health_checks_fails, nginx_stream_upstream_peers_health_checks_unhealthy) provide early warning of backend degradation before user-facing errors occur. These often precede increases in nginx_upstream_peers_fails and nginx_upstream_peers_downtime.
HTTP Error Rate Spikes Require Multi-Layer Analysis
critical
Increases in nginx_server_zone_responses_4xx or nginx_server_zone_responses_5xx require differentiation between client errors (4xx), NGINX configuration issues (502/503), and upstream failures (504, backend 5xx). The same metric can indicate completely different root causes depending on code distribution.
Upstream Connection Pool Saturation Blocks NGINX Workers
critical
When NGINX proxies to backends (PHP-FPM, FastCGI, uwsgi) without proper connection limits, a single site can exhaust the proxy connection pool, blocking other sites on shared infrastructure. By default NGINX places no per-upstream cap on connections; limits must be set explicitly with max_conns on each upstream server.
Request Queue Buildup Indicates Worker Exhaustion
critical
When NGINX reaches its connection capacity (worker_processes × worker_connections), new requests queue at the load balancer level, causing gateway timeouts even when CPU and dependencies appear healthy. This often manifests as moderate CPU (~50-60%) but rising tail latency and 502/504 errors.
Worker saturation can manifest as NGINX 499 status codes
warning
Status 499 is NGINX's non-standard code for a client that closed the connection before a response was sent. When workers are saturated, queued requests take long enough that clients time out and disconnect, so a rising 499 rate often accompanies worker exhaustion even before 503s appear.

Monitoring Interfaces

NGINX Datadog
NGINX OpenTelemetry