Kubernetes Ingress Controller 503 Service Unavailable

Critical Incident Response

NGINX Ingress Controller returning 503 errors due to pod readiness, service configuration, or deployment timing issues.

Prompt: My NGINX Ingress Controller in Kubernetes is intermittently returning 503 errors during deployments and sometimes even under normal traffic. How do I diagnose whether this is a pod readiness issue, service misconfiguration, or resource problem?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When diagnosing 503 errors in NGINX Ingress Controller, start by checking if your backend pods are actually healthy and passing readiness probes. Then differentiate whether you're hitting NGINX worker capacity limits or if backends are rejecting connections. During deployments, watch for timing issues where old pods terminate before new ones are fully ready.

1. Verify backend pod health and readiness probe status
Start with `nginx_upstream_peers_health_checks_fails` and `nginx_upstream_peers_health_checks_unhealthy` to see if your backend pods are actually passing health checks. In Kubernetes, a pod can be Running but not Ready, and NGINX will return 503 if no healthy upstreams exist. Check if `nginx_upstream_peers_health_checks_unhealthy` > 0 or if failures are increasing — this tells you the issue is with your application pods, not NGINX itself. Correlate with `nginx_upstream_peers_downtime` to see how long backends have been unavailable.
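As a quick sketch of this check, the following kubectl commands show whether pods are Ready and whether the Service still has endpoints (the namespace and names are placeholders, not from this document):

```shell
# List pods with their READY column: a pod showing 0/1 is Running but
# failing its readiness probe, so NGINX will not route traffic to it.
kubectl get pods -n my-app -o wide

# A Service whose Ready pods have all dropped out has an empty ENDPOINTS
# list, which is exactly the condition that makes the ingress return 503.
kubectl get endpoints my-service -n my-app

# Inspect the Events section for readiness probe failure messages.
kubectl describe pod <pod-name> -n my-app
```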
2. Confirm you're actually seeing 503s vs 502/504 errors
Use `nginx_server_zone_responses_code` to get the exact HTTP status code distribution. A 503 specifically means NGINX is overloaded or has no available upstreams, while 502 indicates upstream connection failures and 504 means timeouts. If you're seeing 502s, the problem is network connectivity or backends crashing; if 504s, your backends are too slow. Don't assume all 5xx errors have the same root cause — each code points to a different layer of the stack.
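If you need the exact code distribution without a metrics pipeline, you can tally status codes straight from the controller's access logs. A minimal sketch, using toy log lines; in a real cluster you would pipe `kubectl logs -n ingress-nginx ...` into the same awk command:

```shell
# NGINX combined log format puts the status code in awk field 9.
cat > sample_access.log <<'EOF'
10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /api HTTP/1.1" 503 0 "-" "curl"
10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /api HTTP/1.1" 502 0 "-" "curl"
10.0.0.3 - - [01/Jan/2024:00:00:03 +0000] "GET /api HTTP/1.1" 503 0 "-" "curl"
10.0.0.4 - - [01/Jan/2024:00:00:04 +0000] "GET /api HTTP/1.1" 200 512 "-" "curl"
EOF

# Tally codes, most frequent first: here 503 dominates, pointing at
# "no available upstreams" rather than timeouts (504) or crashes (502).
awk '{codes[$9]++} END {for (c in codes) print codes[c], c}' sample_access.log | sort -rn
```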
3. Check for deployment timing issues and pod churn
During rolling updates, watch `nginx_upstream_peers_unavail` for spikes; this counts how many times upstreams became unavailable due to the max_fails threshold. If you see this jumping during deployments, your old pods are terminating before new ones pass readiness probes, leaving NGINX with zero healthy backends momentarily. Review your deployment strategy (maxSurge, maxUnavailable) and make sure new pods can pass their readiness probes (roughly periodSeconds × failureThreshold) well before old pods finish draining within terminationGracePeriodSeconds. Check `nginx_upstream_peers_downtime` to quantify how long you're operating with degraded backend capacity.
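A deployment shaped to avoid this gap might look like the following sketch (names, ports, and timings are illustrative, not taken from this document):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1               # bring one new pod up first...
      maxUnavailable: 0         # ...before any old pod is removed
  template:
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /healthz      # hypothetical health endpoint
            port: 8080
          periodSeconds: 5
          failureThreshold: 2
        lifecycle:
          preStop:
            exec:
              # Give the ingress time to stop routing to this pod
              # before the process receives SIGTERM.
              command: ["sleep", "10"]
```

With maxUnavailable: 0, Kubernetes will not remove an old pod until a new one reports Ready, which closes the window where NGINX has zero healthy upstreams.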
4. Look for NGINX worker saturation
Check if `nginx_server_zone_processing` is sustained at high levels while `nginx_upstream_peers_health_checks_unhealthy` is zero; this means NGINX itself is the bottleneck, not your backends. If you're hitting worker capacity, NGINX will return 503 even though backends are healthy and CPU might only be at 50-60%. This is often caused by slow backend responses holding connections open and exhausting the worker connection pool, so also look at backend response times. Increase worker_connections and worker_processes if you're hitting this limit.
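In raw NGINX these limits are raised in nginx.conf; a minimal sketch with illustrative sizes:

```nginx
worker_processes auto;          # one worker per CPU core
events {
    worker_connections 8192;    # each proxied request consumes two
                                # connections: client side + upstream side
}
```

With the Kubernetes ingress-nginx controller you would set these through its ConfigMap (`worker-processes` and `max-worker-connections`) rather than editing nginx.conf directly.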
5. Check for connection pool exhaustion to backends
Compare `nginx_upstream_peers_active` against your configured connection limits (if set). If active connections are maxed out across multiple upstreams, a single misbehaving service can saturate the connection pool and cause 503s for other services on shared infrastructure. This is especially common when backends have limited concurrency (like PHP-FPM pools). Set per-upstream connection limits that match your backend capacity and configure acquire timeouts to fail fast rather than queueing indefinitely.
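A sketch of such limits in NGINX configuration (addresses and sizes are hypothetical; match max_conns to your backend's real concurrency, e.g. a PHP-FPM pool's pm.max_children):

```nginx
upstream app_backend {
    # Cap concurrent connections per backend so one saturated service
    # cannot exhaust capacity shared with other services.
    server 10.0.1.10:8080 max_conns=50;
    server 10.0.1.11:8080 max_conns=50;
}

server {
    location / {
        proxy_pass http://app_backend;
        proxy_connect_timeout 2s;   # fail fast instead of queueing indefinitely
    }
}
```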
6. Investigate upstream connection failures and network issues
Rising `nginx_upstream_peers_fails` indicates NGINX cannot establish connections to backends even though they appear healthy. In Kubernetes this often means DNS resolution issues, network policy blocks, or pods being in CrashLoopBackOff with intermittent readiness. Check `nginx_server_zone_discarded` for requests dropped without responses — high discard rates combined with connection failures suggest network-level problems between the Ingress Controller and your service endpoints.
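These network-level causes can be narrowed down with a few commands (all names below are placeholders):

```shell
# Does the Service have endpoints, and does its name resolve in-cluster?
kubectl get endpoints my-service -n my-app
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup my-service.my-app.svc.cluster.local

# Surface restart loops (CrashLoopBackOff) that cause intermittent readiness.
kubectl get pods -n my-app --sort-by='.status.containerStatuses[0].restartCount'

# Check whether a NetworkPolicy could be blocking ingress-controller traffic.
kubectl get networkpolicy -n my-app
```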

Related Insights

Health Check Failures Indicate Upstream Degradation
warning
Upstream backend health check failures (nginx_stream_upstream_peers_health_checks_fails, nginx_stream_upstream_peers_health_checks_unhealthy) provide early warning of backend degradation before user-facing errors occur. These often precede increases in nginx_upstream_peers_fails and nginx_upstream_peers_downtime.
HTTP Error Rate Spikes Require Multi-Layer Analysis
critical
Increases in nginx_server_zone_responses_4xx or nginx_server_zone_responses_5xx require differentiation between client errors (4xx), NGINX configuration issues (502/503), and upstream failures (504, backend 5xx). The same metric can indicate completely different root causes depending on code distribution.
Upstream Connection Pool Saturation Blocks NGINX Workers
critical
When NGINX proxies to backends (PHP-FPM, FastCGI, uwsgi) without proper connection limits, a single site can exhaust the proxy connection pool, blocking other sites on shared infrastructure. By default NGINX places no per-upstream cap on connections; limits must be set explicitly with max_conns on each upstream server.
Request Queue Buildup Indicates Worker Exhaustion
critical
When NGINX reaches its connection capacity (worker_processes × worker_connections), new requests queue at the load balancer level, causing gateway timeouts even when CPU and dependencies appear healthy. This often manifests as moderate CPU (~50-60%) but rising tail latency and 502/504 errors.
Worker saturation can manifest as NGINX 499 status codes
warning
Status 499 is NGINX's non-standard code for a client that closed the connection before a response was sent. When workers are saturated, queued requests take long enough that clients time out and disconnect, so a rising 499 rate often accompanies worker exhaustion even before 503s appear.

Monitoring Interfaces

NGINX Datadog
NGINX OpenTelemetry