Ray Serve Backpressure Load Shedding

warning

Resource ContentionUpdated Feb 23, 2026

Ray Serve deployments reject requests with 503 errors when max_queued_requests limit is exceeded, indicating healthy load shedding but potential capacity issues.

Sources

Best practices in production #docs.ray.io

Technologies:

RaySymptoms of this issue are visible in Ray metrics and logs

How to detect:

Monitor for 503 HTTP status codes in Ray Serve responses and 'Request dropped due to backpressure' warnings in proxy logs. Track ray_serve_count_http_error_requested and ray_scheduler_unscheduleable_tasks metrics rising. Check when ongoing requests (ray_grpc_server_requested_handling) equals max_ongoing_requests * num_replicas, indicating saturation.

Recommended action:

This is expected load shedding behavior protecting system stability. First implement exponential backoff with jitter in clients (backoff_factor=1, total=5 retries, timeout=10s). Then evaluate if 503 rate exceeds acceptable error budget: if so, increase num_replicas in deployment config or increase max_ongoing_requests per replica if CPU/memory headroom exists (check ray_component_cpu_percentage and ray_component_rss). Consider implementing autoscaling based on ray_grpc_server_requested_handling approaching max_ongoing_requests threshold.