Ray Serve Backpressure Load Shedding
warningRay Serve deployments reject requests with 503 errors when max_queued_requests limit is exceeded, indicating healthy load shedding but potential capacity issues.
Monitor for 503 HTTP status codes in Ray Serve responses and 'Request dropped due to backpressure' warnings in proxy logs. Track ray_serve_count_http_error_requested and ray_scheduler_unscheduleable_tasks metrics rising. Check when ongoing requests (ray_grpc_server_requested_handling) equals max_ongoing_requests * num_replicas, indicating saturation.
This is expected load shedding behavior protecting system stability. First implement exponential backoff with jitter in clients (backoff_factor=1, total=5 retries, timeout=10s). Then evaluate if 503 rate exceeds acceptable error budget: if so, increase num_replicas in deployment config or increase max_ongoing_requests per replica if CPU/memory headroom exists (check ray_component_cpu_percentage and ray_component_rss). Consider implementing autoscaling based on ray_grpc_server_requested_handling approaching max_ongoing_requests threshold.