MaxConcurrencyMiddleware returns 503 under load
warning | Resource Contention | Updated Mar 7, 2026 (via Exa)
Technologies:
How to detect:
When the number of concurrent requests on a worker exceeds traffic.max_concurrency, MaxConcurrencyMiddleware rejects the excess requests immediately with 503 Service Unavailable instead of queuing them, so clients see failures during traffic spikes.
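The reject-without-queuing behavior described above can be illustrated with a small standalone model. This is a sketch only, not BentoML's actual middleware code: the class name `RejectingLimiter` and its `handle` method are hypothetical, and the limiter is assumed to be a simple non-blocking semaphore.

```python
import threading


class RejectingLimiter:
    """Hypothetical model: admit at most `max_concurrency` requests at once."""

    def __init__(self, max_concurrency):
        self._slots = threading.BoundedSemaphore(max_concurrency)

    def handle(self, work):
        # Non-blocking acquire: a request that finds no free slot is
        # rejected right away rather than waiting in a queue.
        if not self._slots.acquire(blocking=False):
            return 503
        try:
            work()
            return 200
        finally:
            # Free the slot even if `work` raises.
            self._slots.release()
```

Because nothing waits, a short burst above the limit turns directly into 503 responses, which is why the in-progress count needs headroom below the limit.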
Recommended action:
Monitor bentoml.api_server.request.in_progress against the traffic.max_concurrency limit. Scale out by adding worker processes before in-progress requests reach this threshold. Implement client-side retries with exponential backoff for 503 responses.
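A minimal sketch of the client-side retry recommendation follows. It is not tied to any particular HTTP library: `send` is a hypothetical callable that performs one request and returns the HTTP status code, and the default attempt count and delays are illustrative assumptions. The delay uses full jitter so that many clients retrying after the same spike do not all return at once.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call `send()` until it returns a non-503 status or attempts run out.

    `send` is a hypothetical zero-argument callable returning a status code.
    """
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status != 503:
            return status
        if attempt < max_attempts - 1:
            # Exponentially growing cap, bounded by max_delay, with full
            # jitter: sleep a random time between 0 and the cap.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    # Every attempt returned 503; report the last status to the caller.
    return status
```

In real use, `send` would wrap the actual request to the service, for example a function that posts to the endpoint and returns the response status. Retries of this kind smooth over short spikes but do not replace adding capacity when the in-progress count stays near the limit.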