BentoML

MaxConcurrencyMiddleware returns 503 under load

warning
Resource Contention
Updated Mar 7, 2026 (via Exa)
How to detect:

When the number of concurrent requests on a worker exceeds traffic.max_concurrency, MaxConcurrencyMiddleware returns 503 Service Unavailable immediately instead of queuing the request. Clients therefore see failed requests during traffic spikes.
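The reject-without-queuing behavior can be illustrated with a small simulation. This is a minimal sketch, not BentoML's middleware: ConcurrencyLimiter and simulate are hypothetical names, and the 10 ms sleep stands in for model inference.

```python
import asyncio


class ConcurrencyLimiter:
    """Hypothetical stand-in: rejects work beyond max_concurrency instead of queuing it."""

    def __init__(self, max_concurrency: int) -> None:
        self.max_concurrency = max_concurrency
        self.in_progress = 0

    async def handle(self, work) -> int:
        # Reject immediately when the worker is already at its limit.
        if self.in_progress >= self.max_concurrency:
            return 503
        self.in_progress += 1
        try:
            await work()
            return 200
        finally:
            # Always release the slot, even if the work raised.
            self.in_progress -= 1


async def simulate(num_requests: int, max_concurrency: int) -> list:
    """Fire num_requests at once and collect the status code of each."""
    limiter = ConcurrencyLimiter(max_concurrency)

    async def work() -> None:
        await asyncio.sleep(0.01)  # stand-in for model inference

    return await asyncio.gather(
        *(limiter.handle(work) for _ in range(num_requests))
    )


if __name__ == "__main__":
    print(asyncio.run(simulate(num_requests=5, max_concurrency=2)))
```

With five simultaneous requests and a limit of 2, the first two requests are served and the remaining three are rejected with 503 rather than waiting for a free slot.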

Recommended action:

Monitor bentoml.api_server.request.in_progress against the traffic.max_concurrency limit. Scale out worker processes before in-progress requests reach this limit. On the client side, retry 503 responses with exponential backoff.
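Client-side retry with exponential backoff might look like the sketch below. It is transport-agnostic and assumes a caller-supplied send callable that returns a (status, body) tuple. The name call_with_backoff, the default delays, and the use of full jitter are illustrative choices, not part of BentoML.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, bytes]],
    max_attempts: int = 5,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call send() and retry on 503 with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        status, body = send()
        # Return on any non-503 response, or when attempts are exhausted.
        if status != 503 or attempt == max_attempts - 1:
            return status, body
        # The delay ceiling doubles each attempt, capped at max_delay.
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # Full jitter spreads retries out so clients do not retry in lockstep.
        sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The send callable would wrap whatever HTTP client the application already uses. Randomizing the delay matters here because synchronized retries from many clients can push the worker straight back over its concurrency limit.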