BentoML

Service autoscaling falls back to CPU-based scaling when concurrency is not configured

warning
configuration
Updated Mar 24, 2026
Technologies:
How to detect:

When concurrency is not set on a BentoML Service, autoscaling falls back to CPU utilization alone. CPU utilization does not reflect how many requests the Service is handling at once, so the Service may scale up too late or stay over-provisioned.

Recommended action:

Set concurrency under the traffic argument of the @bentoml.service decorator, for example traffic={"concurrency": 32}. Run a stress test with a load generation tool such as Locust to find the maximum number of concurrent requests the Service can handle, then set concurrency slightly below that threshold. For Services that use adaptive or continuous batching, set concurrency to match the batch size. For Services that process requests sequentially, set concurrency to 1.
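
The three rules above can be sketched as a small helper. This is a minimal illustration, not part of BentoML: the function name recommend_concurrency, the mode labels, the 0.9 headroom factor, the stress-test result of 32, and the ExampleService class are all assumptions made for the example. Only the traffic={"concurrency": ...} setting on @bentoml.service comes from BentoML itself.

```python
import math
from typing import Optional


def recommend_concurrency(
    mode: str,
    max_concurrent: Optional[int] = None,
    batch_size: Optional[int] = None,
    headroom: float = 0.9,  # assumed margin for "slightly below the threshold"
) -> int:
    """Pick a concurrency value following the recommended action above."""
    if mode == "sequential":
        # Requests are processed one at a time.
        return 1
    if mode == "batching":
        # Adaptive or continuous batching: match the batch size.
        if batch_size is None or batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        return batch_size
    if mode == "general":
        # Stay slightly below the maximum found in the stress test.
        if max_concurrent is None or max_concurrent < 1:
            raise ValueError("max_concurrent must be a positive integer")
        return max(1, math.floor(max_concurrent * headroom))
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical stress-test result: the Service handled at most 32 concurrent requests.
traffic = {"concurrency": recommend_concurrency("general", max_concurrent=32)}

try:
    import bentoml
except ImportError:
    bentoml = None  # the helper above still works without BentoML installed

if bentoml is not None:

    @bentoml.service(traffic=traffic)
    class ExampleService:  # hypothetical Service used only for illustration
        @bentoml.api
        def predict(self, text: str) -> str:
            return text
```

The headroom factor is a judgment call. A Service with volatile latency may need a larger margin than 10 percent, so treat the result as a starting point and re-run the stress test after changing it.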