Service autoscaling fails when concurrency is not configured

warning

configuration

Updated Mar 24, 2026

Sources

Concurrency and autoscaling - BentoML Documentation (docs.bentoml.com)

Technologies:

BentoML (subject)

How to detect:

When concurrency is not set on a BentoML Service, autoscaling falls back to CPU-based scaling only, which may not suit the Service's workload and can lead to poor scaling behavior.

Recommended action:

Configure the concurrency parameter in the @bentoml.service decorator (it is set under the decorator's traffic settings). To choose a value, run a stress test with a load generation tool such as Locust to find the maximum number of concurrent requests your Service can handle, then set concurrency slightly below that threshold. For Services that use adaptive or continuous batching, set concurrency to match the batch size. For Services that process requests sequentially, set concurrency to 1.
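The three rules above can be sketched as a small helper that builds the traffic settings. This is only an illustration: the function name recommend_concurrency, its mode strings, and the 10% headroom default are assumptions made here, not part of BentoML. The source only says "slightly below" the measured maximum.

```python
def recommend_concurrency(mode, max_concurrent=None, batch_size=None,
                          headroom_percent=10):
    """Return a traffic settings dict following the recommended action.

    Hypothetical helper, not a BentoML API. `headroom_percent` is an
    assumed reading of "slightly below" the stress-tested maximum.
    """
    if mode == "sequential":
        # Sequential processing: one request at a time.
        value = 1
    elif mode == "batching":
        # Adaptive/continuous batching: match the batch size.
        if batch_size is None or batch_size < 1:
            raise ValueError("batch_size must be a positive integer")
        value = batch_size
    elif mode == "stress_tested":
        # Stay a little under the maximum found by the load test.
        if max_concurrent is None or max_concurrent < 1:
            raise ValueError("max_concurrent must be a positive integer")
        value = max(1, max_concurrent * (100 - headroom_percent) // 100)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {"concurrency": value}


# Example: a load test showed the Service saturating at 40 concurrent requests.
traffic = recommend_concurrency("stress_tested", max_concurrent=40)
print(traffic)
```

The resulting dict is intended to be passed as the traffic argument of the decorator, for example @bentoml.service(traffic={"concurrency": 36}) on the Service class.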