BentoML adaptive batching configuration for optimal throughput

performance. Updated Apr 16, 2024 (via Exa)

Sources

Monitoring Metrics in BentoML with Prometheus and Grafana (www.bentoml.com)

Technologies:

BentoML (subject)

How to detect:

To optimize inference throughput and latency, each BentoML Runner should have max_batch_size and max_latency_ms set to match its workload characteristics, such as request rate and per-request inference cost.

Recommended action:

Set max_batch_size and max_latency_ms in the Runner's method_configs, for example max_batch_size of 50 for moderate batching and max_latency_ms of 600 (600 ms) for an acceptable wait time. These parameters control how BentoML accumulates requests before running inference: larger values tend to raise throughput at the cost of per-request latency. Adjust both based on the observed latency versus throughput tradeoff, using metrics such as bentoml.runner.adaptive_batch.size and bentoml.runner.adaptive_batch.wait_duration.
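
To build intuition for how the two settings interact, here is a minimal standalone sketch in plain Python. It does not use BentoML and is not BentoML's dispatcher, which is adaptive and more involved. It assumes a simplified model in which a batch is dispatched as soon as it reaches max_batch_size, or once its oldest request has waited max_latency_ms, whichever comes first. The names simulate_batching and BatchStats are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchStats:
    batches: List[List[float]]  # arrival timestamps (ms), grouped per batch
    waits_ms: List[float]       # per-request wait between arrival and dispatch


def simulate_batching(arrivals_ms: List[float],
                      max_batch_size: int,
                      max_latency_ms: float) -> BatchStats:
    """Simplified batching model (an assumption, not BentoML's algorithm).

    A batch is dispatched when it holds max_batch_size requests, or when
    its oldest request has waited max_latency_ms, whichever happens first.
    """
    batches: List[List[float]] = []
    waits: List[float] = []
    current: List[float] = []

    def flush(dispatch_at: float) -> None:
        # Record how long each queued request waited, then start a new batch.
        waits.extend(dispatch_at - a for a in current)
        batches.append(list(current))
        current.clear()

    for t in sorted(arrivals_ms):
        # Deadline flush: the oldest queued request ran out of wait budget
        # before this request arrived.
        if current and t - current[0] > max_latency_ms:
            flush(current[0] + max_latency_ms)
        current.append(t)
        # Size flush: the batch is full, dispatch immediately.
        if len(current) == max_batch_size:
            flush(t)

    # Any leftover requests are dispatched when the wait budget expires.
    if current:
        flush(current[0] + max_latency_ms)
    return BatchStats(batches, waits)


if __name__ == "__main__":
    # Hypothetical workload: one request every 20 ms, 100 requests in total.
    arrivals = [i * 20.0 for i in range(100)]
    for size in (50, 10):
        stats = simulate_batching(arrivals, max_batch_size=size,
                                  max_latency_ms=600)
        mean_wait = sum(stats.waits_ms) / len(stats.waits_ms)
        print(f"max_batch_size={size}: "
              f"batch sizes={[len(b) for b in stats.batches]}, "
              f"mean wait={mean_wait:.0f} ms, "
              f"max wait={max(stats.waits_ms):.0f} ms")
```

Under this toy workload, max_batch_size of 50 is never reached within the 600 ms budget, so batches are cut by the deadline and some requests wait the full 600 ms. Lowering max_batch_size to 10 produces smaller batches that fill and dispatch sooner. On a real deployment the same comparison would be made from the batch size and wait duration metrics rather than from a simulation.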