BentoML adaptive batching configuration for optimal throughput
info | performance | Updated Apr 16, 2024 (via Exa)
Technologies:
How to detect:
To optimize inference throughput and latency, configure BentoML Runners with max_batch_size and max_latency_ms values that match the workload's characteristics.
Recommended action:
Set max_batch_size (e.g., 50 for moderate batching) and max_latency_ms (e.g., 600 for an acceptable wait time of 600 ms) in the Runner's method_configs. These parameters control how BentoML accumulates requests before running inference. Adjust both values based on the observed latency versus throughput tradeoff, using metrics such as bentoml.runner.adaptive_batch.size and bentoml.runner.adaptive_batch.wait_duration.
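
The recommended settings could be expressed along these lines. This is a minimal sketch, assuming the BentoML 1.x Runner API, where to_runner accepts a method_configs dictionary keyed by method name. The helper build_method_configs, the method name "predict", and the model tag "my_model:latest" are hypothetical names introduced for illustration, and the values 50 and 600 are the starting points quoted above, not tuned optima.

```python
from typing import Dict

# Starting values from the recommendation above; tune them against observed metrics.
MAX_BATCH_SIZE = 50
MAX_LATENCY_MS = 600


def build_method_configs(
    method_name: str = "predict",  # hypothetical method name
    max_batch_size: int = MAX_BATCH_SIZE,
    max_latency_ms: int = MAX_LATENCY_MS,
) -> Dict[str, Dict[str, int]]:
    """Build a per-method batching config dictionary.

    The shape (method name -> settings) is an assumption about what
    the Runner's method_configs argument expects in BentoML 1.x.
    """
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    if max_latency_ms < 1:
        raise ValueError("max_latency_ms must be at least 1")
    return {
        method_name: {
            "max_batch_size": max_batch_size,
            "max_latency_ms": max_latency_ms,
        }
    }


def make_runner(model_tag: str = "my_model:latest"):
    """Create a Runner with the batching settings applied.

    Assumes BentoML 1.x is installed and that a model with this
    (hypothetical) tag was saved with a batchable method signature.
    """
    import bentoml  # imported lazily so the helper above works without BentoML

    model = bentoml.models.get(model_tag)
    return model.to_runner(method_configs=build_method_configs())
```

Keeping the values in one helper makes it easier to change them in a single place while comparing the batch size and wait duration metrics between deployments.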