BentoML

Adaptive batching exceeds max_batch_size causing OOM when inputs have batch dimension

critical
Resource Contention
Updated May 12, 2023
How to detect:

When adaptive batching is configured with max_batch_size=32 and each input array already carries a batch dimension (e.g., shape [32, 3, 224, 224]), BentoML combines several such requests into a single batch. The effective batch size then far exceeds max_batch_size (e.g., 11 requests × 32 images = 352 images). GPU memory usage spikes well above the level implied by max_batch_size, which can trigger out-of-memory errors.
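The arithmetic behind this symptom can be sketched in a few lines. This is only an illustration of the behavior described above, not BentoML's internal logic; the function name effective_batch_size is hypothetical.

```python
from typing import Sequence


def effective_batch_size(rows_per_request: Sequence[int]) -> int:
    """Total rows the model sees when whole requests are merged into one batch.

    Hypothetical helper: it models the symptom described above, where each
    request contributes its full leading (batch) dimension to the merged batch.
    """
    return sum(rows_per_request)


# 11 requests, each shaped [32, 3, 224, 224], merged into one batch.
merged = effective_batch_size([32] * 11)
print(merged)  # 352, versus a configured max_batch_size of 32
```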

Recommended action:

Confirm that your BentoML version includes the fix from PR #3973 (merged July 2023). If you are on an older version (1.0.19-1.0.20), choose one of the following:
1) Upgrade to a version with the fix.
2) Reduce input batch sizes to 1 before calling runner.async_run() so that BentoML can batch safely.
3) Lower the max_batch_size configuration to account for pre-batched inputs (divide the desired batch size by the typical input batch dimension).
Monitor the bentoml.runner.adaptive_batch.size metric and GPU memory usage to detect when actual batch sizes exceed expectations.
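Workarounds 2 and 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not a BentoML API: adjusted_max_batch_size and run_in_single_row_slices are hypothetical helpers, and run_one stands in for whatever coroutine your service awaits (for example, a call to runner.async_run()). The slicing keeps a leading dimension of 1, which works the same way for Python lists and for array types that support len() and slice indexing.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


def adjusted_max_batch_size(desired_rows: int, typical_input_rows: int) -> int:
    """Workaround 3: scale max_batch_size down for pre-batched inputs.

    Hypothetical helper. Divides the desired number of rows per model call
    by the typical batch dimension of one request, never going below 1.
    """
    if desired_rows < 1 or typical_input_rows < 1:
        raise ValueError("both arguments must be positive")
    return max(1, desired_rows // typical_input_rows)


async def run_in_single_row_slices(
    batch: Sequence[Any],
    run_one: Callable[[Sequence[Any]], Awaitable[Sequence[Any]]],
) -> List[Any]:
    """Workaround 2: submit each row separately, keeping a batch dim of 1.

    run_one is an assumption: any coroutine function that accepts a
    one-row slice and returns a one-row result.
    """
    slices = [batch[i : i + 1] for i in range(len(batch))]
    outputs = await asyncio.gather(*(run_one(s) for s in slices))
    # Flatten the one-row results back into a single list, in input order.
    return [row for out in outputs for row in out]
```

With a desired batch of 32 rows and requests that typically carry 8 rows each, adjusted_max_batch_size(32, 8) gives 4. If requests already carry 32 rows, the result is 1, meaning each request fills the desired batch on its own.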