Technologies/BentoML/bentoml.runner.adaptive_batch.size
BentoML · Metric

bentoml.runner.adaptive_batch.size

Adaptive batch size
Dimensions: None
Available on: Prometheus (1), OpenTelemetry (1), Datadog (1)
Interface Metrics (3)
Prometheus
Current adaptive batch size for the runner
Dimensions: None
OpenTelemetry
Total batch size for adaptive batching in runner
Dimensions: None
Datadog
Size of batches processed by adaptive batching runner
Dimensions: None

Technical Annotations (49)

Configuration Parameters (19)
max_batch_size (recommended: 50)
Maximum number of requests to batch together for inference, in Runner method_configs
max_latency_ms (recommended: 600)
Maximum time to wait for batch accumulation before processing, in Runner method_configs
batch_size
Must be copied to a separate variable before the dispatcher's training logic overrides it
runners.batching.max_batch_size (recommended: 5)
Maximum number of requests to batch together for inference; must account for pre-batched client inputs
max_latency
Current batching parameter with limited control over batch composition logic
threads
Parameter in the @bentoml.service decorator that sets the concurrency level for synchronous endpoints
runners.<runner_name>.batching.max_batch_size (recommended: 32)
Must account for input batch dimensions; effective batch size = configured value × input batch dimension
runners.<runner_name>.batching.enabled (recommended: true)
Enables adaptive batching, where this bug manifests
runners.<runner_name>.batching.max_latency_ms (recommended: 100000)
Example value used in reproduction; controls the batching window
batchable (recommended: True)
Must be True for adaptive batching; should be disabled if batching is not needed
traffic.concurrency (recommended: match batch size)
For adaptive/continuous batching Services, should equal the batch size for optimal throughput
runner.batching.target_latency_ms (recommended: 0)
Controls how long the dispatcher waits before executing requests; 0 minimizes wait after bursts
runner.batching.strategy
Strategy selection option; requires load testing to determine the optimal value
runner.batching.max_batch_size
Moved into strategy_options; affects batch formation
runner.batching.max_latency
Moved into strategy_options; controls the maximum acceptable latency
@bentoml.service.threads (recommended: N)
Set to enable concurrent requests from sync endpoints to batchable services
batch_dim (recommended: 0)
Specifies which tensor dimension to batch along
signatures.__call__.batchable (recommended: True)
Enables adaptive batching for models that support batched input
signatures.__call__.batch_dim (recommended: (0, 0))
2-tuple specifying input and output batch dimensions (defaults to 0)
Error Signatures (4)
UnboundLocalError: local variable 'batch_size' referenced before assignment (exception)
AssertionError (exception)
assert start < end (exception)
bentoml/_internal/utils/metrics.py", line 44, in exponential_buckets (exception)
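The first signature above is the standard Python scoping pitfall: assigning to a name anywhere in a function makes it local for the whole function body, so an earlier read fails. A minimal sketch of the pattern, not BentoML's actual dispatcher code:

```python
batch_size = 8  # outer-scope value, akin to a configured parameter


def train_dispatcher(records):
    # Reading batch_size here raises UnboundLocalError: the assignment
    # below makes the name local to the entire function, so the outer
    # value is never visible and the read happens before any assignment.
    history = [batch_size] * 2
    if records:
        batch_size = len(records)  # "training logic overrides" the name
    return history


try:
    train_dispatcher([1, 2, 3])
except UnboundLocalError as exc:
    print(type(exc).__name__)  # UnboundLocalError
```

This is why the batch_size configuration note above says the value must be copied to a separately named variable before the training logic reassigns it.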
CLI Commands (1)
BENTOML_CONFIG=./configuration.yaml bentoml serve --production (diagnostic)
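The configuration.yaml passed via BENTOML_CONFIG would carry the runner batching keys listed above. A sketch of such a file, with "mymodel" as a placeholder runner name and values taken from the recommendations above:

```yaml
# Illustrative configuration.yaml for the serve command above.
runners:
  mymodel:            # placeholder runner name
    batching:
      enabled: true          # turn on adaptive batching
      max_batch_size: 32     # remember: effective size scales with input batch dim
      max_latency_ms: 100000 # batching window used in the reproduction above
```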
Technical References (25)
adaptive batching (concept)
CorkDispatcher (component)
CORK algorithm (concept)
dispatcher loop (component)
adaptive batching algorithm (component)
batch engine (component)
@bentoml.service decorator (component)
synchronous endpoint (concept)
@bentoml.api decorator (component)
onnxruntime.InferenceSession (component)
bentoml._internal.marshal.dispatcher (component)
runner.async_run (component)
exponential_buckets (component)
bentoml._internal.utils.metrics (file path)
continuous batching (concept)
RunnerApp (component)
micro-batching (concept)
@bentoml.service (component)
@bentoml.api(batchable=True) (component)
v1/default_configuration.yaml (file path)
NLURunnable (component)
bentoml.Runnable.method (component)
async_run (component)
Runner (component)
Runnable (component)
Related Insights (22)
BentoML adaptive batching configuration for optimal throughput (info)
Adaptive batching disabled prevents batch size metrics collection (info)
Adaptive batching queue depth causes latency spikes (warning)
BentoML 0.13-LTS batching latency excluded from request duration metrics (warning)
UnboundLocalError in dispatcher training logic when batch_size overridden (critical)
Pre-batched client requests can exceed configured max_batch_size limit (warning)
Variable-length inputs cause adaptive batching to underperform or slow down inference (warning)
Batch splitting behavior may fragment requests across multiple execution cycles (info)
Suboptimal batch sizes reduce throughput efficiency (warning)
Synchronous endpoints limit batching throughput to one request at a time (warning)
Batch size constraint prevents optimal throughput under high load (warning)
Adaptive batching exceeds max_batch_size causing OOM when inputs have a batch dimension (critical)
Adaptive batching crashes with max_batch_size=1 due to metric bucket assertion failure (critical)
Replica count undershoots or overshoots when concurrency mismatches batch size (warning)
Insufficient visibility into adaptive batching decisions impacts troubleshooting (warning)
Batching strategy configuration requires load testing before production deployment (warning)
Insufficient batching throughput when sync endpoints call batchable services with default concurrency (warning)
Memory exhaustion from excessive max_batch_size relative to GPU capacity (warning)
Adaptive batching shows no performance improvement over sequential processing (warning)
Max batch size determines memory capacity limits for GPU/memory resources (warning)
Non-batchable parameter types bypass adaptive batching optimization (info)
Models must be explicitly declared batchable for adaptive batching to work (info)
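The max_batch_size=1 crash flagged as critical above pairs with the "assert start < end" error signature: exponential histogram bucket generation needs a non-degenerate range, and a batch-size histogram spanning [1, max_batch_size] collapses when max_batch_size is 1. A simplified stand-in for the bucket generator, not the actual bentoml._internal.utils.metrics code:

```python
def exponential_buckets(start, factor, end):
    # Simplified Prometheus-style bucket generator; mirrors the
    # `assert start < end` guard quoted in the error signatures above.
    assert start < end
    buckets = []
    bound = start
    while bound < end:
        buckets.append(bound)
        bound *= factor
    return buckets + [end]


# A normal range yields exponentially spaced bucket bounds.
print(exponential_buckets(1, 2.0, 8))  # [1, 2, 4, 8]

# With max_batch_size=1 the range degenerates to start == end
# and the assertion fires before any buckets are produced.
try:
    exponential_buckets(1, 2.0, 1)
except AssertionError:
    print("AssertionError: assert start < end")
```

This is why configuring max_batch_size=1 effectively disables batching but can still crash metric setup rather than degrading gracefully.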