BentoML Metric

bentoml.runner.processing_latency

Runner processing latency
Dimensions: None
Available on: Prometheus (1)
Interface Metrics (1)
Prometheus
Histogram of runner processing latency in seconds
Dimensions: None
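To watch this histogram in practice, a tail-latency query can be written in PromQL. This is a sketch: the exported series name is assumed to be `bentoml_runner_processing_latency_bucket` (dots become underscores and histograms gain a `_bucket` suffix in Prometheus exposition); verify the exact name against your service's `/metrics` endpoint.

```promql
# p95 runner processing latency over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le) (rate(bentoml_runner_processing_latency_bucket[5m]))
)
```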

Technical Annotations (62)

Configuration Parameters (13)
threads (recommended: N, where N > 1)
Set in the @service() decorator to allow concurrent task execution in sync methods
workers (recommended: 2)
Service-level worker count, as shown in the example configuration
max_batch_size (recommended: 15)
Task batch size limit for the degrading operation
max_latency_ms (recommended: 1000)
Maximum latency threshold before a batch is processed
max_latency
Current batching parameter with limited control over batch composition logic
runner.batching.target_latency_ms (recommended: 0)
Controls how long the dispatcher waits before executing requests; 0 minimizes wait after bursts
runner.batching.strategy
Strategy selection option; requires load testing to determine the optimal value
runner.batching.max_batch_size
Moved into strategy_options; affects batch formation
runner.batching.max_latency
Moved into strategy_options; controls the maximum acceptable latency
runner.max_latency (recommended: 60000)
Default is 60000 ms (60 seconds); increase it if you see 503 errors
batchable (recommended: True)
Must be True on the Runnable.method decorator for batching to work
batch_dim (recommended: 0)
Specifies which tensor dimension to batch along
runners.<runner_name>.batching.max_latency_ms (recommended: increase when 503s occur)
Maximum latency in milliseconds that a batch waits before being released for inference
Error Signatures (6)
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected (exception)
EOF (exception)
0 (HTTP status)
503 (HTTP status)
"BentoML has detected that a service has a max latency that is likely too low for serving" (log pattern)
"Generation Stalls" (log pattern)
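The 503s, EOFs, and status-0 disconnects above are typically transient under load, so clients often retry with backoff. A minimal stdlib-only sketch (the URL, retry count, and backoff values are hypothetical, not prescribed by BentoML):

```python
import time
import urllib.error
import urllib.request

def post_with_retry(url, data, retries=3, backoff=0.5):
    """POST with retries on 503 (batch exceeded max_latency_ms) and dropped connections."""
    for attempt in range(retries):
        try:
            req = urllib.request.Request(url, data=data, method="POST")
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code != 503:      # only 503 is treated as retryable here
                raise
        except (ConnectionResetError, EOFError, urllib.error.URLError):
            pass                   # connection dropped mid-request; retry
        time.sleep(backoff * (2 ** attempt))  # exponential backoff between attempts
    raise RuntimeError(f"request failed after {retries} attempts")
```

Retrying blindly can amplify load on an already-saturated runner, so pair this with the `max_latency` tuning described above rather than relying on retries alone.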
CLI Commands (1)
k6 run --http-debug="full" (diagnostic)
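A fuller invocation of the diagnostic above might look like the following; the script name, virtual-user count, and duration are placeholders to adapt to your load test:

```shell
# Hypothetical k6 script; --http-debug="full" dumps each request and response,
# which makes 503s, EOFs, and status-0 disconnects visible in the output.
k6 run --http-debug="full" --vus 50 --duration 60s loadtest.js
```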
Technical References (42)
async_run_method (component)
remote runner handle (component)
/usr/local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/remote.py (file path)
feature drift (concept)
concept drift (concept)
prediction latency (concept)
resource exhaustion (concept)
@bentoml.task (component)
@bentoml.service (component)
batch endpoint (concept)
sync API method (concept)
__call__ (component)
Runner (component)
API Server (component)
ML Model process (component)
adaptive batching algorithm (component)
batch engine (component)
@bentoml.api decorator (component)
adaptive batching (concept)
RunnerApp (component)
micro-batching (concept)
CORK algorithm (component)
CorkDispatcher (component)
dispatcher.py (file path)
Time-Between-Tokens (concept)
Decode stage (component)
autoregressive generation (concept)
iterative generation process (concept)
KV-cache hit rates (concept)
workload-aware baselines (concept)
NLURunnable (component)
bentoml.Runnable.method (component)
async_run (component)
Runnable (component)
Time-Between-Tokens (TBT) (concept)
Decode phase (component)
Generation Stalls (concept)
PagedAttention (concept)
RadixAttention (concept)
Time-to-First-Token (TTFT) (concept)
backpressure (concept)
predict_lock (component)
Related Insights (23)
Runner async communication failure during model inference (critical)
Training/serving skew causes production model performance degradation over time (warning)
Infrastructure anomalies correlate with prediction service degradation (critical)
BentoML task performance degrades over repeated batches due to single-thread limit (warning)
Custom metrics gathering adds 5 ms latency overhead to inference requests (info)
Production mode introduces 10 ms overhead per runner call (warning)
EOF errors and status code 0 under high concurrent load indicate connection drops (warning)
Variable-length inputs cause adaptive batching to underperform or slow down inference (warning)
All runner workers busy prevents new request processing (critical)
Suboptimal batch sizes reduce throughput efficiency (warning)
HTTP 503 errors when adaptive batching exceeds max_latency_ms (warning)
Insufficient visibility into adaptive batching decisions impacts troubleshooting (warning)
Batching strategy configuration requires load testing before production deployment (warning)
HTTP 503 errors when max_latency_ms is too low for model processing time (critical)
Generation stalls cause multi-second inter-token delays, degrading user experience (critical)
Aggregated request-level metrics mask micro-stalls in token generation (warning)
Static latency thresholds fail under variable request lengths, causing false positives (warning)
Adaptive batching shows no performance improvement over sequential processing (warning)
HTTP 503 errors when adaptive batching cannot meet the max_latency_ms constraint (warning)
Non-batchable parameter types bypass adaptive batching optimization (info)
Generation stalls cause multi-second pauses in LLM token generation, violating SLAs (critical)
Prefill latency exhibits extreme variance, making baseline modeling unreliable (info)
Streaming generators complete immediately, preventing natural backpressure (warning)