bentoml.runner.processing_latency
Runner processing latency
Available on:
- Prometheus (1)
- Interface Metrics (1)
Dimensions: None
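This metric records how long a runner spends processing dispatched work. As a rough, stdlib-only illustration of what such an observation captures (this is not BentoML's actual instrumentation, which exports a Prometheus histogram from the runner process), a minimal latency recorder might look like:

```python
import time
from functools import wraps

# Hypothetical stand-in for a metrics client: each call appends one
# wall-clock latency observation, in seconds.
latency_observations: list[float] = []

def observe_processing_latency(fn):
    """Record wall-clock processing time for each call of fn."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latency_observations.append(time.perf_counter() - start)
    return wrapper

@observe_processing_latency
def run_inference(batch):
    # Placeholder for the model call a runner would make.
    time.sleep(0.01)
    return [x * 2 for x in batch]

run_inference([1, 2, 3])
print(f"observed {len(latency_observations)} call(s), "
      f"last latency ~ {latency_observations[-1]:.3f}s")
```

In a real deployment the observations would go into a histogram so that percentiles (p50/p95/p99) can be queried, since batch-size variance makes averages misleading.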
Technical Annotations (62)
Configuration Parameters (13)
- threads (recommended: N, where N > 1)
- workers (recommended: 2)
- max_batch_size (recommended: 15)
- max_latency_ms (recommended: 1000)
- max_latency
- runner.batching.target_latency_ms (recommended: 0)
- runner.batching.strategy
- runner.batching.max_batch_size
- runner.batching.max_latency
- runner.max_latency (recommended: 60000)
- batchable (recommended: True)
- batch_dim (recommended: 0)
- runners.<runner_name>.batching.max_latency_ms (recommended: increase the value when 503s occur)
Error Signatures (6)
- aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected (exception)
- EOF (exception)
- 0 (HTTP status)
- 503 (HTTP status)
- "BentoML has detected that a service has a max latency that is likely too low for serving" (log pattern)
- Generation Stalls (log pattern)
CLI Commands (1)
- k6 run --http-debug="full" (diagnostic)
Technical References (42)
- async_run_method (component)
- remote runner handle (component)
- /usr/local/lib/python3.9/site-packages/bentoml/_internal/runner/runner_handle/remote.py (file path)
- feature drift (concept)
- concept drift (concept)
- prediction latency (concept)
- resource exhaustion (concept)
- @bentoml.task (component)
- @bentoml.service (component)
- batch endpoint (concept)
- sync API method (concept)
- __call__ (component)
- Runner (component)
- API Server (component)
- ML Model process (component)
- adaptive batching algorithm (component)
- batch engine (component)
- @bentoml.api decorator (component)
- adaptive batching (concept)
- RunnerApp (component)
- micro-batching (concept)
- CORK algorithm (component)
- CorkDispatcher (component)
- dispatcher.py (file path)
- Time-Between-Tokens (concept)
- Decode stage (component)
- autoregressive generation (concept)
- iterative generation process (concept)
- KV-cache hit rates (concept)
- workload-aware baselines (concept)
- NLURunnable (component)
- bentoml.Runnable.method (component)
- async_run (component)
- Runnable (component)
- Time-Between-Tokens (TBT) (concept)
- Decode phase (component)
- Generation Stalls (concept)
- PagedAttention (concept)
- RadixAttention (concept)
- Time-to-First-Token (TTFT) (concept)
- backpressure (concept)
- predict_lock (component)
Related Insights (23)
- Runner async communication failure during model inference (critical)
- Training/serving skew causes production model performance degradation over time (warning)
- Infrastructure anomalies correlate with prediction service degradation (critical)
- BentoML task performance degrades over repeated batches due to single-thread limit (warning)
- Custom metrics gathering adds 5ms latency overhead to inference requests (info)
- Production mode introduces 10ms overhead per runner call (warning)
- EOF errors and status code 0 under high concurrent load indicate connection drops (warning)
- Variable-length inputs cause adaptive batching to underperform or slow down inference (warning)
- All runner workers busy prevents new request processing (critical)
- Suboptimal batch sizes reduce throughput efficiency (warning)
- HTTP 503 errors when adaptive batching exceeds max_latency_ms (warning)
- Insufficient visibility into adaptive batching decisions impacts troubleshooting (warning)
- Batching strategy configuration requires load testing before production deployment (warning)
- HTTP 503 errors when max_latency_ms is too low for model processing time (critical)
- Generation stalls cause multi-second inter-token delays degrading user experience (critical)
- Aggregated request-level metrics mask micro-stalls in token generation (warning)
- Static latency thresholds fail under variable request length causing false positives (warning)
- Adaptive batching shows no performance improvement over sequential processing (warning)
- HTTP 503 errors when adaptive batching cannot meet max_latency_ms constraint (warning)
- Non-batchable parameter types bypass adaptive batching optimization (info)
- Generation stalls cause multi-second pauses in LLM token generation violating SLAs (critical)
- Prefill latency exhibits extreme variance making baseline modeling unreliable (info)
- Streaming generators complete immediately preventing natural backpressure (warning)
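Several of the insights above concern the adaptive-batching trade-off between max_batch_size and max_latency_ms, including the 503 responses emitted when the latency budget cannot be met. The following synchronous sketch shows only the accept/reject decision against a latency budget, assuming a fixed per-item processing cost; it is not BentoML's CorkDispatcher, which runs asynchronously inside the runner app and estimates costs from recent observations:

```python
import time
from collections import deque

MAX_BATCH_SIZE = 4           # cf. max_batch_size
MAX_LATENCY_MS = 50.0        # cf. max_latency_ms: total budget per request
EST_COST_PER_ITEM_MS = 10.0  # assumed fixed per-item processing cost

def dispatch(queue: deque) -> tuple[list, list]:
    """Drain the queue into one batch, rejecting requests whose latency
    budget would be exceeded. Returns (batch, rejected) request ids.

    A real adaptive dispatcher also waits ("corks") briefly to let a
    batch fill; this sketch covers only the budget check."""
    batch, rejected = [], []
    now = time.monotonic()
    while queue and len(batch) < MAX_BATCH_SIZE:
        req_id, enqueued_at = queue.popleft()
        waited_ms = (now - enqueued_at) * 1000.0
        # Projected completion: time already spent queued + batch cost.
        projected_ms = waited_ms + (len(batch) + 1) * EST_COST_PER_ITEM_MS
        if projected_ms > MAX_LATENCY_MS:
            rejected.append(req_id)   # a server would answer 503 here
        else:
            batch.append(req_id)
    return batch, rejected

# Usage: one stale request (queued ~60 ms ago) plus three fresh ones.
now = time.monotonic()
q = deque([("stale", now - 0.060), ("a", now), ("b", now), ("c", now)])
batch, rejected = dispatch(q)
print(batch, rejected)  # the stale request exceeds the 50 ms budget
```

This mirrors why raising runners.<runner_name>.batching.max_latency_ms reduces 503s at the cost of higher tail latency: the budget check above is the only thing standing between a queued request and a rejection.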