bentoml.api_server.request.duration
API server request duration
Dimensions: None
Technical Annotations (82)
Configuration Parameters (24)
mb_max_latency (recommended: 720000)
timeout (recommended: value appropriate for prediction duration)
threads (recommended: N, where N > 1)
workers (recommended: 2)
max_batch_size (recommended: 15)
max_latency_ms (recommended: 1000)
metrics.duration.buckets (recommended: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
metrics.duration.min
metrics.duration.max
metrics.duration.factor
monitoring.enabled (recommended: true)
monitoring.type (recommended: default)
monitoring.options.log_path (recommended: path/to/log/file)
traffic.timeout (recommended: 3600)
--timeout (recommended: 540)
api_server.timeout (recommended: 60)
runners.timeout (recommended: 300)
max-latency (recommended: 10s)
api_server.metrics.duration.min (recommended: 0.1)
api_server.metrics.duration.max (recommended: 5.0)
api_server.metrics.duration.factor (recommended: 2.0)
traffic.external_queue (recommended: true)
traffic.concurrency (recommended: required when external_queue is enabled)
runners.batching.max_latency_ms (recommended: 60000)

Error Signatures (7)
WORKER TIMEOUT (log pattern)
Server disconnected (exception)
503 (http status)
aiohttp.client_exceptions.ServerDisconnectedError (exception)
bentoml.exceptions.RemoteException (exception)
Not able to process the request in 60.0 seconds (error code)
asyncio.exceptions.TimeoutError (exception)

CLI Commands (9)
bentoml serve-gunicorn --timeout <seconds> (remediation)
bentoml serve service:svc --timeout=3600 (remediation)
bentoml serve-gunicorn TimeoutIssue:latest --timeout 540 (diagnostic)
pip install bentoml --pre (remediation)
bentoml serve my_model:latest --production (diagnostic)
bentoml serve my_model:latest --reload (diagnostic)
docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH (remediation)
bentoml serve --max-latency (diagnostic)
docker run -e BENTOML_CONFIG_OPTIONS='runners.timeout=3600' -it --rm -p 3000:3000 your_service serve --production (remediation)

Technical References (42)
mb_max_latency (component)
gunicorn worker timeout (concept)
feature drift (concept)
concept drift (concept)
@bentoml.task (component)
@bentoml.service (component)
batch endpoint (concept)
sync API method (concept)
background task (concept)
async method (concept)
task.get() (component)
task.get_status() (component)
monitoring API (component)
bentoml.monitor (component)
configuration.yml (file path)
/bentoml/_internal/server/http/traffic.py (file path)
TimeoutMiddleware (component)
@bentoml.asgi_app (component)
API server timeout config (component)
middleware (component)
MarshalService (component)
bentoml/server/marshal_server.py (file path)
aiohttp client (component)
adaptive batching (component)
runner process (component)
API server process (component)
scappy_runner (component)
configuration.yaml (file path)
traffic (component)
request_duration_seconds (component)
Histogram (concept)
gRPC deadline (protocol)
BentoServer (component)
runners (component)
histogram buckets (concept)
external request queue (component)
CORK algorithm (component)
autoregressive generation (concept)
iterative generation process (concept)
Time-Between-Tokens (TBT) (concept)
Decode phase (component)
Generation Stalls (concept)

Related Insights (21)
Gunicorn worker timeout causes 503 errors despite high configured mb_max_latency (critical)
Training/serving skew causes production model performance degradation over time (warning)
BentoML task performance degrades over repeated batches due to single-thread limit (warning)
BentoML tasks block synchronous execution flow instead of running in background (warning)
Histogram bucket configuration impacts cardinality and accuracy (info)
ML model performance degradation due to unmonitored drift (critical)
BentoML timeout middleware enforces 60-second default regardless of configured timeout for mounted FastAPI apps (warning)
API server requests timeout without server-side configuration (warning)
Internal marshal service timeout not synchronized with serve-gunicorn timeout argument (critical)
Production flag causes high inter-process communication overhead in model serving (warning)
API server timeout configuration fails to terminate long-running requests (warning)
Timeout expiration causes request failures on long-running inference tasks (warning)
Histogram bucket misconfiguration causes incomplete latency distribution (warning)
API server timeout configuration prevents request timeouts (warning)
BentoML 0.13-LTS batching latency excluded from request duration metrics (warning)
BentoML metrics exclude socket IO time from request duration (info)
Request duration histogram bucket misconfiguration causes inaccurate latency tracking (warning)
External queue increases Service latency due to extra I/O operations (info)
Increased latency from excessive max_latency_ms allowing oversized batches (warning)
Aggregated request-level metrics mask micro-stalls in token generation (warning)
Generation stalls cause multi-second pauses in LLM token generation violating SLAs (critical)
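
The api_server.metrics.duration.min, max, and factor parameters listed under Configuration Parameters suggest an exponential bucket layout, and several insights above concern bucket misconfiguration. The sketch below shows one common way such bounds could be derived and how observations fall into cumulative buckets. It is an illustration under stated assumptions, not BentoML's confirmed implementation: the function names `exponential_buckets` and `cumulative_counts` are hypothetical, and the exact rounding or capping rules BentoML applies may differ.

```python
import bisect
import math


def exponential_buckets(start: float, factor: float, end: float) -> list[float]:
    """Upper bounds growing by `factor` from `start`, capped at `end`, plus +Inf.

    Hypothetical helper: assumes the min/max/factor settings describe a
    geometric progression, as is common for Prometheus-style histograms.
    """
    if start <= 0 or factor <= 1 or end <= start:
        raise ValueError("need start > 0, factor > 1, end > start")
    bounds = []
    bound = start
    while bound < end:
        bounds.append(bound)
        bound *= factor
    bounds.append(end)       # the configured maximum is always a boundary
    bounds.append(math.inf)  # catch-all bucket for anything slower
    return bounds


def cumulative_counts(durations: list[float], bounds: list[float]) -> list[int]:
    """Prometheus-style cumulative counts: observations <= each upper bound."""
    ordered = sorted(durations)
    return [bisect.bisect_right(ordered, b) for b in bounds]


if __name__ == "__main__":
    # Recommended values from the parameter list: min 0.1, max 5.0, factor 2.0.
    bounds = exponential_buckets(0.1, 2.0, 5.0)
    print(bounds)
    # Sample request durations in seconds (made up for illustration).
    print(cumulative_counts([0.05, 0.3, 0.3, 2.0, 12.0], bounds))
```

With these settings, any request slower than the configured maximum of 5.0 seconds is counted only in the +Inf bucket, so its actual duration cannot be recovered from the histogram. That is the kind of incomplete latency distribution the bucket misconfiguration insights describe, and it is why the maximum should sit above the longest configured timeout when long-running inference is expected.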