Prometheus Metric

bentoml.api_server.request.duration

API server request duration
Dimensions: None

Technical Annotations (82)

Configuration Parameters (24)
mb_max_latency (recommended: 720000)
Controls the maximum batching latency tolerance, not the worker timeout; the example uses 720000 ms for predictions taking ~10 minutes.

timeout (recommended: a value longer than the prediction duration)
Gunicorn worker timeout, set via the --timeout flag; must exceed the prediction time. The 60 s default is insufficient for long-running predictions.

threads (recommended: N, where N > 1)
Set in the @service() decorator to allow concurrent task execution in sync methods.

workers (recommended: 2)
Service-level worker count, as shown in the example configuration.

max_batch_size (recommended: 15)
Task batch size limit for the degrading operation.

max_latency_ms (recommended: 1000)
Maximum latency threshold before a batch is processed.

metrics.duration.buckets (recommended: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
Explicit histogram buckets for duration metrics.

metrics.duration.min
Minimum bucket for the exponential distribution.

metrics.duration.max
Maximum bucket for the exponential distribution.

metrics.duration.factor
Growth factor for exponential buckets.

monitoring.enabled (recommended: true)
Enables BentoML monitoring capabilities.

monitoring.type (recommended: default)
Specifies the monitoring backend type.

monitoring.options.log_path (recommended: path/to/log/file)
Destination for monitoring data logs.

traffic.timeout (recommended: 3600)
Service-level timeout configuration that should apply to mounted ASGI apps.

--timeout (recommended: 540)
CLI argument for the serve-gunicorn command specifying the worker timeout in seconds.

api_server.timeout (recommended: 60)
Default timeout for API server requests in seconds; reportedly not enforced.

runners.timeout (recommended: 300)
Default timeout for runner execution in seconds; reportedly not enforced.

max-latency (recommended: 10s)
Default API server maximum latency target.

api_server.metrics.duration.min (recommended: 0.1)
Minimum expected request duration in seconds for histogram tracking.

api_server.metrics.duration.max (recommended: 5.0)
Maximum expected request duration in seconds for histogram tracking.

api_server.metrics.duration.factor (recommended: 2.0)
Exponential factor controlling bucket granularity; smaller values create more buckets.

traffic.external_queue (recommended: true)
Enables request buffering, but adds latency overhead.

traffic.concurrency (recommended: required when external_queue is enabled)
Must be specified if external_queue is true.

runners.batching.max_latency_ms (recommended: 60000)
The default is 60000 ms (60 seconds); reduce it if latency SLAs are tighter.
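The metrics.duration.min / max / factor parameters above describe an exponentially spaced bucket layout. A minimal sketch of how such a layout is typically generated (the exact generation and rounding logic inside BentoML is an assumption here; the input values match the recommendations above):

```python
def exponential_buckets(minimum: float, maximum: float, factor: float) -> list[float]:
    """Generate histogram bucket upper bounds growing geometrically
    from `minimum` by `factor`, capped at `maximum`."""
    buckets = []
    bound = minimum
    while bound < maximum:
        buckets.append(bound)
        bound *= factor
    buckets.append(maximum)  # always include the configured upper bound
    return buckets

# With the recommended values above (min=0.1, max=5.0, factor=2.0):
print(exponential_buckets(0.1, 5.0, 2.0))  # [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0]
```

A smaller factor yields more, finer-grained buckets between min and max, which is the trade-off the api_server.metrics.duration.factor entry describes.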
Error Signatures (7)
WORKER TIMEOUT (log pattern)
Server disconnected (exception)
503 (HTTP status)
aiohttp.client_exceptions.ServerDisconnectedError (exception)
bentoml.exceptions.RemoteException (exception)
Not able to process the request in 60.0 seconds (error code)
asyncio.exceptions.TimeoutError (exception)
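Several of these signatures (asyncio.exceptions.TimeoutError, "Not able to process the request in 60.0 seconds") are what a request looks like once it outlives its timeout budget. A minimal stdlib-only sketch of the mechanism (the handler name, return values, and budgets are illustrative, not BentoML internals):

```python
import asyncio

async def slow_prediction() -> str:
    # Stand-in for a long-running inference call.
    await asyncio.sleep(0.2)
    return "ok"

async def serve_with_budget(timeout_s: float) -> str:
    try:
        return await asyncio.wait_for(slow_prediction(), timeout=timeout_s)
    except asyncio.exceptions.TimeoutError:
        # Mirrors the "Not able to process the request in N seconds" failure mode.
        return "503"

# A budget shorter than the prediction fails; a longer one succeeds.
print(asyncio.run(serve_with_budget(0.05)))  # 503
print(asyncio.run(serve_with_budget(1.0)))   # ok
```

This is why several insights below stress that the timeout must exceed the worst-case prediction time: the error surfaces on the client as a 503 or disconnect even though the model is still computing.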
CLI Commands (9)
bentoml serve-gunicorn --timeout <seconds> (remediation)
bentoml serve service:svc --timeout=3600 (remediation)
bentoml serve-gunicorn TimeoutIssue:latest --timeout 540 (diagnostic)
pip install bentoml --pre (remediation)
bentoml serve my_model:latest --production (diagnostic)
bentoml serve my_model:latest --reload (diagnostic)
docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH (remediation)
bentoml serve --max-latency (diagnostic)
docker run -e BENTOML_CONFIG_OPTIONS='runners.timeout=3600' -it --rm -p 3000:3000 your_service serve --production (remediation)
Technical References (42)
mb_max_latency (component)
gunicorn worker timeout (concept)
feature drift (concept)
concept drift (concept)
@bentoml.task (component)
@bentoml.service (component)
batch endpoint (concept)
sync API method (concept)
background task (concept)
async method (concept)
task.get() (component)
task.get_status() (component)
monitoring API (component)
bentoml.monitor (component)
configuration.yml (file path)
/bentoml/_internal/server/http/traffic.py (file path)
TimeoutMiddleware (component)
@bentoml.asgi_app (component)
API server timeout config (component)
middleware (component)
MarshalService (component)
bentoml/server/marshal_server.py (file path)
aiohttp client (component)
adaptive batching (component)
runner process (component)
API server process (component)
scappy_runner (component)
configuration.yaml (file path)
traffic (component)
request_duration_seconds (component)
Histogram (concept)
gRPC deadline (protocol)
BentoServer (component)
runners (component)
histogram buckets (concept)
external request queue (component)
CORK algorithm (component)
autoregressive generation (concept)
iterative generation process (concept)
Time-Between-Tokens (TBT) (concept)
Decode phase (component)
Generation Stalls (concept)
Related Insights (21)
Gunicorn worker timeout causes 503 errors despite high configured mb_max_latency (critical)
Training/serving skew causes production model performance degradation over time (warning)
BentoML task performance degrades over repeated batches due to single-thread limit (warning)
BentoML tasks block synchronous execution flow instead of running in background (warning)
Histogram bucket configuration impacts cardinality and accuracy (info)
ML model performance degradation due to unmonitored drift (critical)
BentoML timeout middleware enforces 60-second default regardless of configured timeout for mounted FastAPI apps (warning)
API server requests time out without server-side configuration (warning)
Internal marshal service timeout not synchronized with serve-gunicorn timeout argument (critical)
Production flag causes high inter-process communication overhead in model serving (warning)
API server timeout configuration fails to terminate long-running requests (warning)
Timeout expiration causes request failures on long-running inference tasks (warning)
Histogram bucket misconfiguration causes incomplete latency distribution (warning)
API server timeout configuration prevents request timeouts (warning)
BentoML 0.13-LTS batching latency excluded from request duration metrics (warning)
BentoML metrics exclude socket IO time from request duration (info)
Request duration histogram bucket misconfiguration causes inaccurate latency tracking (warning)
External queue increases Service latency due to extra I/O operations (info)
Increased latency from excessive max_latency_ms allowing oversized batches (warning)
Aggregated request-level metrics mask micro-stalls in token generation (warning)
Generation stalls cause multi-second pauses in LLM token generation violating SLAs (critical)
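The last two insights note that an aggregated request-level duration histogram only sees the end-to-end total, so a multi-second pause between individual tokens can hide inside an otherwise acceptable request duration. A minimal sketch of detecting such stalls from per-token timestamps (the timestamps and the 1 s stall threshold are illustrative assumptions):

```python
def detect_stalls(token_times: list[float], stall_threshold_s: float) -> list[float]:
    """Return the inter-token gaps (Time-Between-Tokens) that exceed the
    threshold. A request-level metric like request.duration aggregates
    these gaps away, so stalls must be measured per token."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return [g for g in gaps if g > stall_threshold_s]

# Tokens emitted at these wall-clock offsets (seconds): one 2.5 s stall mid-stream.
times = [0.0, 0.1, 0.2, 2.7, 2.8, 2.9]
print(detect_stalls(times, stall_threshold_s=1.0))  # [2.5]
```

Here the total duration is 2.9 s, which could land in a perfectly ordinary histogram bucket, yet a single 2.5 s decode-phase stall violates a per-token SLA.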