Prometheus Metric

bentoml.api_server.request.duration

API server request duration
Dimensions: None

Technical Annotations (82)

Configuration Parameters (24)
mb_max_latency (recommended: 720000)
Controls the maximum batching latency tolerance, not the worker timeout; the example uses 720000 ms for predictions taking ~10 minutes.

timeout (recommended: a value longer than the prediction duration)
Gunicorn worker timeout, set via the --timeout flag; must exceed the prediction time. The 60 s default is insufficient for long-running predictions.

threads (recommended: N, where N > 1)
Set in the @service() decorator to allow concurrent task execution in sync methods.

workers (recommended: 2)
Service-level worker count, as shown in the example configuration.

max_batch_size (recommended: 15)
Task batch size limit for the degrading operation.

max_latency_ms (recommended: 1000)
Maximum latency threshold before a batch is processed.

metrics.duration.buckets (recommended: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
Explicit histogram buckets for duration metrics.

metrics.duration.min
Minimum bucket for the exponential distribution.

metrics.duration.max
Maximum bucket for the exponential distribution.

metrics.duration.factor
Growth factor for exponential buckets.

monitoring.enabled (recommended: true)
Enables BentoML monitoring capabilities.

monitoring.type (recommended: default)
Specifies the monitoring backend type.

monitoring.options.log_path (recommended: path/to/log/file)
Destination for monitoring data logs.

traffic.timeout (recommended: 3600)
Service-level timeout configuration that should apply to mounted ASGI apps.

--timeout (recommended: 540)
CLI argument for the serve-gunicorn command specifying the worker timeout in seconds.

api_server.timeout (recommended: 60)
Default timeout for API server requests in seconds; reportedly not enforced.

runners.timeout (recommended: 300)
Default timeout for runner execution in seconds; reportedly not enforced.

max-latency (recommended: 10s)
Default API server maximum latency target.

api_server.metrics.duration.min (recommended: 0.1)
Minimum expected request duration in seconds for histogram tracking.

api_server.metrics.duration.max (recommended: 5.0)
Maximum expected request duration in seconds for histogram tracking.

api_server.metrics.duration.factor (recommended: 2.0)
Exponential factor controlling bucket granularity; smaller values create more buckets.

traffic.external_queue (recommended: true)
Enables request buffering, but adds latency overhead.

traffic.concurrency (recommended: required when external_queue is enabled)
Must be specified if external_queue is true.

runners.batching.max_latency_ms (recommended: 60000)
The default is 60000 ms (60 seconds); reduce it if latency SLAs are tighter.
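The metrics.duration.min / max / factor parameters above describe an exponentially spaced bucket layout. A minimal sketch of how such a layout is typically generated (the exact generation and rounding logic inside BentoML is an assumption here; the input values match the recommendations above):

```python
def exponential_buckets(minimum: float, maximum: float, factor: float) -> list[float]:
    """Generate histogram bucket upper bounds growing geometrically
    from `minimum` by `factor`, capped at `maximum`."""
    buckets = []
    bound = minimum
    while bound < maximum:
        buckets.append(bound)
        bound *= factor
    buckets.append(maximum)  # always include the configured upper bound
    return buckets

# With the recommended values above (min=0.1, max=5.0, factor=2.0):
print(exponential_buckets(0.1, 5.0, 2.0))  # [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0]
```

A smaller factor yields more, finer-grained buckets between min and max, which is the trade-off the api_server.metrics.duration.factor entry describes.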
Error Signatures (7)
WORKER TIMEOUT (log pattern)
Server disconnected (exception)
503 (HTTP status)
aiohttp.client_exceptions.ServerDisconnectedError (exception)
bentoml.exceptions.RemoteException (exception)
Not able to process the request in 60.0 seconds (error code)
asyncio.exceptions.TimeoutError (exception)
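Several of these signatures (asyncio.exceptions.TimeoutError, "Not able to process the request in 60.0 seconds") are what a request looks like once it outlives its timeout budget. A minimal stdlib-only sketch of the mechanism (the handler name, return values, and budgets are illustrative, not BentoML internals):

```python
import asyncio

async def slow_prediction() -> str:
    # Stand-in for a long-running inference call.
    await asyncio.sleep(0.2)
    return "ok"

async def serve_with_budget(timeout_s: float) -> str:
    try:
        return await asyncio.wait_for(slow_prediction(), timeout=timeout_s)
    except asyncio.exceptions.TimeoutError:
        # Mirrors the "Not able to process the request in N seconds" failure mode.
        return "503"

# A budget shorter than the prediction fails; a longer one succeeds.
print(asyncio.run(serve_with_budget(0.05)))  # 503
print(asyncio.run(serve_with_budget(1.0)))   # ok
```

This is why several insights below stress that the timeout must exceed the worst-case prediction time: the error surfaces on the client as a 503 or disconnect even though the model is still computing.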
CLI Commands (9)
bentoml serve-gunicorn --timeout <seconds> (remediation)
bentoml serve service:svc --timeout=3600 (remediation)
bentoml serve-gunicorn TimeoutIssue:latest --timeout 540 (diagnostic)
pip install bentoml --pre (remediation)
bentoml serve my_model:latest --production (diagnostic)
bentoml serve my_model:latest --reload (diagnostic)
docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH (remediation)
bentoml serve --max-latency (diagnostic)
docker run -e BENTOML_CONFIG_OPTIONS='runners.timeout=3600' -it --rm -p 3000:3000 your_service serve --production (remediation)
Technical References (42)
mb_max_latency (component)
gunicorn worker timeout (concept)
feature drift (concept)
concept drift (concept)
@bentoml.task (component)
@bentoml.service (component)
batch endpoint (concept)
sync API method (concept)
background task (concept)
async method (concept)
task.get() (component)
task.get_status() (component)
monitoring API (component)
bentoml.monitor (component)
configuration.yml (file path)
/bentoml/_internal/server/http/traffic.py (file path)
TimeoutMiddleware (component)
@bentoml.asgi_app (component)
API server timeout config (component)
middleware (component)
MarshalService (component)
bentoml/server/marshal_server.py (file path)
aiohttp client (component)
adaptive batching (component)
runner process (component)
API server process (component)
scappy_runner (component)
configuration.yaml (file path)
traffic (component)
request_duration_seconds (component)
Histogram (concept)
gRPC deadline (protocol)
BentoServer (component)
runners (component)
histogram buckets (concept)
external request queue (component)
CORK algorithm (component)
autoregressive generation (concept)
iterative generation process (concept)
Time-Between-Tokens (TBT) (concept)
Decode phase (component)
Generation Stalls (concept)
Related Insights (21)
Gunicorn worker timeout causes 503 errors despite high configured mb_max_latency (critical)
Training/serving skew causes production model performance degradation over time (warning)
BentoML task performance degrades over repeated batches due to single-thread limit (warning)
BentoML tasks block synchronous execution flow instead of running in background (warning)
Histogram bucket configuration impacts cardinality and accuracy (info)
ML model performance degradation due to unmonitored drift (critical)
BentoML timeout middleware enforces 60-second default regardless of configured timeout for mounted FastAPI apps (warning)
API server requests time out without server-side configuration (warning)
Internal marshal service timeout not synchronized with serve-gunicorn timeout argument (critical)
Production flag causes high inter-process communication overhead in model serving (warning)
API server timeout configuration fails to terminate long-running requests (warning)
Timeout expiration causes request failures on long-running inference tasks (warning)
Histogram bucket misconfiguration causes incomplete latency distribution (warning)
API server timeout configuration prevents request timeouts (warning)
BentoML 0.13-LTS batching latency excluded from request duration metrics (warning)
BentoML metrics exclude socket IO time from request duration (info)
Request duration histogram bucket misconfiguration causes inaccurate latency tracking (warning)
External queue increases Service latency due to extra I/O operations (info)
Increased latency from excessive max_latency_ms allowing oversized batches (warning)
Aggregated request-level metrics mask micro-stalls in token generation (warning)
Generation stalls cause multi-second pauses in LLM token generation violating SLAs (critical)
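The last two insights note that an aggregated request-level duration histogram only sees the end-to-end total, so a multi-second pause between individual tokens can hide inside an otherwise acceptable request duration. A minimal sketch of detecting such stalls from per-token timestamps (the timestamps and the 1 s stall threshold are illustrative assumptions):

```python
def detect_stalls(token_times: list[float], stall_threshold_s: float) -> list[float]:
    """Return the inter-token gaps (Time-Between-Tokens) that exceed the
    threshold. A request-level metric like request.duration aggregates
    these gaps away, so stalls must be measured per token."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return [g for g in gaps if g > stall_threshold_s]

# Tokens emitted at these wall-clock offsets (seconds): one 2.5 s stall mid-stream.
times = [0.0, 0.1, 0.2, 2.7, 2.8, 2.9]
print(detect_stalls(times, stall_threshold_s=1.0))  # [2.5]
```

Here the total duration is 2.9 s, which could land in a perfectly ordinary histogram bucket, yet a single 2.5 s decode-phase stall violates a per-token SLA.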