BentoML Metric

http.server.duration

HTTP request duration
Dimensions: None
Available on: Prometheus (1), OpenTelemetry (1), Datadog (1)
Interface Metrics (3)
Prometheus
Histogram of service request duration in seconds
Dimensions: None
OpenTelemetry
Duration of HTTP requests handled by BentoML service
Dimensions: None
Datadog
Duration of HTTP requests to BentoML service endpoints in seconds
Dimensions: None

Technical Annotations (37)

Configuration Parameters (8)
mb_max_latency (recommended: 720000)
Controls the maximum batching latency tolerance, not the worker timeout; the example value of 720000 ms suits predictions taking roughly 10 minutes.
timeout (recommended: a value that exceeds prediction duration)
Gunicorn worker timeout, set via the --timeout flag; it must exceed the prediction time, and the 60 s default is insufficient for long-running predictions.
traffic.timeout (recommended: 120)
Increase from the default of 60 seconds for long-running inference tasks.
traffic.max_concurrency (recommended: set below thread pool capacity)
Limits concurrent requests per worker to prevent thread-pool exhaustion.
max_latency_ms
Upper limit in milliseconds on end-to-end batch processing delay.
traffic.external_queue (recommended: true)
Enables request buffering, but adds latency overhead.
traffic.concurrency (required when external_queue is enabled)
Must be specified whenever external_queue is true.
runner.max_latency (recommended: 60000)
Defaults to 60000 ms (60 seconds); increase it if you see 503 errors.
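Several of the parameters above live in BentoML's bentoml_configuration.yaml. A minimal sketch using the recommended values follows; the key layout is assumed from BentoML 1.x configuration conventions, so verify it against your installed version:

```yaml
# bentoml_configuration.yaml -- illustrative values from the recommendations above
api_server:
  traffic:
    timeout: 120          # default is 60 s; raise for long-running inference
runners:
  batching:
    enabled: true
    max_batch_size: 32    # illustrative batch size, not a recommendation
    max_latency_ms: 60000 # batching delay cap; raise if 503 errors appear
```

Note that the Gunicorn worker timeout is a separate knob (the --timeout CLI flag) and is not set in this file.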
Error Signatures (10)
WORKER TIMEOUT (log pattern)
Server disconnected (exception)
503 (HTTP status)
aiohttp.client_exceptions.ServerDisconnectedError (exception)
bentoml.exceptions.RemoteException (exception)
CRITICAL] WORKER TIMEOUT (log pattern)
Worker exiting (log pattern)
502 (HTTP status)
504 (HTTP status)
BentoML has detected that a service has a max latency that is likely too low for serving (log pattern)
CLI Commands (1)
bentoml serve-gunicorn --timeout <seconds> (remediation)
Technical References (18)
mb_max_latency (component), gunicorn worker timeout (concept), prediction latency (concept), resource exhaustion (concept), API server timeout config (component), middleware (component), @bentoml.service (component), traffic (component), anyio.to_thread.run_sync (component), capacity limiter (component), Gunicorn worker (component), worker timeout (concept), @bentoml.api decorator (component), adaptive batching (concept), external request queue (component), CORK algorithm (concept), CorkDispatcher (component), dispatcher.py (file path)
Related Insights (12)
Gunicorn worker timeout causes 503 errors despite a high configured mb_max_latency (critical)
Infrastructure anomalies correlate with prediction service degradation (critical)
API server requests time out without server-side configuration (warning)
Timeout expiration causes request failures on long-running inference tasks (warning)
Thread pool exhaustion prevents synchronous API method execution (critical)
Request timeout configuration prevents long-running inference (warning)
Worker timeout and crash loop under concurrent request load (critical)
BentoML metrics exclude socket I/O time from request duration (info)
API server request backlog indicates an upstream bottleneck (warning)
HTTP 503 errors when adaptive batching exceeds max_latency_ms (warning)
External queue increases Service latency due to extra I/O operations (info)
HTTP 503 errors when max_latency_ms is too low for model processing time (critical)
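Several of these insights reduce to one ordering constraint across layers: the expected model latency must fit inside the runner's max_latency, which must fit inside traffic.timeout, which in turn must fit inside the Gunicorn worker timeout. A hypothetical sanity check, not a BentoML API:

```python
def timeout_budget_ok(model_latency_ms: float,
                      runner_max_latency_ms: float,
                      traffic_timeout_s: float,
                      worker_timeout_s: float) -> bool:
    """Hypothetical check: each layer's budget must exceed the layer below,
    otherwise the lower layer's work is cut off (503s, WORKER TIMEOUT)."""
    return (model_latency_ms
            < runner_max_latency_ms
            < traffic_timeout_s * 1000
            < worker_timeout_s * 1000)
```

For example, a 90 s prediction against the 60000 ms runner default violates the first inequality, which matches the "max_latency_ms is too low for model processing time" insight above.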