BentoML Metric

bentoml.runner.request.duration

Runner inference duration
Dimensions: None
Available on: Prometheus (1), OpenTelemetry (1), Datadog (1)
Interface Metrics (3)
Prometheus
Histogram of runner request duration in seconds
Dimensions: None
OpenTelemetry
Duration of runner inference requests in seconds
Dimensions: None
Datadog
Duration of model inference execution in seconds
Dimensions: None

Technical Annotations (45)

Configuration Parameters (16)
metrics.duration.buckets (recommended: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
Explicit histogram buckets for duration metrics
metrics.duration.min
Minimum bucket for exponential distribution
metrics.duration.max
Maximum bucket for exponential distribution
metrics.duration.factor
Growth factor for exponential buckets
monitoring.enabled (recommended: true)
Enables BentoML monitoring capabilities
monitoring.type (recommended: default)
Specifies monitoring backend type
monitoring.options.log_path (recommended: path/to/log/file)
Destination for monitoring data logs
api_server.timeout (recommended: 60)
Default timeout for API server requests, in seconds; reportedly not enforced
runners.timeout (recommended: 300)
Default timeout for runner execution, in seconds; reportedly not enforced
max-latency (recommended: 10s)
Default maximum latency target for the API server
timeout (recommended: 1.5 * max-latency)
Should be about 1.5x max-latency, and always greater than max-latency
api_server.metrics.duration.min (recommended: 0.1)
Minimum expected request duration in seconds for histogram tracking
api_server.metrics.duration.max (recommended: 5.0)
Maximum expected request duration in seconds for histogram tracking
api_server.metrics.duration.factor (recommended: 2.0)
Exponential factor controlling bucket granularity; smaller values create more buckets
max_batch_size
Controls when batch engine splits requests; tune based on typical request sizes
max_latency_ms (recommended: 600)
Shorter latency window prevents request timeout accumulation
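The min, max, and factor parameters above describe an exponential bucket layout. The sketch below shows one plausible way such bounds could be derived, assuming they start at min, grow by factor, and are capped at max. The function name `exponential_buckets` is hypothetical, and BentoML's own generation logic may differ.

```python
import math


def exponential_buckets(min_s: float, max_s: float, factor: float) -> list[float]:
    """Return ascending bucket upper bounds (illustrative scheme, not BentoML's code)."""
    if min_s <= 0 or max_s <= min_s or factor <= 1:
        raise ValueError("need 0 < min_s < max_s and factor > 1")
    bounds = []
    bound = min_s
    # Grow geometrically until the next bound would reach or pass max_s.
    while bound < max_s:
        bounds.append(bound)
        bound *= factor
    bounds.append(max_s)     # cap the last finite bucket at the configured maximum
    bounds.append(math.inf)  # catch-all bucket for slower requests
    return bounds


# Values taken from the recommended api_server.metrics.duration.* settings above.
buckets = exponential_buckets(0.1, 5.0, 2.0)
print(buckets)
```

Under this assumed scheme, lowering the factor (for example to 1.5) produces more, narrower buckets over the same range, which matches the note that smaller values create more buckets.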
Error Signatures (2)
ServiceUnavailable (exception)
raise ServiceUnavailable(body.decode()) (exception)
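A caller that sees the errors above may want to retry rather than fail immediately. The following is a minimal client-side sketch, assuming the error is transient. The `ServiceUnavailable` class here is a local stand-in rather than an import from BentoML, and `call_with_retry` and `flaky` are hypothetical names. Retrying does not address the underlying batching or capacity configuration.

```python
import time


class ServiceUnavailable(Exception):
    """Local stand-in for the exception named above (assumption, not BentoML's class)."""


def call_with_retry(fn, attempts=3, base_delay_s=0.1, sleep=time.sleep):
    """Call fn(); on ServiceUnavailable, back off exponentially and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except ServiceUnavailable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay_s * 2 ** attempt)


# Hypothetical caller that fails twice with a simulated overload, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ServiceUnavailable("simulated overload")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; in practice the default `time.sleep` would be used.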
CLI Commands (5)
bentoml serve my_model:latest --production (diagnostic)
bentoml serve my_model:latest --reload (diagnostic)
docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH (remediation)
bentoml serve --max-latency (diagnostic)
docker run -e BENTOML_CONFIG_OPTIONS='runners.timeout=3600' -it --rm -p 3000:3000 your_service serve --production (remediation)
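The commands above set a latency target and a runner timeout independently, while the configuration notes suggest keeping the timeout at roughly 1.5x max-latency. A small sketch of that arithmetic follows. The helper names and the accepted unit suffixes (`ms`, `s`, `m`) are assumptions for illustration, not BentoML's parsing rules.

```python
import re

# Assumed unit suffixes for this sketch only.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_duration_s(text: str) -> float:
    """Convert a string such as '10s' into seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|m)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def recommended_timeout_s(max_latency: str, ratio: float = 1.5) -> float:
    """Apply the 1.5x rule of thumb to a max-latency value."""
    return parse_duration_s(max_latency) * ratio


def timeout_is_consistent(timeout_s: float, max_latency: str) -> bool:
    """A timeout at or below max-latency would cut off requests still within target."""
    return timeout_s > parse_duration_s(max_latency)


suggested = recommended_timeout_s("10s")
print(suggested, timeout_is_consistent(300, "10s"), timeout_is_consistent(5, "10s"))
```

For the recommended max-latency of 10s this suggests a 15-second timeout; the recommended runners.timeout of 300 passes the consistency check, while a 5-second timeout would not.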
Technical References (22)
__call__ (component)
Runner (component)
monitoring API (component)
bentoml.monitor (component)
configuration.yml (file path)
adaptive batching (component)
runner process (component)
API server process (component)
scappy_runner (component)
configuration.yaml (file path)
request_duration_seconds (component)
Histogram (concept)
gRPC deadline (protocol)
BentoServer (component)
runners (component)
histogram buckets (concept)
batch engine (component)
RunnerApp (component)
micro-batching (concept)
async_run_method (component)
runner_handle (component)
KV-cache hit rate (concept)
Related Insights (13)
Custom metrics gathering adds 5ms latency overhead to inference requests (info)
Histogram bucket configuration impacts cardinality and accuracy (info)
ML model performance degradation due to unmonitored drift (critical)
Production flag causes high inter-process communication overhead in model serving (warning)
Production mode introduces 10ms overhead per runner call (warning)
API server timeout configuration fails to terminate long-running requests (warning)
Histogram bucket misconfiguration causes incomplete latency distribution (warning)
API server timeout configuration prevents request timeouts (warning)
Request duration histogram bucket misconfiguration causes inaccurate latency tracking (warning)
Batch splitting behavior may fragment requests across multiple execution cycles (info)
Insufficient visibility into adaptive batching decisions impacts troubleshooting (warning)
Batching configuration causes ServiceUnavailable errors under concurrent load (critical)
Static alerting thresholds fail for variable-length LLM requests causing false positives and negatives (warning)
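Several of the insights above concern bucket misconfiguration. The sketch below illustrates why it matters by estimating a quantile from cumulative bucket counts with linear interpolation, similar in spirit to Prometheus's `histogram_quantile`. The function name and the sample counts are made up for illustration.

```python
import math


def estimate_quantile(q, bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative histogram buckets.

    bounds: ascending upper bounds, with math.inf last.
    cumulative_counts: observations at or below each bound.
    """
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(bounds, cumulative_counts):
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # Beyond the highest finite bound, only that bound can be reported.
                return prev_bound
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# 100 hypothetical requests that each took about 2 seconds.
narrow = estimate_quantile(0.95, [0.1, 0.25, 0.5, math.inf], [0, 0, 0, 100])
wide = estimate_quantile(0.95, [0.5, 1.0, 2.5, 5.0, math.inf], [0, 0, 100, 100, 100])
print(narrow, wide)
```

With the narrow layout, every observation lands in the overflow bucket and the p95 estimate is pinned at 0.5 seconds. With the wider layout the estimate falls inside the 1.0 to 2.5 second bucket, much closer to the real latency.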