bentoml.runner.request.duration
Runner inference duration
Dimensions: None
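bentoml.runner.request.duration is a latency histogram. As a minimal, library-free sketch of what a histogram metric does with each observation, the class below counts durations into upper-bound buckets the way Prometheus-style histograms do. The bucket bounds reuse the recommended duration buckets listed under Configuration Parameters; `DurationHistogram` is a hypothetical illustration, not a BentoML class.

```python
import bisect

class DurationHistogram:
    """Toy duration histogram: counts each observation into the first
    bucket whose upper bound is >= the observed value, plus a +Inf
    overflow slot, the way Prometheus-style histograms do."""

    def __init__(self, buckets):
        self.bounds = sorted(buckets)           # bucket upper bounds, seconds
        self.counts = [0] * (len(buckets) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, seconds):
        # bisect_left finds the first bound >= seconds; values above the
        # largest bound land in the trailing +Inf slot.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

hist = DurationHistogram([0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
for d in (0.003, 0.02, 0.3, 12.0):  # 12.0 overflows into the +Inf bucket
    hist.observe(d)
```

Observations slower than the largest bound are still counted (in the +Inf slot), which is why bucket misconfiguration, as flagged in the insights below, truncates the visible latency distribution rather than losing requests outright.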
Interface Metrics (3)
Sources
Technical Annotations (45)
Configuration Parameters (16)
metrics.duration.buckets (recommended: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])
metrics.duration.min
metrics.duration.max
metrics.duration.factor
monitoring.enabled (recommended: true)
monitoring.type (recommended: default)
monitoring.options.log_path (recommended: path/to/log/file)
api_server.timeout (recommended: 60)
runners.timeout (recommended: 300)
max-latency (recommended: 10s)
timeout (recommended: 1.5 * max-latency)
api_server.metrics.duration.min (recommended: 0.1)
api_server.metrics.duration.max (recommended: 5.0)
api_server.metrics.duration.factor (recommended: 2.0)
max_batch_size
max_latency_ms (recommended: 600)
Error Signatures (2)
ServiceUnavailable (exception)
raise ServiceUnavailable(body.decode()) (exception)
CLI Commands (5)
bentoml serve my_model:latest --production (diagnostic)
bentoml serve my_model:latest --reload (diagnostic)
docker run -p 5000:5000 YOUR_IMAGE_TAG bentoml serve $BENTO_PATH (remediation)
bentoml serve --max-latency (diagnostic)
docker run -e BENTOML_CONFIG_OPTIONS='runners.timeout=3600' -it --rm -p 3000:3000 your_service serve --production (remediation)
Technical References (22)
__call__ (component)
Runner (component)
monitoring API (component)
bentoml.monitor (component)
configuration.yml (file path)
adaptive batching (component)
runner process (component)
API server process (component)
scappy_runner (component)
configuration.yaml (file path)
request_duration_seconds (component)
Histogram (concept)
gRPC deadline (protocol)
BentoServer (component)
runners (component)
histogram buckets (concept)
batch engine (component)
RunnerApp (component)
micro-batching (concept)
async_run_method (component)
runner_handle (component)
KV-cache hit rate (concept)
Related Insights (13)
Custom metrics gathering adds 5ms latency overhead to inference requests (info)
Histogram bucket configuration impacts cardinality and accuracy (info)
ML model performance degradation due to unmonitored drift (critical)
Production flag causes high inter-process communication overhead in model serving (warning)
Production mode introduces 10ms overhead per runner call (warning)
API server timeout configuration fails to terminate long-running requests (warning)
Histogram bucket misconfiguration causes incomplete latency distribution (warning)
API server timeout configuration prevents request timeouts (warning)
Request duration histogram bucket misconfiguration causes inaccurate latency tracking (warning)
Batch splitting behavior may fragment requests across multiple execution cycles (info)
Insufficient visibility into adaptive batching decisions impacts troubleshooting (warning)
Batching configuration causes ServiceUnavailable errors under concurrent load (critical)
Static alerting thresholds fail for variable-length LLM requests causing false positives and negatives (warning)
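Three of the configuration parameters above (api_server.metrics.duration.min, .max, and .factor, recommended 0.1 / 5.0 / 2.0) describe duration buckets by geometric expansion, and several insights flag bucket misconfiguration as a cause of inaccurate latency tracking. A minimal sketch of how min/max/factor-style settings might expand into concrete bucket upper bounds follows; `exponential_buckets` is a hypothetical helper for illustration, not a BentoML API.

```python
def exponential_buckets(lo, hi, factor):
    """Expand (min, max, factor) into histogram bucket upper bounds:
    start at `lo`, multiply by `factor` until `hi` is reached, and
    always include `hi` itself as the final bound."""
    if lo <= 0 or factor <= 1:
        raise ValueError("min must be > 0 and factor must be > 1")
    bounds = []
    bound = lo
    while bound < hi:
        bounds.append(bound)
        bound *= factor
    bounds.append(hi)
    return bounds

# Recommended settings from the parameter list above:
print(exponential_buckets(0.1, 5.0, 2.0))
# → [0.1, 0.2, 0.4, 0.8, 1.6, 3.2, 5.0]
```

The trade-off the insights point at: a small factor or wide min/max range yields many buckets (higher metric cardinality, finer quantile accuracy), while a max below real tail latencies pushes slow requests into the +Inf bucket and truncates the visible distribution.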