BentoML Metric

bentoml.system.cpu.usage

CPU utilization percentage
Dimensions: None
Available on: Prometheus (1), Datadog (1)
Interface Metrics (2)
Prometheus
CPU utilization percentage of the API server process
Dimensions: None
Datadog
CPU utilization percentage of the BentoML service process
Dimensions: None
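Both backends report per-process CPU utilization; BentoML gathers it via psutil (listed under Technical References below). As a minimal stdlib sketch of how such a gauge can be derived, assuming the usual definition of process CPU time divided by elapsed wall time:

```python
import time


class CpuSampler:
    """Approximate per-process CPU utilization between successive samples.

    A stdlib sketch of the kind of gauge psutil's Process.cpu_percent()
    provides: CPU time consumed by this process divided by elapsed wall
    time, as a percentage. Values can exceed 100% when the process uses
    more than one core.
    """

    def __init__(self) -> None:
        self._cpu = time.process_time()
        self._wall = time.perf_counter()

    def sample(self) -> float:
        cpu, wall = time.process_time(), time.perf_counter()
        elapsed = wall - self._wall
        pct = 100.0 * (cpu - self._cpu) / elapsed if elapsed > 0 else 0.0
        self._cpu, self._wall = cpu, wall
        return pct
```

A metrics exporter would call `sample()` once per scrape interval and publish the result on its `/metrics` endpoint.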

Technical Annotations (31)

Configuration Parameters (6)
api-workers (recommended: varies; test 0, 1, 2, up to 8)
Controls the number of API server processes; impacts CPU and GPU utilization
max_batch_size (recommended: varies based on model and workload)
Passed to to_runner; affects batch processing efficiency
traffic.max_concurrency (recommended: 50)
Hard limit on simultaneous requests to prevent Service overload
max_latency_ms
Influences how aggressively batching accumulates requests before processing
traffic.concurrency (recommended: slightly below the maximum tested concurrent requests)
Controls the autoscaling threshold and prevents CPU-only scaling
workers (recommended: cpu_count)
Matches the number of workers to available CPU cores for optimal CPU-bound workload performance
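Several of these parameters live in BentoML's runtime configuration file rather than in code. A sketch, assuming the BentoML 1.x YAML schema (key names and nesting should be verified against the deployed version); the values are illustrative, not recommendations:

```yaml
# bentoml_configuration.yaml -- illustrative sketch, not a tested config
api_server:
  workers: 4              # api-workers: one API server process per CPU core
  traffic:
    max_concurrency: 50   # hard cap on in-flight requests (load shedding)
runners:
  batching:
    enabled: true
    max_batch_size: 32    # tune per model and workload
    max_latency_ms: 10    # how long batching waits to accumulate requests
```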
Error Signatures (2)
EOF (exception)
0 (HTTP status)
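Load-testing tools such as k6 report HTTP status 0 together with an EOF error when the server closes the TCP connection before sending any response, which is what an overloaded API server does when it drops connections. A minimal stdlib sketch (the abrupt server here is a hypothetical stand-in, not BentoML itself) showing the client-side symptom:

```python
import socket
import threading


def abrupt_server(listener: socket.socket) -> None:
    """Accept one connection, read the request, then close without
    replying, mimicking a server dropping connections under load."""
    conn, _ = listener.accept()
    conn.recv(1024)   # consume the request
    conn.close()      # FIN with no response bytes: the client sees EOF


listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=abrupt_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET /predict HTTP/1.1\r\nHost: localhost\r\n\r\n")
response = client.recv(4096)  # b"" means EOF: no status line, hence "status 0"
```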
CLI Commands (1)
k6 run --http-debug="full" (diagnostic)
Technical References (22)
prediction latency (concept)
resource exhaustion (concept)
Runner (component)
psutil (component)
pynvml (component)
/metrics (file path)
NVIDIA NVML (component)
GPU utilization (concept)
scale-to-zero (concept)
OnnxRuntime (component)
runner singleton (concept)
max_concurrency (component)
traffic (component)
http-server (component)
per service limiter (component)
API Server (component)
ML Model process (component)
@bentoml.service decorator (component)
Locust (component)
observer effect (concept)
shared memory (component)
Global Interpreter Lock (GIL) (concept)
Related Insights (14)
Weak hardware may contribute to timeout issues on long-running predictions (warning)
Infrastructure anomalies correlate with prediction service degradation (critical)
Default BentoML metrics lack system resource visibility for ML workloads (warning)
GPU over-provisioning drives up infrastructure costs (warning)
GPU underutilization in production mode with multiprocessing (warning)
CPU-intensive preprocessing in a runner saturates a single core (warning)
Unconfigured max_concurrency allows unbounded request processing, causing resource exhaustion (warning)
Unbounded thread allocation per service causes resource contention (warning)
EOF errors and status code 0 under high concurrent load indicate connection drops (warning)
Suboptimal batch sizes reduce throughput efficiency (warning)
Service autoscaling fails when concurrency is not configured (warning)
Observer effect from indiscriminate tracing competes for resources, masking true bottlenecks (warning)
The GIL prevents multi-threaded Python from utilizing multi-core CPUs (warning)
Multi-tenancy resource contention causes non-reproducible performance degradation (warning)
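Several of these insights trace back to unbounded concurrency. A minimal asyncio sketch (a hypothetical handler, not BentoML's implementation) of the hard-limit behavior that `traffic.max_concurrency` provides: requests beyond the cap are rejected immediately instead of queued, so a load spike degrades into fast 503s rather than resource exhaustion:

```python
import asyncio

MAX_CONCURRENCY = 50  # mirrors the recommended traffic.max_concurrency value

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)


async def handle(request_id: int) -> int:
    """Serve one request, returning an HTTP-style status code."""
    if semaphore.locked():          # every slot taken: shed load at once
        return 503
    async with semaphore:           # occupy one slot for the inference
        await asyncio.sleep(0.01)   # stand-in for model inference work
        return 200


async def spike(n: int) -> list[int]:
    """Fire n simultaneous requests, as a load generator like k6 would."""
    return await asyncio.gather(*(handle(i) for i in range(n)))


statuses = asyncio.run(spike(200))
```

With 200 simultaneous requests against 50 slots, 50 requests are admitted and 150 are shed with 503s; without the semaphore check, all 200 would compete for CPU and memory at once.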