BentoML Metric

bentoml.system.cpu.usage

CPU utilization percentage
Dimensions: None
Available on: Prometheus (1), Datadog (1)
Interface Metrics (2)
Prometheus
CPU utilization percentage of the API server process
Dimensions: None
Datadog
CPU utilization percentage of the BentoML service process
Dimensions: None
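Both backends report per-process CPU utilization; BentoML gathers it via psutil (listed under Technical References below). As a minimal stdlib sketch of how such a gauge can be derived, assuming the usual definition of process CPU time divided by elapsed wall time:

```python
import time


class CpuSampler:
    """Approximate per-process CPU utilization between successive samples.

    A stdlib sketch of the kind of gauge psutil's Process.cpu_percent()
    provides: CPU time consumed by this process divided by elapsed wall
    time, as a percentage. Values can exceed 100% when the process uses
    more than one core.
    """

    def __init__(self) -> None:
        self._cpu = time.process_time()
        self._wall = time.perf_counter()

    def sample(self) -> float:
        cpu, wall = time.process_time(), time.perf_counter()
        elapsed = wall - self._wall
        pct = 100.0 * (cpu - self._cpu) / elapsed if elapsed > 0 else 0.0
        self._cpu, self._wall = cpu, wall
        return pct
```

A metrics exporter would call `sample()` once per scrape interval and publish the result on its `/metrics` endpoint.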

Technical Annotations (31)

Configuration Parameters (6)
api-workers (recommended: varies; test 0, 1, 2, up to 8)
Controls the number of API server processes; impacts CPU and GPU utilization
max_batch_size (recommended: varies based on model and workload)
Passed to to_runner; affects batch processing efficiency
traffic.max_concurrency (recommended: 50)
Hard limit on simultaneous requests to prevent Service overload
max_latency_ms
Influences how aggressively batching accumulates requests before processing
traffic.concurrency (recommended: slightly below the maximum tested concurrent requests)
Controls the autoscaling threshold and prevents CPU-only scaling
workers (recommended: cpu_count)
Matches the number of workers to available CPU cores for optimal CPU-bound workload performance
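Several of these parameters live in BentoML's runtime configuration file rather than in code. A sketch, assuming the BentoML 1.x YAML schema (key names and nesting should be verified against the deployed version); the values are illustrative, not recommendations:

```yaml
# bentoml_configuration.yaml -- illustrative sketch, not a tested config
api_server:
  workers: 4              # api-workers: one API server process per CPU core
  traffic:
    max_concurrency: 50   # hard cap on in-flight requests (load shedding)
runners:
  batching:
    enabled: true
    max_batch_size: 32    # tune per model and workload
    max_latency_ms: 10    # how long batching waits to accumulate requests
```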
Error Signatures (2)
EOF (exception)
0 (HTTP status)
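Load-testing tools such as k6 report HTTP status 0 together with an EOF error when the server closes the TCP connection before sending any response, which is what an overloaded API server does when it drops connections. A minimal stdlib sketch (the abrupt server here is a hypothetical stand-in, not BentoML itself) showing the client-side symptom:

```python
import socket
import threading


def abrupt_server(listener: socket.socket) -> None:
    """Accept one connection, read the request, then close without
    replying, mimicking a server dropping connections under load."""
    conn, _ = listener.accept()
    conn.recv(1024)   # consume the request
    conn.close()      # FIN with no response bytes: the client sees EOF


listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=abrupt_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"GET /predict HTTP/1.1\r\nHost: localhost\r\n\r\n")
response = client.recv(4096)  # b"" means EOF: no status line, hence "status 0"
```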
CLI Commands (1)
k6 run --http-debug="full" (diagnostic)
Technical References (22)
prediction latency (concept)
resource exhaustion (concept)
Runner (component)
psutil (component)
pynvml (component)
/metrics (file path)
NVIDIA NVML (component)
GPU utilization (concept)
scale-to-zero (concept)
OnnxRuntime (component)
runner singleton (concept)
max_concurrency (component)
traffic (component)
http-server (component)
per service limiter (component)
API Server (component)
ML Model process (component)
@bentoml.service decorator (component)
Locust (component)
observer effect (concept)
shared memory (component)
Global Interpreter Lock (GIL) (concept)
Related Insights (14)
Weak hardware may contribute to timeout issues on long-running predictions (warning)
Infrastructure anomalies correlate with prediction service degradation (critical)
Default BentoML metrics lack system resource visibility for ML workloads (warning)
GPU over-provisioning drives up infrastructure costs (warning)
GPU underutilization in production mode with multiprocessing (warning)
CPU-intensive preprocessing in a runner saturates a single core (warning)
Unconfigured max_concurrency allows unbounded request processing, causing resource exhaustion (warning)
Unbounded thread allocation per service causes resource contention (warning)
EOF errors and status code 0 under high concurrent load indicate connection drops (warning)
Suboptimal batch sizes reduce throughput efficiency (warning)
Service autoscaling fails when concurrency is not configured (warning)
Observer effect from indiscriminate tracing competes for resources, masking true bottlenecks (warning)
The GIL prevents multi-threaded Python from utilizing multi-core CPUs (warning)
Multi-tenancy resource contention causes non-reproducible performance degradation (warning)
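Several of these insights trace back to unbounded concurrency. A minimal asyncio sketch (a hypothetical handler, not BentoML's implementation) of the hard-limit behavior that `traffic.max_concurrency` provides: requests beyond the cap are rejected immediately instead of queued, so a load spike degrades into fast 503s rather than resource exhaustion:

```python
import asyncio

MAX_CONCURRENCY = 50  # mirrors the recommended traffic.max_concurrency value

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)


async def handle(request_id: int) -> int:
    """Serve one request, returning an HTTP-style status code."""
    if semaphore.locked():          # every slot taken: shed load at once
        return 503
    async with semaphore:           # occupy one slot for the inference
        await asyncio.sleep(0.01)   # stand-in for model inference work
        return 200


async def spike(n: int) -> list[int]:
    """Fire n simultaneous requests, as a load generator like k6 would."""
    return await asyncio.gather(*(handle(i) for i in range(n)))


statuses = asyncio.run(spike(200))
```

With 200 simultaneous requests against 50 slots, 50 requests are admitted and 150 are shed with 503s; without the semaphore check, all 200 would compete for CPU and memory at once.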