bentoml.system.cpu.usage
CPU utilization percentage
Dimensions: None
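As a rough illustration of what a CPU utilization percentage expresses, the sketch below compares the CPU time a process consumed with the wall-clock time that elapsed. It is only an approximation under stated assumptions: it is per-process rather than system-wide, the helper name `cpu_percent_during` is hypothetical, and it is not how `bentoml.system.cpu.usage` itself is collected.

```python
import time
from typing import Callable


def cpu_percent_during(work: Callable[[], None]) -> float:
    """Run `work` and return this process's CPU usage while it ran.

    The result is CPU seconds divided by wall-clock seconds, times 100.
    It is relative to one core, so a multi-threaded process running
    native code can report more than 100.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    if wall_used <= 0:
        return 0.0
    return 100.0 * cpu_used / wall_used


def busy() -> None:
    # CPU-bound stand-in for preprocessing or inference.
    sum(i * i for i in range(2_000_000))


def idle() -> None:
    # Waiting (for example on I/O) consumes almost no CPU time.
    time.sleep(0.2)


busy_pct = cpu_percent_during(busy)
idle_pct = cpu_percent_during(idle)
print(f"busy: {busy_pct:.0f}%  idle: {idle_pct:.0f}%")
```

A CPU-bound workload should report a much higher value than one that mostly waits, which is the distinction this metric is meant to surface.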
Interface Metrics (2)
Dimensions: None
Dimensions: None
Technical Annotations (31)
Configuration Parameters (6)
api-workers [recommended: varies; test 0, 1, 2, up to 8]
max_batch_size [recommended: varies based on model and workload]
traffic.max_concurrency [recommended: 50]
max_latency_ms
traffic.concurrency [recommended: slightly below maximum tested concurrent requests]
workers [recommended: cpu_count]
Error Signatures (2)
EOF [exception]
0 [http status]
CLI Commands (1)
k6 run --http-debug="full" [diagnostic]
Technical References (22)
prediction latency [concept]
resource exhaustion [concept]
Runner [component]
psutil [component]
pynvml [component]
/metrics [file path]
NVIDIA NVML [component]
GPU utilization [concept]
scale-to-zero [concept]
OnnxRuntime [component]
runner singleton [concept]
max_concurrency [component]
traffic [component]
http-server [component]
per service limiter [component]
API Server [component]
ML Model process [component]
@bentoml.service decorator [component]
Locust [component]
observer effect [concept]
shared memory [component]
Global Interpreter Lock (GIL) [concept]
Related Insights (14)
Weak hardware may contribute to timeout issues on long-running predictions [warning]
Infrastructure anomalies correlate with prediction service degradation [critical]
Default BentoML metrics lack system resource visibility for ML workloads [warning]
GPU over-provisioning drives up infrastructure costs [warning]
GPU underutilization in production mode with multiprocessing [warning]
CPU-intensive preprocessing in runner saturates single core [warning]
Unconfigured max_concurrency allows unbounded request processing causing resource exhaustion [warning]
Unbounded thread allocation per service causes resource contention [warning]
EOF errors and status code 0 under high concurrent load indicate connection drops [warning]
Suboptimal batch sizes reduce throughput efficiency [warning]
Service autoscaling fails when concurrency is not configured [warning]
Observer effect from indiscriminate tracing competes for resources masking true bottlenecks [warning]
GIL prevents multi-threaded Python from utilizing multi-core CPUs [warning]
Multi-tenancy resource contention causes non-reproducible performance degradation [warning]
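
The insight about unconfigured max_concurrency can be illustrated with a small, framework-free sketch: a semaphore caps how many requests are processed at once, so a burst cannot grow without bound. This is an assumption-laden toy, not BentoML's limiter. The names `run_with_limit` and `handle` are hypothetical, the limit of 50 simply mirrors the recommended value listed above, and excess requests are queued here, whereas a real server may reject them instead.

```python
import asyncio


async def run_with_limit(n_requests: int, max_concurrency: int) -> int:
    """Process `n_requests` fake requests; return the peak number in flight."""
    limiter = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def handle(request_id: int) -> None:
        nonlocal in_flight, peak
        # Requests beyond the limit wait here instead of competing
        # for CPU and memory with the ones already running.
        async with limiter:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for model inference
            in_flight -= 1

    await asyncio.gather(*(handle(i) for i in range(n_requests)))
    return peak


peak_in_flight = asyncio.run(run_with_limit(n_requests=200, max_concurrency=50))
print(f"peak concurrent requests: {peak_in_flight}")
```

With the limiter in place, the peak stays at or below the configured cap regardless of how large the burst is; without it, all 200 requests would be in flight at once.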