Technologies/BentoML/bentoml.api_server.request.in_progress
BentoML Metric

bentoml.api_server.request.in_progress

Active API server requests
Dimensions: None
Available on: OpenTelemetry (1)
Interface Metrics (1)
OpenTelemetry
Number of API server requests currently in progress
Dimensions: None
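The gauge semantics of an "in progress" request metric (incremented when a request enters the server, decremented when it completes) can be sketched in plain Python. The class and names below are illustrative, not BentoML internals.

```python
# Illustrative sketch of an in-progress request gauge; not BentoML's
# actual metric implementation.
import threading
from contextlib import contextmanager


class InProgressGauge:
    """Thread-safe counter of currently active requests."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    @contextmanager
    def track(self):
        # Increment on entry, decrement on exit (even if the handler raises).
        with self._lock:
            self._value += 1
        try:
            yield
        finally:
            with self._lock:
                self._value -= 1

    @property
    def value(self):
        with self._lock:
            return self._value


gauge = InProgressGauge()


def handle_request(gauge):
    with gauge.track():
        # Request work happens here; gauge.value reflects active requests.
        return gauge.value


print(handle_request(gauge))  # 1 while the request is in flight
print(gauge.value)            # 0 after it completes
```

The `finally` block matters: a handler that raises must still decrement the gauge, or the metric drifts upward and falsely signals a growing backlog.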

Technical Annotations (34)

Configuration Parameters (6)
workers (recommended: cpu_count)
Enable process-level parallelism for high-throughput or compute-intensive Services.
traffic.max_concurrency (recommended: 50)
Hard limit on simultaneous requests to prevent Service overload.
sleep (recommended: greater than 0.001)
K6 test script sleep interval between iterations; 0.001 s may be too aggressive.
threads
Parameter in the @bentoml.service decorator that sets the concurrency level for synchronous endpoints.
services.<service_name>.scaling.min_replicas (recommended: 0)
Enables scale-to-zero but introduces cold-start latency.
@bentoml.service.threads (recommended: N)
Set to enable concurrent requests from sync endpoints to batchable services.
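The traffic.max_concurrency behavior described above, a hard cap that sheds excess load rather than queuing it unboundedly, can be sketched with a stdlib semaphore. The cap of 2 and the handler are illustrative assumptions, not BentoML's implementation.

```python
# Sketch of a max_concurrency-style hard limit: requests beyond the cap are
# rejected with 503 instead of queuing. Illustrative only; the recommended
# production value on this page is 50, not 2.
import asyncio

MAX_CONCURRENCY = 2  # assumed small value so the rejection path is visible
sem = asyncio.Semaphore(MAX_CONCURRENCY)


async def handle(request_id: int) -> int:
    """Return an HTTP-like status: 200 if admitted, 503 if over the cap."""
    if sem.locked():               # all permits taken -> shed load
        return 503
    async with sem:
        await asyncio.sleep(0.05)  # simulated model inference
        return 200


async def main():
    # Fire 5 concurrent requests at a cap of 2; the overflow gets 503s.
    return await asyncio.gather(*(handle(i) for i in range(5)))


print(asyncio.run(main()))  # [200, 200, 503, 503, 503]
```

Returning 503 immediately keeps latency bounded for admitted requests; the alternative (queuing everything) is exactly the overload failure mode the insights below warn about.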
Error Signatures (7)
503 (HTTP status)
[CRITICAL] WORKER TIMEOUT (log pattern)
Worker exiting (log pattern)
502 (HTTP status)
504 (HTTP status)
ValueError: unexpected end of stream (exception)
Exception on /predict [POST] (log pattern)
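As a hedged illustration, the log-pattern signatures above could be matched against server log lines as follows. The SIGNATURES mapping and its cause labels are assumptions for monitoring purposes, not official BentoML diagnostics.

```python
# Hypothetical log-line classifier for the error signatures listed above.
import re

SIGNATURES = {
    r"\[CRITICAL\] WORKER TIMEOUT": "worker timeout",
    r"Worker exiting": "worker crash/restart",
    r"ValueError: unexpected end of stream": "truncated multipart upload",
    r"Exception on /predict \[POST\]": "unhandled exception in /predict",
}


def classify(line: str):
    """Return an assumed cause label for a log line, or None if no match."""
    for pattern, cause in SIGNATURES.items():
        if re.search(pattern, line):
            return cause
    return None


print(classify("[2024-01-01 12:00:00] [CRITICAL] WORKER TIMEOUT (pid:123)"))
# worker timeout
```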
CLI Commands (1)
curl -X GET https://<deployment-url>/readyz (remediation)
Technical References (20)
request queue (component)
concurrency-based autoscaling (concept)
workers (component)
@bentoml.service (component)
max_concurrency (component)
traffic (component)
anyio.to_thread.run_sync (component)
capacity limiter (component)
Gunicorn worker (component)
worker timeout (concept)
MultiFileInput adapter (component)
werkzeug.formparser (component)
/predict (component)
http-server (component)
per service limiter (component)
@bentoml.service decorator (component)
synchronous endpoint (concept)
/readyz endpoint (component)
scale-to-zero (concept)
@bentoml.api(batchable=True) (component)
Related Insights (14)
Request overload without queuing causes service instability (critical)
Synchronous API functions create throughput bottleneck in production (warning)
Single worker configuration causes request queuing and poor throughput (warning)
Unconfigured max_concurrency allows unbounded request processing, causing resource exhaustion (warning)
Thread pool exhaustion prevents synchronous API method execution (critical)
MaxConcurrencyMiddleware returns 503 under load (warning)
Worker timeout and crash loop under concurrent request load (critical)
Multipart form parsing failure under concurrent load (warning)
Unbounded thread allocation per service causes resource contention (warning)
Extremely low sleep interval in load tests may exhaust connection pool (info)
API server request backlog indicates upstream bottleneck (warning)
Synchronous endpoints limit batching throughput to one request at a time (warning)
Service unavailable during scale-from-zero without manual readiness probe (info)
Insufficient batching throughput when sync endpoints call batchable services with default concurrency (warning)
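Several insights above concern synchronous endpoints serializing requests and starving batchable services. A minimal stdlib sketch of the effect, assuming a blocking handler and an illustrative thread pool rather than BentoML's actual worker model:

```python
# Sketch: with one worker thread, blocking requests serialize; with N threads,
# up to N overlap, so a batchable downstream service can actually form batches.
# Illustrative only; not BentoML internals.
import time
from concurrent.futures import ThreadPoolExecutor


def sync_endpoint(_):
    time.sleep(0.1)  # simulated blocking inference call


def run(num_threads: int, num_requests: int = 4) -> float:
    """Serve num_requests through a pool of num_threads; return wall time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(sync_endpoint, range(num_requests)))
    return time.perf_counter() - start


serial = run(1)    # roughly 0.4 s: requests queue behind one thread
parallel = run(4)  # roughly 0.1 s: all four requests overlap
print(f"1 thread: {serial:.2f}s, 4 threads: {parallel:.2f}s")
```

This is the rationale behind the threads parameter recommendations above: without it, a sync endpoint delivers one request at a time to a service decorated with @bentoml.api(batchable=True), so batches never exceed size one.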