BentoML

Request timeout configuration prevents long-running inference

Category: configuration warning
Updated: Mar 7, 2026
How to detect:

BentoML's TimeoutMiddleware enforces the traffic.timeout limit on every request. Long-running model inference or batch processing that exceeds this threshold is terminated with a 503 Service Unavailable error, even when the underlying operation is healthy and still making progress.
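As a minimal sketch of raising the limit (assuming the BentoML 1.2+ Python service API; the service name, endpoint, and 300-second value are illustrative, not defaults):

```python
import bentoml

# traffic.timeout is in seconds; requests running past it receive a 503.
# 300 is an illustrative value -- size it to your observed tail latency.
@bentoml.service(traffic={"timeout": 300})
class SlowModel:
    @bentoml.api
    def predict(self, text: str) -> str:
        # Placeholder for long-running inference.
        return text.upper()
```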

Recommended action:

Set traffic.timeout to accommodate your longest expected inference time. Monitor the request-duration distribution to identify p99 latency, and choose a timeout with headroom above it. For operations with highly variable duration, consider streaming responses or splitting the work into async jobs so each HTTP request completes quickly.
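One way to ground the timeout choice is to compute p99 from observed request durations. A minimal stdlib sketch using the nearest-rank percentile method (the sample durations and 20% headroom factor are illustrative assumptions):

```python
import math

def percentile(durations, pct):
    """Nearest-rank percentile of request durations (seconds)."""
    if not durations:
        raise ValueError("no samples")
    ordered = sorted(durations)
    # Nearest-rank method: rank = ceil(pct/100 * n), 1-indexed.
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# Illustrative samples: most requests are fast, one is a slow outlier.
samples = [0.8, 1.1, 0.9, 1.3, 2.0, 0.7, 1.0, 45.0, 1.2, 0.9]
p99 = percentile(samples, 99)          # 45.0 here
# Add ~20% headroom when choosing traffic.timeout.
suggested_timeout = math.ceil(p99 * 1.2)
```

In practice, feed this from your metrics backend rather than hard-coded samples; the point is that the timeout should be derived from the tail of the distribution, not the mean.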