Aggregated request-level metrics mask micro-stalls in token generation

warning | performance
Updated: Jan 20, 2026

Technologies: BentoML

How to detect:

Averaging end-to-end latency at the request level obscures sporadic stalls at intermediate token-generation steps, so anomaly detection reports false negatives while user-perceived streaming quality degrades.
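
To see why the request-level view stays quiet, consider a hypothetical trace (illustrative numbers only, not from the source): a single 2 s stall inside a 500-token stream adds only a couple of seconds to end-to-end latency, yet is a massive outlier in time-between-tokens (TBT).

```python
# Hypothetical, illustrative numbers: one 2 s stall in a 500-token stream.
import numpy as np

tbt_ms = np.full(500, 30.0)   # steady 30 ms time-between-tokens (TBT)
tbt_ms[250] = 2000.0          # a single mid-generation micro-stall

print(f"request e2e latency : {tbt_ms.sum() / 1000:.1f} s (vs 15.0 s baseline)")
print(f"median TBT          : {np.median(tbt_ms):.0f} ms")
print(f"max TBT             : {tbt_ms.max():.0f} ms")
print(f"tokens over 100 ms  : {int((tbt_ms > 100).sum())}")
```

A roughly 2 s jump on a ~15 s request rarely trips a request-level p99 alert, but a 2000 ms TBT sample against a 30 ms median is unmistakable at token granularity.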

Recommended action:

Instrument monitoring at batch/token granularity rather than at the request level. Track per-iteration latency so stalls inside long-running inference requests are visible, and switch alerting from request-level p50/p99 to token-level time-between-tokens (TBT), as sketched below.
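
A minimal sketch of per-iteration TBT tracking, assuming a Python streaming token generator and the prometheus_client library; the metric name, bucket boundaries, and the `instrumented_stream` wrapper are illustrative assumptions, not a BentoML API.

```python
# Sketch only: wrap a streaming token generator and record per-iteration
# (time-between-tokens) latency into a Prometheus histogram.
import time
from typing import Iterable, Iterator

from prometheus_client import Histogram

# Buckets sized to catch micro-stalls (tens of ms) up to multi-second hangs.
TBT_SECONDS = Histogram(
    "llm_time_between_tokens_seconds",  # illustrative metric name
    "Latency between consecutive generated tokens within one request",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def instrumented_stream(token_stream: Iterable[str]) -> Iterator[str]:
    """Yield tokens unchanged while recording per-iteration (TBT) latency."""
    last = time.monotonic()
    for token in token_stream:
        now = time.monotonic()
        TBT_SECONDS.observe(now - last)  # one sample per generation step
        last = now
        yield token
```

Alerting on the histogram's upper buckets (or on a per-request max-TBT value) surfaces micro-stalls that request-level p50/p99 aggregates absorb.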