Aggregated request-level metrics mask micro-stalls in token generation

warning | performance
Updated: Jan 20, 2026

Technologies: BentoML

How to detect:

Averaging end-to-end latency at the request level obscures sporadic stalls at intermediate token-generation steps, so anomaly detection reports false negatives while user-perceived streaming quality degrades.
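
To see why the request-level view stays quiet, consider a hypothetical trace (illustrative numbers only, not from the source): a single 2 s stall inside a 500-token stream adds only a couple of seconds to end-to-end latency, yet is a massive outlier in time-between-tokens (TBT).

```python
# Hypothetical, illustrative numbers: one 2 s stall in a 500-token stream.
import numpy as np

tbt_ms = np.full(500, 30.0)   # steady 30 ms time-between-tokens (TBT)
tbt_ms[250] = 2000.0          # a single mid-generation micro-stall

print(f"request e2e latency : {tbt_ms.sum() / 1000:.1f} s (vs 15.0 s baseline)")
print(f"median TBT          : {np.median(tbt_ms):.0f} ms")
print(f"max TBT             : {tbt_ms.max():.0f} ms")
print(f"tokens over 100 ms  : {int((tbt_ms > 100).sum())}")
```

A roughly 2 s jump on a ~15 s request rarely trips a request-level p99 alert, but a 2000 ms TBT sample against a 30 ms median is unmistakable at token granularity.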

Recommended action:

Instrument monitoring at batch/token granularity rather than at the request level. Track per-iteration latency so stalls inside long-running inference requests are visible, and switch alerting from request-level p50/p99 to token-level time-between-tokens (TBT), as sketched below.
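
A minimal sketch of per-iteration TBT tracking, assuming a Python streaming token generator and the prometheus_client library; the metric name, bucket boundaries, and the `instrumented_stream` wrapper are illustrative assumptions, not a BentoML API.

```python
# Sketch only: wrap a streaming token generator and record per-iteration
# (time-between-tokens) latency into a Prometheus histogram.
import time
from typing import Iterable, Iterator

from prometheus_client import Histogram

# Buckets sized to catch micro-stalls (tens of ms) up to multi-second hangs.
TBT_SECONDS = Histogram(
    "llm_time_between_tokens_seconds",  # illustrative metric name
    "Latency between consecutive generated tokens within one request",
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0),
)

def instrumented_stream(token_stream: Iterable[str]) -> Iterator[str]:
    """Yield tokens unchanged while recording per-iteration (TBT) latency."""
    last = time.monotonic()
    for token in token_stream:
        now = time.monotonic()
        TBT_SECONDS.observe(now - last)  # one sample per generation step
        last = now
        yield token
```

Alerting on the histogram's upper buckets (or on a per-request max-TBT value) surfaces micro-stalls that request-level p50/p99 aggregates absorb.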