Aggregated request-level metrics mask micro-stalls in token generation
Tags: warning, performance · Updated Jan 20, 2026 (via Exa)
Technologies:
How to detect:
Averaging end-to-end latency at the request level obscures sporadic stalls during intermediate token-generation steps, producing false negatives in anomaly detection even as user experience degrades.
Recommended action:
Instrument monitoring at batch/token granularity rather than at the request level. Implement per-iteration latency tracking to detect stalls within long-running inference requests, and switch from request-level p50/p99 latency to token-level time-between-tokens (TBT) monitoring.
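A minimal sketch of per-iteration tracking, assuming the serving stack exposes tokens as a stream (the `TokenLatencyMonitor` class, its stall threshold, and the wrapper approach are illustrative, not a specific library's API). It records the time between consecutive tokens and surfaces stalls that a request-level average would hide:

```python
import time
from typing import Iterable, Iterator, List


class TokenLatencyMonitor:
    """Records time-between-tokens (TBT) for a streaming generator
    and flags micro-stalls that request-level averages mask.

    Hypothetical helper for illustration; threshold is deployment-specific.
    """

    def __init__(self, stall_threshold_s: float = 0.2):
        self.stall_threshold_s = stall_threshold_s
        self.tbt_samples: List[float] = []

    def wrap(self, token_stream: Iterable[str]) -> Iterator[str]:
        # Measure the gap before each token arrives, then pass it through.
        prev = time.monotonic()
        for token in token_stream:
            now = time.monotonic()
            self.tbt_samples.append(now - prev)
            prev = now
            yield token

    def stalls(self) -> List[int]:
        # Indices of generation steps whose TBT exceeded the threshold.
        return [
            i for i, dt in enumerate(self.tbt_samples)
            if dt > self.stall_threshold_s
        ]

    def tbt_percentile(self, p: float) -> float:
        # Nearest-rank percentile over per-token latencies (e.g. token-level p99).
        ordered = sorted(self.tbt_samples)
        if not ordered:
            return 0.0
        k = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
        return ordered[k]
```

In production the same idea applies per batch iteration in the scheduler loop: emit each inter-token gap as a histogram metric so token-level p99 TBT, not request-level latency, drives alerting.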