Generation stalls cause multi-second inter-token delays that degrade user experience
Severity: critical · Category: performance · Updated Jan 20, 2026 (via Exa)
Technologies:
How to detect:
Time Between Tokens (TBT) spikes to multiple seconds during the decode phase of LLM inference, producing stuttering chatbot responses that violate latency SLAs even while average latency metrics look acceptable.
Recommended action:
Monitor per-token latency rather than request-level aggregates. Implement real-time TBT monitoring with tail percentiles (p99/max TBT) to surface generation stalls that mean and p50 metrics hide. Treat decode-stage latency as the primary indicator of user-perceived fluidity.
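A minimal sketch of the recommended TBT monitoring, assuming a generic streaming token iterator (the function name `measure_tbt`, the threshold value, and the simulated stream are all hypothetical, not from any specific inference library). It records the gap between consecutive tokens and summarizes them, showing why the median can look healthy while max TBT reveals a stall:

```python
import statistics
import time

def measure_tbt(token_stream, stall_threshold_s=0.2):
    """Consume a token stream and record time-between-tokens (TBT).

    Returns summary stats illustrating why aggregates hide stalls:
    p50 TBT can look fine while max TBT is orders of magnitude larger.
    Note: the first gap also includes time-to-first-token.
    """
    gaps = []
    prev = time.monotonic()
    for _token in token_stream:
        now = time.monotonic()
        gaps.append(now - prev)  # per-token inter-arrival gap
        prev = now
    stalls = [g for g in gaps if g > stall_threshold_s]
    return {
        "p50_tbt": statistics.median(gaps) if gaps else 0.0,
        "mean_tbt": statistics.mean(gaps) if gaps else 0.0,
        "max_tbt": max(gaps, default=0.0),
        "stall_count": len(stalls),
    }

def fake_stream():
    # Simulated decode: mostly fast tokens with one long stall in the
    # middle (scaled down from "multi-second" to keep the demo quick).
    for i in range(20):
        time.sleep(0.3 if i == 10 else 0.005)
        yield f"tok{i}"

stats = measure_tbt(fake_stream(), stall_threshold_s=0.2)
```

In production this logic would run inside the streaming loop and emit `max_tbt`/`stall_count` to a metrics backend per request, alongside (not instead of) the usual end-to-end latency histograms.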