BentoML

Generation stalls cause multi-second inter-token delays degrading user experience

critical
performance · Updated Jan 20, 2026 (via Exa)
Technologies: BentoML
How to detect:

Time-Between-Tokens (TBT) spikes to multiple seconds during LLM token generation, producing stuttering chatbot responses that violate SLAs even though average latency metrics look acceptable.
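As a minimal illustration of why averages hide this, the sketch below (with hypothetical token-arrival timestamps, not real measurements) computes TBT as the gap between consecutive token arrivals: a single 4 s stall barely moves the mean but dominates the maximum the user actually perceives.

```python
# Hypothetical token-arrival timestamps (seconds): steady 20 ms cadence with one 4 s stall.
arrivals = [i * 0.020 for i in range(151)]                            # tokens 0..150
arrivals += [arrivals[-1] + 4.0 + i * 0.020 for i in range(1, 151)]   # stall, then tokens 151..300

# Time-Between-Tokens: gap between consecutive token arrivals.
tbt_ms = [(b - a) * 1000 for a, b in zip(arrivals, arrivals[1:])]

print(f"mean TBT: {sum(tbt_ms) / len(tbt_ms):.1f} ms")  # ~33 ms: the average looks healthy
print(f"max  TBT: {max(tbt_ms):.0f} ms")                # ~4020 ms: the stall the user feels
```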

Recommended action:

Monitor batch-level inference latency rather than request-level aggregates. Implement real-time TBT monitoring to catch generation stalls that stay hidden inside averaged and p50 latency metrics. Treat Decode-stage latency as the primary indicator of user-perceived fluidity.
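A minimal client-side sketch of real-time TBT monitoring, assuming a generic streaming token iterator; the names monitor_tbt, stall threshold, and the commented-out client call are illustrative, not a BentoML API. The wrapper passes tokens through unchanged, measures each inter-token gap, and logs a warning the moment a gap exceeds the threshold instead of waiting for the stall to surface in aggregated dashboards.

```python
import time
import logging
from typing import Iterable, Iterator

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tbt-monitor")

STALL_THRESHOLD_S = 1.0  # illustrative: flag any inter-token gap above 1 s


def monitor_tbt(token_stream: Iterable[str], request_id: str) -> Iterator[str]:
    """Yield tokens unchanged while measuring Time-Between-Tokens.

    Warns as soon as a single gap exceeds the stall threshold, rather than
    waiting for the stall to be averaged away in request-level metrics.
    """
    last = time.monotonic()
    gaps: list[float] = []
    for token in token_stream:
        now = time.monotonic()
        gap = now - last
        gaps.append(gap)
        if gap > STALL_THRESHOLD_S:
            log.warning("request %s: generation stall, TBT=%.2fs", request_id, gap)
        last = now
        yield token
    if gaps:
        log.info("request %s: max TBT=%.2fs over %d tokens", request_id, max(gaps), len(gaps))


# Usage sketch: wrap whatever streaming generator the serving client exposes.
# for token in monitor_tbt(client.generate_stream(prompt), request_id="req-123"):
#     print(token, end="", flush=True)
```

Measuring at the stream boundary keeps the check independent of the serving framework: any per-request token generator can be wrapped this way, and the per-gap warning surfaces Decode-stage stalls in real time.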