BentoML

Generation stalls cause multi-second inter-token delays degrading user experience

critical
performance · Updated Jan 20, 2026 (via Exa)
Technologies: BentoML
How to detect:

Time-Between-Tokens (TBT) spikes to multiple seconds during LLM token generation, producing stuttering chatbot responses that violate SLAs even though average latency metrics look acceptable.
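As a minimal illustration of why averages hide this, the sketch below (with hypothetical token-arrival timestamps, not real measurements) computes TBT as the gap between consecutive token arrivals: a single 4 s stall barely moves the mean but dominates the maximum the user actually perceives.

```python
# Hypothetical token-arrival timestamps (seconds): steady 20 ms cadence with one 4 s stall.
arrivals = [i * 0.020 for i in range(151)]                            # tokens 0..150
arrivals += [arrivals[-1] + 4.0 + i * 0.020 for i in range(1, 151)]   # stall, then tokens 151..300

# Time-Between-Tokens: gap between consecutive token arrivals.
tbt_ms = [(b - a) * 1000 for a, b in zip(arrivals, arrivals[1:])]

print(f"mean TBT: {sum(tbt_ms) / len(tbt_ms):.1f} ms")  # ~33 ms: the average looks healthy
print(f"max  TBT: {max(tbt_ms):.0f} ms")                # ~4020 ms: the stall the user feels
```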

Recommended action:

Monitor batch-level inference latency rather than request-level aggregates. Implement real-time TBT monitoring to catch generation stalls that stay hidden inside averaged and p50 latency metrics. Treat Decode-stage latency as the primary indicator of user-perceived fluidity.
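A minimal client-side sketch of real-time TBT monitoring, assuming a generic streaming token iterator; the names monitor_tbt, stall threshold, and the commented-out client call are illustrative, not a BentoML API. The wrapper passes tokens through unchanged, measures each inter-token gap, and logs a warning the moment a gap exceeds the threshold instead of waiting for the stall to surface in aggregated dashboards.

```python
import time
import logging
from typing import Iterable, Iterator

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tbt-monitor")

STALL_THRESHOLD_S = 1.0  # illustrative: flag any inter-token gap above 1 s


def monitor_tbt(token_stream: Iterable[str], request_id: str) -> Iterator[str]:
    """Yield tokens unchanged while measuring Time-Between-Tokens.

    Warns as soon as a single gap exceeds the stall threshold, rather than
    waiting for the stall to be averaged away in request-level metrics.
    """
    last = time.monotonic()
    gaps: list[float] = []
    for token in token_stream:
        now = time.monotonic()
        gap = now - last
        gaps.append(gap)
        if gap > STALL_THRESHOLD_S:
            log.warning("request %s: generation stall, TBT=%.2fs", request_id, gap)
        last = now
        yield token
    if gaps:
        log.info("request %s: max TBT=%.2fs over %d tokens", request_id, max(gaps), len(gaps))


# Usage sketch: wrap whatever streaming generator the serving client exposes.
# for token in monitor_tbt(client.generate_stream(prompt), request_id="req-123"):
#     print(token, end="", flush=True)
```

Measuring at the stream boundary keeps the check independent of the serving framework: any per-request token generator can be wrapped this way, and the per-gap warning surfaces Decode-stage stalls in real time.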