BentoML · vLLM

Prefill-stage latency varies wildly with KV-cache layout, making baseline modeling noisy

info · performance
Updated Jan 20, 2026 (via Exa)
How to detect:

Prefill execution times fluctuate drastically, with a long-tail distribution even for identical input lengths. The cause is variation in KV-cache hit rates introduced by the PagedAttention and RadixAttention optimizations: a request whose prefix is already cached skips most of the prefill compute, while a cache miss pays the full cost. This makes physical baseline modeling of the Prefill stage prone to noise.
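One way to operationalize this check is to compare a tail percentile against the median of prefill latencies collected for requests of the same input length; a large ratio points to cache-hit variation rather than a workload change. This is an illustrative sketch — the function name and the threshold are assumptions, not part of vLLM or BentoML:

```python
def prefill_variability(latencies_ms, tail_ratio_threshold=3.0):
    """Flag long-tail prefill latency by comparing p99 to the median.

    `latencies_ms` should be prefill times for requests with the *same*
    input length, so a high p99/p50 ratio suggests KV-cache hit-rate
    variation (PagedAttention/RadixAttention) rather than longer prompts.
    The threshold of 3.0 is an illustrative default, not a vLLM value.
    """
    xs = sorted(latencies_ms)
    p50 = xs[len(xs) // 2]
    p99 = xs[min(len(xs) - 1, int(len(xs) * 0.99))]
    ratio = p99 / p50
    return ratio > tail_ratio_threshold, ratio

# Identical ~1k-token prompts: cache hits are fast, misses are slow.
flagged, ratio = prefill_variability([40, 42, 41, 39, 43, 40, 320, 41, 38, 300])
```

With the sample above, the cache-miss requests (300–320 ms) push p99 far above the ~41 ms median, so the series is flagged as long-tailed — exactly the pattern that should be tolerated rather than alerted on.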

Recommended action:

Prioritize Decode-stage latency monitoring for anomaly detection rather than Prefill. Accept Prefill variability as a normal operational characteristic when PagedAttention/RadixAttention optimizations are enabled, and focus user-experience monitoring on Time-Between-Tokens (TBT) rather than Time-to-First-Token (TTFT).
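A minimal sketch of TBT-based decode monitoring, assuming you can record the arrival timestamp of each output token: flag inter-token gaps far above the request's typical gap. The function name and the median-multiple rule are illustrative assumptions, not a BentoML or vLLM API:

```python
def tbt_anomalies(token_timestamps, factor=5.0):
    """Return indices of anomalous Time-Between-Tokens gaps.

    `token_timestamps` are arrival times (seconds) of successive output
    tokens for one request. A gap far above the request's median TBT
    indicates a decode-stage stall worth alerting on; the factor of 5.0
    is an illustrative default. The median is robust to the occasional
    stall itself, unlike a mean-based baseline.
    """
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    median_gap = sorted(gaps)[len(gaps) // 2]
    return [i for i, g in enumerate(gaps) if g > factor * median_gap]

# Steady ~30 ms/token decode with one ~500 ms stall after token 3.
stalls = tbt_anomalies([0.0, 0.03, 0.06, 0.09, 0.59, 0.62, 0.65])
```

Because this watches only the Decode stage, it stays quiet through the cache-driven Prefill noise described above while still catching the stalls users actually perceive as jank mid-stream.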