BentoML · vLLM

Prefill-stage latency varies wildly with KV-cache layout, making baseline modeling noisy

info · performance
Updated Jan 20, 2026 (via Exa)
How to detect:

Prefill execution times fluctuate drastically, with a long-tail distribution even for identical input lengths. The cause is variation in KV-cache hit rates introduced by the PagedAttention and RadixAttention optimizations: a request whose prefix is already cached skips most of the prefill compute, while a cache miss pays the full cost. This makes physical baseline modeling of the Prefill stage prone to noise.
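One way to operationalize this check is to compare a tail percentile against the median of prefill latencies collected for requests of the same input length; a large ratio points to cache-hit variation rather than a workload change. This is an illustrative sketch — the function name and the threshold are assumptions, not part of vLLM or BentoML:

```python
def prefill_variability(latencies_ms, tail_ratio_threshold=3.0):
    """Flag long-tail prefill latency by comparing p99 to the median.

    `latencies_ms` should be prefill times for requests with the *same*
    input length, so a high p99/p50 ratio suggests KV-cache hit-rate
    variation (PagedAttention/RadixAttention) rather than longer prompts.
    The threshold of 3.0 is an illustrative default, not a vLLM value.
    """
    xs = sorted(latencies_ms)
    p50 = xs[len(xs) // 2]
    p99 = xs[min(len(xs) - 1, int(len(xs) * 0.99))]
    ratio = p99 / p50
    return ratio > tail_ratio_threshold, ratio

# Identical ~1k-token prompts: cache hits are fast, misses are slow.
flagged, ratio = prefill_variability([40, 42, 41, 39, 43, 40, 320, 41, 38, 300])
```

With the sample above, the cache-miss requests (300–320 ms) push p99 far above the ~41 ms median, so the series is flagged as long-tailed — exactly the pattern that should be tolerated rather than alerted on.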

Recommended action:

Prioritize Decode-stage latency monitoring for anomaly detection rather than Prefill. Accept Prefill variability as a normal operational characteristic when PagedAttention/RadixAttention optimizations are enabled, and focus user-experience monitoring on Time-Between-Tokens (TBT) rather than Time-to-First-Token (TTFT).
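A minimal sketch of TBT-based decode monitoring, assuming you can record the arrival timestamp of each output token: flag inter-token gaps far above the request's typical gap. The function name and the median-multiple rule are illustrative assumptions, not a BentoML or vLLM API:

```python
def tbt_anomalies(token_timestamps, factor=5.0):
    """Return indices of anomalous Time-Between-Tokens gaps.

    `token_timestamps` are arrival times (seconds) of successive output
    tokens for one request. A gap far above the request's median TBT
    indicates a decode-stage stall worth alerting on; the factor of 5.0
    is an illustrative default. The median is robust to the occasional
    stall itself, unlike a mean-based baseline.
    """
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    median_gap = sorted(gaps)[len(gaps) // 2]
    return [i for i, g in enumerate(gaps) if g > factor * median_gap]

# Steady ~30 ms/token decode with one ~500 ms stall after token 3.
stalls = tbt_anomalies([0.0, 0.03, 0.06, 0.09, 0.59, 0.62, 0.65])
```

Because this watches only the Decode stage, it stays quiet through the cache-driven Prefill noise described above while still catching the stalls users actually perceive as jank mid-stream.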