Prefill-stage latency varies wildly with KV-cache hit rate, making baseline modeling noisy
performance · Updated Jan 20, 2026 (via Exa)
How to detect:
Prefill execution times fluctuate drastically, showing a long-tail distribution even for identical input lengths. The cause is KV-cache hit-rate variation introduced by PagedAttention and RadixAttention optimizations, which makes physical baseline modeling prone to noise.
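One way to confirm this signature is to bucket Time-to-First-Token (TTFT) samples by input length and compare tail to median latency: a large p99/p50 ratio for a single fixed prompt length points at cache-hit variation rather than input-size effects. A minimal sketch, with a hypothetical function name and ms-denominated samples assumed:

```python
def ttft_tail_ratio(samples_ms, pct=0.99):
    """Return the p99/p50 ratio for TTFT samples collected at a *single*
    input length. A high ratio despite identical prompt sizes is the
    long-tail pattern described above (hypothetical helper, not a real API)."""
    s = sorted(samples_ms)
    def pctl(p):
        # nearest-rank percentile, clamped to the last element
        return s[min(len(s) - 1, int(p * len(s)))]
    return pctl(pct) / pctl(0.5)
```

For example, 99 cache-hit prefills at ~10 ms plus one cold 100 ms prefill yields a ratio of 10, flagging a heavy tail that would look like an anomaly under a naive mean-based baseline.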
Recommended action:
Prioritize Decode-stage latency monitoring for anomaly detection rather than Prefill. Treat Prefill variability as a normal operational characteristic when PagedAttention/RadixAttention optimizations are enabled. Focus user-experience monitoring on Time-Between-Tokens (TBT) rather than Time-to-First-Token (TTFT).