Time-to-First-Token (TTFT) Spikes Under Load
Severity: critical

TTFT combines scheduling delay and prompt-processing time, making it highly sensitive to system load and prompt length. Spikes indicate resource contention (GPU memory pressure, request queuing) or unexpectedly large prompts, and directly degrade user-perceived responsiveness.
Monitor p95/p99 of gen_ai_server_time_to_first_token and langchain_request_time, especially when increases correlate with rising langchain_chain_run counts or spikes in langchain_tokens_prompt. Cross-reference with error rates (langchain_request_error, langchain_chain_error) to rule out failures that mask latency.
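The correlation check above can be sketched as a simple triage rule. This is a minimal illustration, not an implementation of any specific monitoring stack: the function names, the 2x spike threshold, and the 1.5x prompt-growth threshold are all assumptions to be tuned against your baselines.

```python
import statistics

def p95(samples):
    """Return the 95th percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return ordered[idx]

def classify_ttft_spike(ttft_s, prompt_tokens, baseline_p95_s, baseline_tokens):
    """Flag a TTFT spike and guess its driver from correlated signals.

    ttft_s:         recent TTFT samples (seconds)
    prompt_tokens:  prompt sizes for the same window
    Returns one of: "ok", "prompt-driven", "load-driven".
    """
    cur_p95 = p95(ttft_s)
    if cur_p95 <= 2 * baseline_p95_s:        # assumed spike threshold: 2x baseline p95
        return "ok"
    mean_tokens = statistics.mean(prompt_tokens)
    if mean_tokens > 1.5 * baseline_tokens:  # prompts grew alongside latency
        return "prompt-driven"
    return "load-driven"                     # latency rose without bigger prompts
```

The same windowed comparison maps directly onto a recording-rule or alert expression in whatever metrics backend serves these counters.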
Determine whether spikes are load-driven (mitigate with request throttling, autoscaling, or layer-wise KV-cache offloading) or prompt-driven (apply dynamic token pruning and validate prompt-construction logic). Use fluidity-index or time-between-tokens (TBT) metrics to confirm whether streaming consistency also degrades.