LlamaIndex Query Latency P95 Degradation
Warning: LlamaIndex query response times degrade at the P95/P99 percentiles because of slow LLM calls, inefficient retrieval, or tool execution bottlenecks. Without a granular latency breakdown, the responsible component is hard to isolate.
Track end-to-end query latency via llama_index.query_engine.duration at the P50, P95, and P99 percentiles. Correlate spikes with specific workflow components: llama_index.llm.completion.duration for LLM overhead, llama_index.retrieval.duration for vector search time, and llama_index.agent.step.duration for agent orchestration. Alert when the P95 of llama_index.query_engine.duration exceeds the SLA threshold (e.g., >3000 ms) or reaches 2x its baseline.
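The alert condition above can be sketched in a few lines of Python. This is a minimal illustration, not the monitoring backend's own logic: it assumes the duration samples have already been exported as plain lists of milliseconds, it uses the nearest-rank percentile definition (many backends interpolate instead, so values may differ slightly), and the names percentile, should_alert, and dominant_component are hypothetical helpers introduced here.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def should_alert(durations_ms, baseline_p95_ms, sla_ms=3000.0, ratio=2.0):
    """True when P95 exceeds the SLA threshold or reaches ratio x the baseline P95."""
    p95 = percentile(durations_ms, 95)
    return p95 > sla_ms or p95 >= ratio * baseline_p95_ms


def dominant_component(component_durations_ms):
    """Name of the component with the largest P95 among the given samples."""
    p95_by_name = {
        name: percentile(values, 95)
        for name, values in component_durations_ms.items()
    }
    return max(p95_by_name, key=p95_by_name.get)
```

Calling dominant_component with a mapping such as {"llm": [...], "retrieval": [...], "agent": [...]} gives a first hint at which component to examine when should_alert fires. The 3000 ms and 2x defaults mirror the example thresholds in the text and should be replaced with the actual SLA values.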
1. Investigate: Build a dashboard showing P50/P95/P99 for query_engine.duration alongside the component breakdowns (llm.completion.duration, retrieval.duration, agent.step.duration). Identify which component dominates tail latency.
2. Diagnose: For LLM bottlenecks, check whether specific models or prompt sizes are slower. For retrieval bottlenecks, analyze llama_index.retrieval.documents.count to see whether over-retrieval is occurring. For agent bottlenecks, check the frequency of agent.tool.calls.
3. Remediate: For the LLM, use faster or smaller models for non-critical queries and cache responses with a 10-15 minute TTL. For retrieval, reduce top_k, apply semantic deduplication, and use hybrid search. For agents, parallelize independent tool calls and reduce step complexity.
4. Prevent: Set SLA alerts on the P95 of query_engine.duration. Implement request hedging (parallel execution with a timeout) to mitigate tail latency. Add a per-component latency budget breakdown to the dashboard.
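Request hedging, mentioned in the Prevent step, could look roughly like the asyncio sketch below. It is one possible shape under stated assumptions: hedged_call is a hypothetical helper, make_request stands for any zero-argument coroutine function that issues the query (for example a wrapper around an async query engine call), and the request is assumed to be safe to issue twice. The first attempt to finish wins, even if it finished with an error.

```python
import asyncio


async def hedged_call(make_request, hedge_delay_s, timeout_s):
    """Run make_request(); if it is still pending after hedge_delay_s, start a
    second attempt and return whichever finishes first. Raises
    asyncio.TimeoutError if nothing finishes within timeout_s overall."""
    tasks = []

    async def race():
        tasks.append(asyncio.ensure_future(make_request()))
        # Give the primary attempt a head start before hedging.
        done, _ = await asyncio.wait(tasks, timeout=hedge_delay_s)
        if not done:
            # Primary is slow: launch the backup and wait for the first finisher.
            tasks.append(asyncio.ensure_future(make_request()))
            done, _ = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED
            )
        # result() re-raises if the winning attempt failed.
        return next(iter(done)).result()

    try:
        return await asyncio.wait_for(race(), timeout=timeout_s)
    finally:
        # Cancel whatever is still running; cancelling a finished task is a no-op.
        for task in tasks:
            task.cancel()
```

A reasonable starting point is to set hedge_delay_s near the observed P95 of the call, so that only the slowest few percent of requests trigger a second attempt and the extra load stays small. Hedging doubles cost for those requests, so it fits idempotent reads better than tool calls with side effects.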