LlamaIndex LLM Request Rate Spike
Severity: warning
Abnormal spike in the LLM request rate indicates potential abuse, a runaway agent loop, or unexpected traffic patterns that can exhaust provider rate limits and inflate costs.
Monitor the llama_index.llm.requests rate (requests per minute). Alert when the rate exceeds 2x baseline for a sustained period (>5 minutes) or spikes above an absolute threshold (e.g., >100 requests/min for small deployments). Correlate with llama_index.agent.steps.count to determine whether agent loops are responsible.
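As a rough sketch of the threshold logic above (the metric names come from this monitor, but the one-sample-per-minute cadence and window length are assumptions), the sustained-spike check might be implemented like this:

```python
from collections import deque

def is_spike(recent_rates, baseline, factor=2.0, sustain_points=5):
    """Flag a spike when the last `sustain_points` samples (assumed one
    per minute, so 5 points ~= 5 minutes) all exceed factor * baseline."""
    if len(recent_rates) < sustain_points:
        return False
    window = list(recent_rates)[-sustain_points:]
    return all(rate > factor * baseline for rate in window)

# Baseline ~50 req/min; the last five samples are all above 2x baseline.
samples = deque(maxlen=10)
for rate in [40, 45, 110, 120, 115, 118, 122]:
    samples.append(rate)

print(is_spike(samples, baseline=50))
```

Keying the check on consecutive samples (rather than a single reading) avoids paging on one-minute blips while still catching the sustained pattern described above.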
1. Investigate:
   - Identify the source of the traffic spike (user, endpoint, feature) and determine whether the increase is legitimate or anomalous.
   - Review logs for error patterns that might be triggering retry storms.
2. Diagnose:
   - Agent loops: examine agent.steps.count per request to find runaway loops (e.g., >20 steps when 3-5 are expected).
   - Legitimate traffic: verify capacity-planning assumptions.
   - Abuse: check for repeated identical queries or other suspicious patterns.
3. Remediate:
   - Implement per-user/per-session rate limiting.
   - Add circuit breakers on agent step count (e.g., max 10 steps per query).
   - Retry LLM API rate-limit errors with exponential backoff and jitter.
   - For genuine traffic spikes: scale infrastructure or enable request queuing.
4. Prevent:
   - Set request-rate alerts at 1.5x and 2x baseline.
   - Dashboard llm.requests by source/user/endpoint.
   - Enforce agent step-count limits in code.
   - Add cost guardrails that pause processing when the daily budget is exceeded.
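The per-user/per-session rate limiting in the remediation step could be sketched as a token bucket. This is a minimal in-process version; the rate and burst values are illustrative, and a real deployment would likely back this with a shared store such as Redis:

```python
import time

class TokenBucket:
    """Per-user token bucket: refills at `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # user_id -> TokenBucket

def check_rate_limit(user_id, rate=1.0, capacity=5):
    """Return True if this user's request is allowed, False if throttled."""
    bucket = buckets.setdefault(user_id, TokenBucket(rate, capacity))
    return bucket.allow()

# A burst of 6 back-to-back requests: the first 5 fit the burst capacity,
# the 6th is throttled until tokens refill.
allowed = [check_rate_limit("user-42") for _ in range(6)]
print(allowed)
```

Rejected requests should return a throttling response (e.g., HTTP 429) rather than silently queueing, so client retries don't compound the spike.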
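The agent step circuit breaker from the remediation step might look like the sketch below. The `step_fn` callable, the state dict, and the 10-step default are assumptions for illustration, not LlamaIndex's agent API:

```python
class AgentStepLimitExceeded(RuntimeError):
    """Raised when an agent loop exceeds its step budget."""

def run_agent(step_fn, query, max_steps=10):
    """Drive an agent loop, but abort once max_steps is reached instead of
    letting a runaway loop keep hammering the LLM API."""
    state = {"query": query, "done": False, "steps": 0}
    for step in range(max_steps):
        state = step_fn(state)
        state["steps"] = step + 1
        if state.get("done"):
            return state
    raise AgentStepLimitExceeded(
        f"agent exceeded {max_steps} steps for query: {query!r}"
    )
```

Failing loudly here is deliberate: the raised exception surfaces the runaway query in logs (and in agent.steps.count) instead of quietly burning the request budget.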
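Retrying rate-limit errors with exponential backoff and full jitter, as the remediation step suggests, could be sketched as follows. `RateLimitError` here is a stand-in for whatever exception your LLM provider's client raises on HTTP 429; the base delay and cap are illustrative:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the provider's rate-limit (429) exception type."""

def call_with_backoff(fn, max_retries=5, base=0.5, cap=30.0):
    """Call fn(), retrying on RateLimitError with exponential backoff
    plus full jitter to avoid synchronized retry storms."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # budget exhausted; surface the error
            # Full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)].
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

The jitter matters as much as the exponential growth: without it, many clients that were throttled at the same moment retry at the same moment, recreating the spike.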