LlamaIndex · OpenAI

LlamaIndex LLM Token Budget Overrun

Severity: warning · Category: performance · Updated Aug 28, 2025

Without per-request cost tracking, LlamaIndex agents can exceed token budgets through unmonitored prompt and completion growth, leading to unexpected costs and API rate-limit errors.

How to detect:

Monitor llama_index.llm.completion.tokens.prompt and llama_index.llm.completion.tokens.completion per LLM call. Alert when per-request usage (llama_index.llm.tokens.total) exceeds a threshold (e.g., >4000 tokens against the GPT-3.5-turbo context window) or when the daily aggregate burn rate projects past budget. Track the rate of llama_index.llm.requests to estimate the cost trajectory.
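The detection logic above can be sketched as a small in-process monitor. This is a hedged, illustrative example: the class and thresholds (TokenBudgetMonitor, PER_REQUEST_LIMIT, DAILY_BUDGET) are assumptions, not part of LlamaIndex or any observability platform; in practice these checks would run as alert rules over the llama_index.llm.* metrics.

```python
# Illustrative sketch, not a LlamaIndex API: per-request budget check plus a
# linear projection of daily token burn, mirroring the alert conditions above.

PER_REQUEST_LIMIT = 4000      # assumed guard, e.g. GPT-3.5-turbo context window
DAILY_BUDGET = 1_000_000      # assumed total tokens allowed per day

class TokenBudgetMonitor:
    def __init__(self, per_request_limit: int = PER_REQUEST_LIMIT,
                 daily_budget: int = DAILY_BUDGET) -> None:
        self.per_request_limit = per_request_limit
        self.daily_budget = daily_budget
        self.requests = 0
        self.total_tokens = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> list[str]:
        """Record one LLM call; return any triggered per-request alerts."""
        total = prompt_tokens + completion_tokens
        self.requests += 1
        self.total_tokens += total
        alerts = []
        if total > self.per_request_limit:
            alerts.append(f"per-request budget exceeded: {total} tokens")
        return alerts

    def projected_daily_burn(self, hours_elapsed: float) -> float:
        """Linear projection of daily token use from the burn rate so far."""
        return self.total_tokens * (24.0 / hours_elapsed)

    def over_budget_projection(self, hours_elapsed: float) -> bool:
        """True when the projected daily burn exceeds the daily budget."""
        return self.projected_daily_burn(hours_elapsed) > self.daily_budget
```

A real deployment would feed `record()` from the per-call token counts and evaluate the projection on a schedule rather than inline.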

Recommended action:

1. Investigate: Query the observability platform for top consumers by endpoint, user, or query type. Calculate average tokens per request (total/requests) and compare to baseline.
2. Diagnose: Identify queries with abnormally high prompt tokens (indicating context bloat) or completion tokens (indicating overly verbose responses). Check for prompt template inefficiencies or excessive context retrieval.
3. Remediate: Implement token counting guards in application code to truncate context when approaching limits. Configure LLM parameters (max_tokens, temperature) to constrain response length. Add caching for repeated queries.
4. Prevent: Set up cost alerts based on llama_index.llm.tokens.total with daily/weekly thresholds. Dashboard token usage per application feature to identify optimization targets. Implement retry backoff when rate limits are hit.
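The token counting guard from step 3 can be sketched as below. This is a minimal assumption-laden example: `estimate_tokens` uses the common ~4 characters/token heuristic (a production guard would use a real tokenizer such as tiktoken), and `truncate_context` and `CONTEXT_TOKEN_BUDGET` are illustrative names, not LlamaIndex APIs.

```python
# Hypothetical token-counting guard: keep retrieved context chunks, in
# ranking order, only while the assembled prompt stays under a budget.

CONTEXT_TOKEN_BUDGET = 3000  # assumed budget, leaving headroom for the completion

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def truncate_context(chunks: list[str], budget: int = CONTEXT_TOKEN_BUDGET) -> list[str]:
    """Drop lower-ranked chunks once the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # this chunk and all later (lower-ranked) ones are dropped
        kept.append(chunk)
        used += cost
    return kept
```

Pairing this guard with a `max_tokens` cap on the LLM call bounds both sides of the per-request total that the alert above monitors.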