
LlamaIndex Embedding Batch Processing Inefficiency

Performance · Updated Mar 2, 2026

Document embedding during indexing becomes inefficient when batches are small or absent: each document triggers its own API call, producing excessive request volume, increased latency, and higher costs compared to batched requests.

Technologies:
How to detect:

Monitor the ratio of llama_index.embedding.requests to documents processed. Alert when embedding requests are abnormally high relative to document count (e.g., >1 request per document when batch API is available). Track llama_index.embedding.duration to identify if sequential processing is adding latency.
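The detection heuristic above can be sketched as a simple ratio check. The function name and the 1-request-per-document threshold are assumptions for illustration, not part of any monitoring API:

```python
def batch_efficiency(embedding_requests: int, documents_processed: int,
                     max_ratio: float = 1.0) -> tuple[float, bool]:
    """Return (requests per document, whether to alert).

    A ratio near 1.0 means one API call per document (no batching);
    with a batch size of 32, a healthy ratio is roughly 1/32.
    """
    if documents_processed == 0:
        return 0.0, False
    ratio = embedding_requests / documents_processed
    return ratio, ratio > max_ratio

# 500 documents embedded in batches of 32 -> 16 requests, ratio 0.032, no alert.
# 500 requests for 400 documents -> ratio 1.25, alert.
```

The threshold should be tuned to the configured batch size; alerting at exactly 1.0 only catches the fully unbatched case.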

Recommended action:

1. Investigate: Calculate the embedding-requests-per-document ratio during indexing. Check whether the embedding provider supports a batch API. Review the indexing code to verify that batch processing is implemented.
2. Diagnose: Determine the optimal batch size from API limits and the document size distribution. Check whether the current implementation processes documents sequentially or uses small batches. Measure the latency difference between single and batch requests.
3. Remediate: Implement batch embedding API calls with an optimal batch size (typically 16-64 documents, depending on the provider). Add async/parallel processing for multiple batches. Configure batch accumulation with a timeout (e.g., flush every 100 documents or every 5 seconds).
4. Prevent: Document the expected ratio of embedding requests to documents, and alert on deviations from it. Add embedding request efficiency metrics to a dashboard. Include batch processing tests in indexing pipeline validation.
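The remediation steps above can be sketched in plain Python. This is a minimal illustration, not LlamaIndex's implementation: `embed_batch` is a placeholder for a provider's batch endpoint (LlamaIndex embedding classes expose a similar knob via their `embed_batch_size` setting), and the dummy vectors stand in for real embeddings:

```python
import asyncio
from typing import Iterator


def chunked(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield successive batches so one request covers many documents."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]


async def embed_batch(batch: list[str]) -> list[list[float]]:
    # Placeholder for the provider's batch embedding call (assumption).
    await asyncio.sleep(0)  # stands in for network latency
    return [[float(len(t))] for t in batch]  # dummy one-dimensional vectors


async def embed_all(texts: list[str], batch_size: int = 32,
                    max_concurrency: int = 4) -> list[list[float]]:
    """Embed batches concurrently, bounded by a semaphore to respect rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run(batch: list[str]) -> list[list[float]]:
        async with sem:
            return await embed_batch(batch)

    results = await asyncio.gather(*(run(b) for b in chunked(texts, batch_size)))
    # Flatten per-batch results back into one vector per input document.
    return [vec for batch in results for vec in batch]
```

With 100 documents and a batch size of 32, this issues 4 requests instead of 100; the semaphore caps in-flight requests so parallelism does not trip provider rate limits.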