Async Trace Logging Queue Overflow in High-Traffic Production
warning | reliability | Updated Jan 1, 2025
When production GenAI traffic produces traces faster than the async logging workers can export them, the queue (1000 traces by default) fills up and new traces are silently discarded. This creates observability blind spots during peak load or incidents, when traces are needed most.
How to detect:
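Monitor queue utilization against the limit set by MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE, the number of dropped traces, and whether spikes in request volume coincide with missing traces. Check for gaps in trace continuity, for example time windows in which the request count is well above the trace count.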
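The correlation check can be sketched as a comparison of per-window request counts against per-window trace counts. This is a minimal illustration, not an MLflow API: the function name find_trace_gaps, the window labels, and the 5% tolerance are all assumptions, and the two input dictionaries are expected to come from your own request metrics and trace store.

```python
from typing import Dict, List, Tuple


def find_trace_gaps(
    requests_per_window: Dict[str, int],
    traces_per_window: Dict[str, int],
    sampling_ratio: float = 1.0,
    tolerance: float = 0.05,
) -> List[Tuple[str, int, int]]:
    """Return (window, expected_traces, observed_traces) for suspicious windows.

    Hypothetical helper: both inputs map a time-window label (e.g. "12:01")
    to a count collected outside of this function.
    """
    gaps = []
    for window, requests in sorted(requests_per_window.items()):
        # With sampling enabled, only a fraction of requests should be traced.
        expected = round(requests * sampling_ratio)
        observed = traces_per_window.get(window, 0)
        # Flag the window when observed traces fall short of the expectation
        # by more than the allowed tolerance.
        if expected > 0 and observed < expected * (1 - tolerance):
            gaps.append((window, expected, observed))
    return gaps


if __name__ == "__main__":
    requests = {"12:00": 800, "12:01": 4000, "12:02": 900}
    traces = {"12:00": 800, "12:01": 2600, "12:02": 895}
    for window, expected, observed in find_trace_gaps(requests, traces):
        print(f"{window}: expected ~{expected} traces, saw {observed}")
```

When sampling is enabled, the observed count varies randomly around the expectation, so a wider tolerance is likely needed to avoid false alarms on low-volume windows.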
Recommended action:
Increase the queue size via MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE (e.g., 5000) and the worker count via MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS (e.g., 20), sized to observed traffic patterns. For high-volume endpoints, enable trace sampling (e.g., MLFLOW_TRACE_SAMPLING_RATIO=0.1, which keeps roughly 10% of traces). Keep monitoring queue depth and revisit these values as traffic changes.
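One way to apply these settings from application code is sketched below. The environment variable names and example values are the ones given above; the helper configure_async_trace_logging is a hypothetical wrapper, and the assumption that the variables must be set before the application logs its first trace is a precaution, not confirmed behavior.

```python
import os
from typing import Dict


def configure_async_trace_logging(
    max_queue_size: int = 5000,
    max_workers: int = 20,
    sampling_ratio: float = 0.1,
) -> Dict[str, str]:
    """Set the async trace logging environment variables for this process.

    Hypothetical helper: call it at startup, before any trace is logged
    (assumption: the queue may read its settings only once, on creation).
    """
    if max_queue_size <= 0 or max_workers <= 0:
        raise ValueError("queue size and worker count must be positive")
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling ratio must be between 0.0 and 1.0")

    settings = {
        "MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE": str(max_queue_size),
        "MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS": str(max_workers),
        "MLFLOW_TRACE_SAMPLING_RATIO": str(sampling_ratio),
    }
    # Environment variables are plain strings, so values are converted above.
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    for name, value in configure_async_trace_logging().items():
        print(f"{name}={value}")
```

The same values can instead be set in the deployment environment (container spec, service unit, shell profile), which avoids any ordering concerns inside the application.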