MLflow

Async Trace Logging Queue Overflow in High-Traffic Production

warning
reliability
Updated Jan 1, 2025

When production GenAI traffic produces traces faster than the async logging workers can export them, the queue fills to its capacity (default 1000 traces) and new traces are silently discarded, creating observability blind spots during peak load or incidents.

How to detect:

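Monitor async queue depth relative to the limit set by MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE, the number of dropped traces, and whether spikes in request volume coincide with missing traces. Check for gaps in trace continuity, such as time windows where far fewer traces were logged than requests were served.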
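As a rough illustration of the correlation check, the sketch below compares per-window request counts with per-window logged trace counts and flags windows where traces fall short. The function name find_trace_gaps, its parameters, and the 10% tolerance are hypothetical choices for this example, not part of MLflow. It assumes you already export both series from your own metrics and trace store.

```python
def find_trace_gaps(requests, traces, sampling_ratio=1.0, tolerance=0.1):
    """Return indices of time windows where logged traces fall short.

    requests: request counts per window (from your own serving metrics).
    traces: logged trace counts for the same windows (from your trace store).
    sampling_ratio: expected fraction of requests that produce a trace.
    tolerance: allowed relative shortfall before a window is flagged
        (hypothetical default; tune it to your sampling noise).
    """
    if len(requests) != len(traces):
        raise ValueError("requests and traces must cover the same windows")

    gaps = []
    for index, (request_count, trace_count) in enumerate(zip(requests, traces)):
        expected = request_count * sampling_ratio
        # Flag the window only if traces are clearly below expectation.
        if expected > 0 and trace_count < expected * (1 - tolerance):
            gaps.append(index)
    return gaps


# Window 2 is a traffic spike where fewer than half the traces arrived.
print(find_trace_gaps([100, 120, 900, 110], [100, 119, 400, 110]))  # [2]
```

Flagged windows that line up with traffic spikes are a hint, not proof, that the queue overflowed. A slow or failing tracking backend can produce the same pattern.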

Recommended action:

Increase queue size via MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE (e.g., 5000) and worker count via MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS (e.g., 20) based on traffic patterns. Implement trace sampling for high-volume endpoints (e.g., MLFLOW_TRACE_SAMPLING_RATIO=0.1 to keep roughly 10% of traces). Continue monitoring queue depth and revisit these values as traffic changes.
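A minimal sketch of applying these settings from Python follows, using the example values from this card rather than tuned recommendations. The helper apply_trace_logging_settings is hypothetical. The sketch assumes MLflow reads these variables when tracing is initialized, so they need to be set before the application starts logging traces.

```python
import os


def apply_trace_logging_settings(queue_size=5000, workers=20,
                                 sampling_ratio=0.1, environ=os.environ):
    """Write the async trace logging settings into an environment mapping.

    Defaults are the example values from this card; size them to your traffic.
    Pass a plain dict as `environ` to preview the settings without side effects.
    """
    if queue_size <= 0 or workers <= 0:
        raise ValueError("queue_size and workers must be positive")
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be between 0 and 1")

    settings = {
        "MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE": str(queue_size),
        "MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS": str(workers),
        "MLFLOW_TRACE_SAMPLING_RATIO": str(sampling_ratio),
    }
    environ.update(settings)
    return settings


# Preview into a throwaway dict instead of the real process environment.
preview = {}
apply_trace_logging_settings(environ=preview)
print(preview["MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE"])  # 5000
```

In most deployments the same variables would be set in the container or service configuration instead. The helper is only useful where the environment is assembled in code.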