Async Trace Logging Queue Overflow in High-Traffic Production

warning

reliability

Updated Jan 1, 2025

When production GenAI traffic exceeds async logging queue capacity (default 1000 traces), new traces are silently discarded, creating observability blind spots during peak load or incidents.

Sources

Monitoring GenAI Application in Production | MLflow (mlflow.org)

Production Tracing and Monitoring | MLflow (mlflow.org)

Technologies:

MLflow: the root cause of this issue originates in MLflow

mlflow.tracing.queue_size

mlflow.tracing.traces_dropped

mlflow.tracing.async_workers_utilization

How to detect:

Monitor queue utilization against the limit set by MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE, track the dropped-trace count, and correlate request volume spikes with missing traces. Check for gaps in trace continuity, especially during peak load.
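One way to check for such gaps is to compare request counts with recorded trace counts per time window. The sketch below is illustrative only: `WindowStats`, `find_trace_gaps`, and the 5% tolerance are hypothetical names and values, not part of MLflow, and the counts are assumed to come from your own request metrics and trace store.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts for one time bucket (hypothetical structure, not an MLflow type)."""
    window_start: str  # label for the bucket, e.g. "10:01"
    requests: int      # requests served, taken from app or load-balancer metrics
    traces: int        # traces actually recorded for the same bucket


def find_trace_gaps(windows, sampling_ratio=1.0, tolerance=0.05):
    """Return (window_start, missing_fraction) for windows short on traces.

    expected traces = requests * sampling_ratio; a window is flagged when the
    shortfall exceeds `tolerance` (an assumed threshold, tune it per service).
    """
    gaps = []
    for w in windows:
        expected = w.requests * sampling_ratio
        if expected <= 0:
            continue  # nothing was expected in this window
        missing_fraction = max(0.0, 1.0 - w.traces / expected)
        if missing_fraction > tolerance:
            gaps.append((w.window_start, round(missing_fraction, 3)))
    return gaps


if __name__ == "__main__":
    sample = [
        WindowStats("10:00", requests=1000, traces=995),
        WindowStats("10:01", requests=5000, traces=3000),  # spike with missing traces
    ]
    print(find_trace_gaps(sample))
```

When sampling is enabled, recorded counts vary statistically around the expected value, so a wider tolerance is likely needed to avoid false alarms.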

Recommended action:

Increase the queue size via MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE (e.g., 5000) and the worker count via MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS (e.g., 20), sizing both to observed traffic patterns. For high-volume endpoints, enable trace sampling (MLFLOW_TRACE_SAMPLING_RATIO=0.1 records roughly 10% of traces) so fewer traces enter the queue. Keep monitoring queue depth and revise these values as traffic changes.
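A minimal sketch of applying these settings from Python follows. The environment variable names and example values are the ones given above; the helper `configure_async_trace_logging` is a hypothetical convenience function, and calling it before MLflow tracing is first used is an assumption about when the values are read.

```python
import os


def configure_async_trace_logging(max_queue_size=5000, max_workers=20,
                                  sampling_ratio=None):
    """Set the async trace logging environment variables for this process.

    Hypothetical helper. Assumption: call it before MLflow tracing is first
    used, so the values are in place when the async queue is created.
    """
    if max_queue_size <= 0 or max_workers <= 0:
        raise ValueError("queue size and worker count must be positive")
    os.environ["MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE"] = str(max_queue_size)
    os.environ["MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS"] = str(max_workers)

    if sampling_ratio is not None:
        if not 0.0 <= sampling_ratio <= 1.0:
            raise ValueError("sampling_ratio must be between 0 and 1")
        os.environ["MLFLOW_TRACE_SAMPLING_RATIO"] = str(sampling_ratio)


if __name__ == "__main__":
    # Example values from the recommendation above; tune them to your traffic.
    configure_async_trace_logging(max_queue_size=5000, max_workers=20,
                                  sampling_ratio=0.1)
    for name in ("MLFLOW_ASYNC_TRACE_LOGGING_MAX_QUEUE_SIZE",
                 "MLFLOW_ASYNC_TRACE_LOGGING_MAX_WORKERS",
                 "MLFLOW_TRACE_SAMPLING_RATIO"):
        print(name, "=", os.environ[name])
```

In containerized deployments the same variables would more commonly be set in the service's environment configuration than in application code; the in-process form here is only for illustration.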