MLflow · OpenAI

LLM Judge Evaluation Cost Spiral

warning · cost_management
Updated Jan 1, 2025

Automatic quality evaluation using LLM judges on 100% of production traces creates unsustainable API costs, especially for high-volume GenAI applications with thousands of daily requests.

How to detect:

Monitor LLM API costs, evaluation request volume, judge execution rate, and correlation with production traffic. Track cost per evaluation and total daily evaluation spend.
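The two headline metrics above can be sketched as a small calculation, assuming hypothetical per-token prices and traffic numbers (real figures come from your provider's billing data):

```python
# Hypothetical judge-model prices (USD per token); substitute your provider's rates.
JUDGE_INPUT_PRICE = 2.50 / 1_000_000
JUDGE_OUTPUT_PRICE = 10.00 / 1_000_000

def cost_per_evaluation(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single judge call."""
    return input_tokens * JUDGE_INPUT_PRICE + output_tokens * JUDGE_OUTPUT_PRICE

def daily_evaluation_spend(daily_traces: int, judge_rate: float,
                           avg_in: int, avg_out: int) -> float:
    """Total daily judge spend: traces x judge execution rate x cost per call."""
    return daily_traces * judge_rate * cost_per_evaluation(avg_in, avg_out)

# Assumed workload: 50k traces/day, ~1,500 input / 200 output tokens per judge call.
full = daily_evaluation_spend(50_000, 1.0, 1_500, 200)     # judges on 100% of traces
sampled = daily_evaluation_spend(50_000, 0.1, 1_500, 200)  # same workload at 10% sampling
```

Under these assumed numbers, judging every trace costs $287.50/day while the 10% sample costs $28.75/day, which is why the evaluation spend scales linearly with both production traffic and the judge execution rate.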

Recommended action:

Configure sampling for judges, e.g. 'ScorerSamplingConfig(sample_rate=0.1)', to evaluate only 10% of traces. Use 'filter_string' to target specific traces (e.g., errors, high latency, specific users). Implement tiered sampling: 100% for critical paths, 10% for normal traffic, 1% for high-volume, low-value endpoints.
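The tiered scheme can be sketched as a deterministic sampling gate keyed on the trace ID, so the same trace always gets the same judge/skip decision across retries and replays. The endpoint-to-rate map and trace IDs below are illustrative assumptions; in MLflow itself the per-judge rate is configured through 'ScorerSamplingConfig'.

```python
import hashlib

# Illustrative tier map (assumption): endpoint name -> judge sample rate.
TIER_RATES = {
    "checkout": 1.0,       # critical path: judge every trace
    "chat": 0.1,           # normal traffic: judge 10%
    "autocomplete": 0.01,  # high-volume, low-value: judge 1%
}

def should_judge(trace_id: str, endpoint: str, default_rate: float = 0.1) -> bool:
    """Deterministically decide whether to run LLM judges on a trace.

    Hashing the trace ID maps it to a uniform value in [0, 1); comparing
    that value against the tier's rate samples roughly that fraction of
    traces while keeping the decision stable for a given trace.
    """
    rate = TIER_RATES.get(endpoint, default_rate)
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Critical-path traces are always judged:
should_judge("tr-123", "checkout")  # → True
```

Hash-based gating is preferable to 'random.random()' here because the decision is reproducible: re-running the gate on the same trace (or on a replayed batch) selects exactly the same subset.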