MLflow · OpenAI

LLM Judge Evaluation Cost Spiral

warning · cost_management
Updated Jan 1, 2025

Automatic quality evaluation using LLM judges on 100% of production traces creates unsustainable API costs, especially for high-volume GenAI applications with thousands of daily requests.

How to detect:

Monitor LLM API costs, evaluation request volume, judge execution rate, and correlation with production traffic. Track cost per evaluation and total daily evaluation spend.
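The two headline metrics above can be sketched as a small calculation, assuming hypothetical per-token prices and traffic numbers (real figures come from your provider's billing data):

```python
# Hypothetical judge-model prices (USD per token); substitute your provider's rates.
JUDGE_INPUT_PRICE = 2.50 / 1_000_000
JUDGE_OUTPUT_PRICE = 10.00 / 1_000_000

def cost_per_evaluation(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single judge call."""
    return input_tokens * JUDGE_INPUT_PRICE + output_tokens * JUDGE_OUTPUT_PRICE

def daily_evaluation_spend(daily_traces: int, judge_rate: float,
                           avg_in: int, avg_out: int) -> float:
    """Total daily judge spend: traces x judge execution rate x cost per call."""
    return daily_traces * judge_rate * cost_per_evaluation(avg_in, avg_out)

# Assumed workload: 50k traces/day, ~1,500 input / 200 output tokens per judge call.
full = daily_evaluation_spend(50_000, 1.0, 1_500, 200)     # judges on 100% of traces
sampled = daily_evaluation_spend(50_000, 0.1, 1_500, 200)  # same workload at 10% sampling
```

Under these assumed numbers, judging every trace costs $287.50/day while the 10% sample costs $28.75/day, which is why the evaluation spend scales linearly with both production traffic and the judge execution rate.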

Recommended action:

Configure sampling for judges, e.g. 'ScorerSamplingConfig(sample_rate=0.1)', to evaluate only 10% of traces. Use 'filter_string' to target specific traces (e.g., errors, high latency, specific users). Implement tiered sampling: 100% for critical paths, 10% for normal traffic, 1% for high-volume, low-value endpoints.
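The tiered scheme can be sketched as a deterministic sampling gate keyed on the trace ID, so the same trace always gets the same judge/skip decision across retries and replays. The endpoint-to-rate map and trace IDs below are illustrative assumptions; in MLflow itself the per-judge rate is configured through 'ScorerSamplingConfig'.

```python
import hashlib

# Illustrative tier map (assumption): endpoint name -> judge sample rate.
TIER_RATES = {
    "checkout": 1.0,       # critical path: judge every trace
    "chat": 0.1,           # normal traffic: judge 10%
    "autocomplete": 0.01,  # high-volume, low-value: judge 1%
}

def should_judge(trace_id: str, endpoint: str, default_rate: float = 0.1) -> bool:
    """Deterministically decide whether to run LLM judges on a trace.

    Hashing the trace ID maps it to a uniform value in [0, 1); comparing
    that value against the tier's rate samples roughly that fraction of
    traces while keeping the decision stable for a given trace.
    """
    rate = TIER_RATES.get(endpoint, default_rate)
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Critical-path traces are always judged:
should_judge("tr-123", "checkout")  # → True
```

Hash-based gating is preferable to 'random.random()' here because the decision is reproducible: re-running the gate on the same trace (or on a replayed batch) selects exactly the same subset.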