CrewAI

Agent Task Success Rate Degradation

warning
reliabilityUpdated Dec 19, 2025

Non-deterministic LLM behavior and tool failures can cause agent task success rates to drift downward over time without obvious errors, impacting output quality and user satisfaction.

How to detect:

Track task completion success/failure rates per agent and per task type. Monitor for degradation trends (rolling 7-day success rate dropping below baseline). Alert when success rate falls below SLO threshold (e.g., <95%). Correlate with LLM model changes, prompt modifications, or tool availability issues.

Recommended action:

Implement continuous evaluation using LLM-as-Judge or human review for quality validation. Create benchmark test suites to detect regressions. Monitor prompt effectiveness and iterate on prompt engineering. Implement retry logic with exponential backoff for transient failures. Set up regression tests to block deployments when quality drops.