Langfuse

Agents fail via wrong tool selection, reasoning loops, or hallucinations

Availability: Updated Mar 24, 2026
Technologies: Langfuse
How to detect:

Agents can fail in nuanced ways: selecting the wrong tool, entering reasoning loops, or hallucinating in intermediate steps, producing a final answer that looks plausible but is incorrect. These failures are difficult to detect with final-output-only testing, because the output alone does not reveal how the agent arrived at it.
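As a minimal sketch of why final-output-only testing falls short, consider a recorded trajectory whose final answer is correct but whose tool-call sequence contains a reasoning loop. The trace structure, tool names, and `detect_loop` helper below are hypothetical, not part of the Langfuse API:

```python
from collections import Counter

def detect_loop(tool_calls, max_repeats=2):
    """Flag (tool, input) pairs re-issued more than `max_repeats` times --
    a common symptom of a reasoning loop that output checks never see."""
    counts = Counter((c["tool"], c["input"]) for c in tool_calls)
    return [call for call, n in counts.items() if n > max_repeats]

# A trajectory that yields the right answer but loops internally:
trajectory = [
    {"tool": "search", "input": "capital of France"},
    {"tool": "search", "input": "capital of France"},
    {"tool": "search", "input": "capital of France"},
    {"tool": "answer", "input": "Paris"},
]

print(detect_loop(trajectory))  # [('search', 'capital of France')]
```

A final-response assertion (`answer == "Paris"`) passes on this trace; only inspecting the trajectory exposes the wasted, repeated tool calls.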

Recommended action:

Implement a systematic evaluation approach combining three strategies: (1) Final Response (black-box) for basic correctness, (2) Trajectory (glass-box) to validate the full sequence of tool calls and reasoning steps, and (3) Single Step (white-box) to evaluate individual steps. Inspect traces manually during early development, then add automated LLM-as-a-Judge evaluators for production.
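The three strategies above can be sketched as three evaluator functions operating at different levels of visibility. The trace schema and the keyword-based judge are illustrative assumptions, not the Langfuse API; in production the judge would typically be an LLM-as-a-Judge call:

```python
def eval_final_response(trace, expected_answer):
    """Black-box: check only the final output for basic correctness."""
    return trace["final_answer"] == expected_answer

def eval_trajectory(trace, reference_tools):
    """Glass-box: validate the full sequence of tool calls against a reference."""
    return [step["tool"] for step in trace["steps"]] == reference_tools

def eval_single_step(step, judge):
    """White-box: score one intermediate step with a judge callable
    (here a stand-in for an LLM-as-a-Judge) returning a value in [0, 1]."""
    return judge(step)

# Hypothetical recorded trace:
trace = {
    "final_answer": "Paris",
    "steps": [
        {"tool": "search", "output": "Paris is the capital of France."},
        {"tool": "answer", "output": "Paris"},
    ],
}

# Toy judge: a real setup would prompt an LLM to grade the step.
keyword_judge = lambda step: 1.0 if "Paris" in step["output"] else 0.0

print(eval_final_response(trace, "Paris"))                 # True
print(eval_trajectory(trace, ["search", "answer"]))        # True
print(eval_single_step(trace["steps"][0], keyword_judge))  # 1.0
```

Running all three on every trace catches failures a single layer would miss: a looping agent fails the trajectory check even when its final answer is right, and a hallucinated intermediate step fails the single-step judge even when the trajectory shape matches.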