LlamaIndex Agent Workflow Observability Gap
criticalMulti-agent LlamaIndex systems lack visibility into agent handoffs, tool execution timing, and workflow failures without proper instrumentation, leading to slow debugging and degraded production reliability.
Monitor for missing trace data across agent interactions: llama_index.agent.steps.count = 0 when requests are occurring, undefined agent handoff reasons, absence of per-agent latency metrics via llama_index.agent.step.duration, and inability to correlate llama_index.agent.tool.calls to specific agents in multi-agent workflows.
1. Investigate: Check if OpenTelemetry instrumentation is properly initialized and trace context is propagated across async boundaries. Verify agent callbacks are registered. 2. Diagnose: Enable debug logging to confirm agent step execution is actually happening. Check for missing trace IDs in logs. 3. Remediate: Implement structured logging with trace IDs linking agent decisions to outcomes. Configure OpenTelemetry auto-instrumentation or manual spans for each agent operation. 4. Prevent: Establish baseline dashboards showing agent.steps.count and agent.step.duration per workflow type, alerting when zero activity occurs during expected execution windows.