Toil-Causing Architectural Misalignment
Poorly aligned systems require repeated operator intervention; not all toil is procedural.
1. Which services require repeated manual tuning? Identify services requiring repeated manual tuning because autoscaling, resource limits, or dependency settings are poorly aligned with real workload behavior; include the specific misalignment and tuning pattern.
2. Which recurring operator interventions could be eliminated by automation? Correlate recurring operator interventions with predictable telemetry or configuration states; identify which interventions could be eliminated by automation or better system fit.
3. Which workloads repeatedly exceed safe thresholds due to static limits? Identify workloads that repeatedly exceed safe thresholds due to static limits that should be policy-driven or automated.
4. Where are teams manually compensating for missing backpressure or retry control? Identify where teams are compensating manually for missing backpressure, retry control, or quota enforcement across service boundaries.
5. Which systems generate the most operational churn because telemetry hides the true bottleneck? Identify systems that generate the most operational churn because telemetry does not expose the true bottleneck layer; include the observed symptom, the likely hidden cause, and the missing signal.
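
One way to approach the correlation these questions ask for is to count how often the same manual action follows the same observed state on the same service. The sketch below is illustrative only: the `Intervention` record, its field names, the sample data, and the `min_repeats` cutoff are all assumptions made for the example, not part of any particular tool or dataset.

```python
from collections import Counter
from typing import NamedTuple


class Intervention(NamedTuple):
    """One manual operator action (hypothetical record shape)."""
    service: str  # service the operator touched
    action: str   # what was done by hand
    trigger: str  # telemetry or config state seen just before


def automation_candidates(log, min_repeats=3):
    """Return (intervention, count) pairs seen at least min_repeats times.

    Results are ordered by count, highest first, then by record fields.
    """
    counts = Counter(log)
    repeated = [(key, n) for key, n in counts.items() if n >= min_repeats]
    return sorted(repeated, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample data, for illustration only.
sample_log = [
    Intervention("checkout", "raise memory limit", "OOMKilled"),
    Intervention("search", "restart worker", "queue depth high"),
    Intervention("checkout", "raise memory limit", "OOMKilled"),
    Intervention("checkout", "raise memory limit", "OOMKilled"),
]

for record, count in automation_candidates(sample_log):
    print(count, record.service, record.action, record.trigger)
```

A combination that recurs this way, with the same trigger preceding the same manual fix, is a reasonable first candidate for automation or a policy-driven limit. The cutoff of three repeats is arbitrary and would need tuning against real intervention history.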