Static latency thresholds fail under variable request length, causing false positives

warning

performance

Updated Jan 20, 2026 (via Exa)

Sources

LatencyPrism: Online Non-intrusive Latency Sculpting for ... (arxiv.org)

Technologies:

BentoML

How to detect:

Traditional static-threshold alerting produces excessive false alarms on normal long-text requests and misses performance regressions in short-text requests. Variance in input and output token length spreads normal processing times across a wide distribution, so no single threshold fits both cases.

Recommended action:

Implement workload-aware dynamic baselines that account for input token length, output token length, and KV-cache hit rates. Build theoretical expected duration models based on current batch characteristics rather than static thresholds. Compare actual execution time against workload-adjusted expectations.
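
As a rough illustration of the recommendation above, the following Python sketch compares a static threshold with a workload-adjusted expectation. It is a minimal sketch under stated assumptions, not the method from the cited source: the linear cost model, the per-token constants, the `tolerance` factor, and all names (`Request`, `expected_ms`, `is_anomalous`, `static_alert`) are hypothetical, and the model works per request rather than per batch for brevity.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_tokens: int
    output_tokens: int
    kv_cache_hit_rate: float  # fraction of prompt tokens served from cache, 0.0 to 1.0
    actual_ms: float          # measured end-to-end duration


# Hypothetical cost constants. In practice they would be fitted from recent
# healthy traffic for a specific model and hardware configuration.
FIXED_OVERHEAD_MS = 30.0
PREFILL_MS_PER_TOKEN = 0.5
DECODE_MS_PER_TOKEN = 20.0


def expected_ms(req: Request) -> float:
    """Workload-adjusted expected duration (assumed linear model)."""
    # Only prompt tokens that miss the KV cache need prefill work.
    uncached_prompt = req.input_tokens * (1.0 - req.kv_cache_hit_rate)
    return (FIXED_OVERHEAD_MS
            + PREFILL_MS_PER_TOKEN * uncached_prompt
            + DECODE_MS_PER_TOKEN * req.output_tokens)


def is_anomalous(req: Request, tolerance: float = 1.5) -> bool:
    """Flag a request whose actual time exceeds tolerance x expectation."""
    return req.actual_ms > tolerance * expected_ms(req)


def static_alert(req: Request, threshold_ms: float = 5000.0) -> bool:
    """Traditional fixed threshold, shown for comparison."""
    return req.actual_ms > threshold_ms


# A long request that is behaving normally, and a short one that regressed.
long_ok = Request(input_tokens=8000, output_tokens=400,
                  kv_cache_hit_rate=0.0, actual_ms=12500.0)
short_slow = Request(input_tokens=100, output_tokens=20,
                     kv_cache_hit_rate=0.0, actual_ms=1500.0)

for name, req in [("long_ok", long_ok), ("short_slow", short_slow)]:
    print(name, "static:", static_alert(req), "dynamic:", is_anomalous(req))
```

With these assumed constants, the static threshold fires on the healthy long request and stays silent on the regressed short one, while the workload-adjusted check does the opposite. A production version would presumably derive the expectation from the characteristics of the current batch, as the recommendation describes, rather than from a single request.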