Ray / Prometheus

Ray Memory Limiter Shedding Under Traffic Spikes

Severity: critical
Category: Resource Contention
Updated Feb 2, 2026

An OpenTelemetry Collector processing Ray telemetry experiences OOMKills and restarts during traffic spikes when the memory_limiter processor is missing or placed incorrectly in the pipeline.

How to detect:

Monitor for collector pod restarts that coincide with Ray traffic peaks (rising ray_serve_count_http_requested rates). Track collector memory usage as it approaches the pod limit. Correlate Kubernetes OOMKill events with ray_scheduler_tasks and ray_actors metrics showing burst activity.
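As a starting point, the detection signals above can be sketched as Prometheus alerting rules. The rule names, thresholds, and `container="otel-collector"` label selectors below are assumptions to adapt to your deployment; kube_pod_container_status_restarts_total and container_memory_working_set_bytes are standard kube-state-metrics/cAdvisor series.

```yaml
# Hypothetical alerting rules; adjust label selectors and thresholds to your deployment.
groups:
  - name: otel-collector-memory
    rules:
      # Collector restarted while Ray Serve request rate was elevated.
      - alert: OtelCollectorRestartDuringRaySpike
        expr: |
          increase(kube_pod_container_status_restarts_total{container="otel-collector"}[10m]) > 0
          and on ()
          sum(rate(ray_serve_count_http_requested[5m])) > 100
        labels:
          severity: critical
        annotations:
          summary: "OTEL collector restarted during a Ray traffic spike"
      # Collector working set approaching its memory limit.
      - alert: OtelCollectorMemoryNearLimit
        expr: |
          container_memory_working_set_bytes{container="otel-collector"}
            / container_spec_memory_limit_bytes{container="otel-collector"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTEL collector memory above 85% of its limit"
```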

Recommended action:

Configure the memory_limiter processor as the FIRST step in the OTEL collector pipeline, with limit_mib set to ~80% of the pod memory limit and spike_limit_mib at 10-20% of limit_mib. Set the GOMEMLIMIT environment variable to match limit_mib so the Go runtime garbage-collects before the hard limit is hit. Placing memory_limiter before all other processors lets it apply backpressure to the receivers. Tune check_interval to 1-5s based on traffic patterns. Accept that dropping data during extreme spikes is preferable to collector crashes. Verify that the soft limit (limit_mib - spike_limit_mib) triggers before the hard limit and forced garbage collection.
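A minimal collector configuration sketch for a pod with a 2 GiB (2048 MiB) memory limit: limit_mib = 80% ≈ 1638, spike_limit_mib ≈ 15% of limit_mib ≈ 245, so the soft limit sits at 1638 - 245 = 1393 MiB. The receiver/exporter names and endpoint are placeholders; check_interval, limit_mib, and spike_limit_mib are the processor's documented settings.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # memory_limiter MUST come first so backpressure reaches the receivers.
  memory_limiter:
    check_interval: 1s        # 1-5s; shorter for spiky traffic
    limit_mib: 1638           # ~80% of the 2048 MiB pod limit
    spike_limit_mib: 245      # ~15% of limit_mib; soft limit = 1638 - 245 = 1393
  batch: {}

exporters:
  otlphttp:
    endpoint: https://backend.example.com   # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter first
      exporters: [otlphttp]
```

In the pod spec, set the matching runtime limit in the container env, e.g. `GOMEMLIMIT: "1638MiB"`, so Go's garbage collector works toward the same budget as the processor.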