Ray / Prometheus

Ray Memory Limiter Shedding Under Traffic Spikes

Severity: critical
Category: Resource Contention
Updated Feb 2, 2026

An OpenTelemetry Collector processing Ray telemetry experiences OOMKills and restarts during traffic spikes when the memory_limiter processor is missing or placed incorrectly in the pipeline.

How to detect:

Monitor for collector pod restarts that coincide with Ray traffic peaks (rising ray_serve_count_http_requested rates). Track collector memory usage as it approaches the pod limit. Correlate Kubernetes OOMKill events with ray_scheduler_tasks and ray_actors metrics showing burst activity.
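As a starting point, the detection signals above can be sketched as Prometheus alerting rules. The rule names, thresholds, and `container="otel-collector"` label selectors below are assumptions to adapt to your deployment; kube_pod_container_status_restarts_total and container_memory_working_set_bytes are standard kube-state-metrics/cAdvisor series.

```yaml
# Hypothetical alerting rules; adjust label selectors and thresholds to your deployment.
groups:
  - name: otel-collector-memory
    rules:
      # Collector restarted while Ray Serve request rate was elevated.
      - alert: OtelCollectorRestartDuringRaySpike
        expr: |
          increase(kube_pod_container_status_restarts_total{container="otel-collector"}[10m]) > 0
          and on ()
          sum(rate(ray_serve_count_http_requested[5m])) > 100
        labels:
          severity: critical
        annotations:
          summary: "OTEL collector restarted during a Ray traffic spike"
      # Collector working set approaching its memory limit.
      - alert: OtelCollectorMemoryNearLimit
        expr: |
          container_memory_working_set_bytes{container="otel-collector"}
            / container_spec_memory_limit_bytes{container="otel-collector"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTEL collector memory above 85% of its limit"
```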

Recommended action:

Configure the memory_limiter processor as the FIRST step in the OTEL collector pipeline, with limit_mib set to ~80% of the pod memory limit and spike_limit_mib at 10-20% of limit_mib. Set the GOMEMLIMIT environment variable to match limit_mib so the Go runtime garbage-collects before the hard limit is hit. Placing memory_limiter before all other processors lets it apply backpressure to the receivers. Tune check_interval to 1-5s based on traffic patterns. Accept that dropping data during extreme spikes is preferable to collector crashes. Verify that the soft limit (limit_mib - spike_limit_mib) triggers before the hard limit and forced garbage collection.
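A minimal collector configuration sketch for a pod with a 2 GiB (2048 MiB) memory limit: limit_mib = 80% ≈ 1638, spike_limit_mib ≈ 15% of limit_mib ≈ 245, so the soft limit sits at 1638 - 245 = 1393 MiB. The receiver/exporter names and endpoint are placeholders; check_interval, limit_mib, and spike_limit_mib are the processor's documented settings.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  # memory_limiter MUST come first so backpressure reaches the receivers.
  memory_limiter:
    check_interval: 1s        # 1-5s; shorter for spiky traffic
    limit_mib: 1638           # ~80% of the 2048 MiB pod limit
    spike_limit_mib: 245      # ~15% of limit_mib; soft limit = 1638 - 245 = 1393
  batch: {}

exporters:
  otlphttp:
    endpoint: https://backend.example.com   # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter first
      exporters: [otlphttp]
```

In the pod spec, set the matching runtime limit in the container env, e.g. `GOMEMLIMIT: "1638MiB"`, so Go's garbage collector works toward the same budget as the processor.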