Ray Memory Limiter Shedding Under Traffic Spikes
Severity: critical

An OpenTelemetry Collector processing Ray telemetry experiences OOMKills and restarts during traffic spikes when the memory_limiter processor is missing from the pipeline or placed incorrectly within it.
Monitor for collector pod restarts coinciding with Ray traffic peaks (rising ray_serve_count_http_requested rate). Track collector memory usage as it approaches the pod limit. Correlate Kubernetes OOMKill events with ray_scheduler_tasks and ray_actors metrics showing burst activity.
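The detection signals above can be sketched as Prometheus alert rules. This is an illustrative fragment, not a tested ruleset: the container name `otel-collector`, the kube-state-metrics and cAdvisor metric names (`kube_pod_container_status_last_terminated_reason`, `container_memory_working_set_bytes`, `kube_pod_container_resource_limits`), and the spike threshold are assumptions about the monitoring stack; `ray_serve_count_http_requested` is from this runbook.

```yaml
groups:
  - name: otel-collector-ray
    rules:
      # Collector container was OOMKilled while Ray HTTP traffic is well above
      # its recent baseline (2x the trailing-hour rate is an illustrative threshold).
      - alert: CollectorOOMDuringRaySpike
        expr: |
          (kube_pod_container_status_last_terminated_reason{container="otel-collector", reason="OOMKilled"} == 1)
          and on ()
          (sum(rate(ray_serve_count_http_requested[5m]))
             > 2 * sum(rate(ray_serve_count_http_requested[1h] offset 1h)))
        labels:
          severity: critical
      # Early warning: collector working-set memory above 90% of its pod limit.
      - alert: CollectorMemoryNearLimit
        expr: |
          container_memory_working_set_bytes{container="otel-collector"}
            / on (pod, container)
          kube_pod_container_resource_limits{container="otel-collector", resource="memory"} > 0.9
        for: 5m
        labels:
          severity: warning
```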
Configure the memory_limiter processor as the FIRST step in the OTEL collector pipeline:
- Set limit_mib to 80% of the pod memory limit and spike_limit_mib to 10-20% of limit_mib.
- Set the GOMEMLIMIT environment variable to match limit_mib so the Go runtime's garbage collector cooperates with the limiter.
- Place memory_limiter before all other processors so it can apply backpressure to the receivers.
- Tune check_interval to 1-5s based on traffic patterns; shorter intervals react faster to bursts.
- Accept that data loss during extreme spikes is preferable to collector crashes.
- Verify that the soft limit (limit_mib - spike_limit_mib) triggers before the hard limit forces garbage collection.
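The steps above can be sketched as a collector configuration. This is a minimal example assuming a pod with a 2Gi memory limit; the Prometheus scrape target, the OTLP backend endpoint, and the specific sizing values are illustrative, not prescribed by this runbook:

```yaml
# OTEL collector config sketch for a pod with a 2Gi memory limit.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: ray
          static_configs:
            - targets: ["ray-head:8080"]  # assumed Ray metrics endpoint

processors:
  memory_limiter:         # must run first so backpressure reaches receivers
    check_interval: 1s    # 1-5s; shorter reacts faster to bursts
    limit_mib: 1638       # ~80% of the 2Gi (2048 MiB) pod limit
    spike_limit_mib: 256  # ~15% of limit_mib; soft limit = 1638 - 256 = 1382 MiB
  batch: {}

exporters:
  otlp:
    endpoint: backend:4317  # assumed telemetry backend

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]  # memory_limiter before all others
      exporters: [otlp]
```

In the collector's Deployment or DaemonSet spec, GOMEMLIMIT is then set to match limit_mib, e.g.:

```yaml
env:
  - name: GOMEMLIMIT
    value: "1638MiB"  # matches limit_mib above
```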