Ray insights

Open SourceVersions: [current]135 metrics

Datadog

Ray

Ray Dashboard Agent Unavailability Cascade

critical

Dashboard agent on head node becomes unavailable (connection refused on port 52365), preventing Serve application status checks and creating cascading failures in RayService management.

ray_component_cpu_percentage ray_process_resident_memory ray_grpc_server_requested_finished

docs.ray.io

3mo ago▸

Ray

Ray Serve Backpressure Load Sheddingwarning

Ray Serve deployments reject requests with 503 errors when max_queued_requests limit is exceeded, indicating healthy load shedding but potential capacity issues.

ray_serve_count_http_error_requested ray_grpc_server_requested_handling ray_scheduler_unscheduleable_tasks

docs.ray.io

3mo ago▸

Ray

Ray Memory Limiter Shedding Under Traffic Spikes

critical

OpenTelemetry collector processing Ray telemetry experiences OOMKills and restarts during traffic spikes when memory_limiter processor is not configured or placed incorrectly in pipeline.

ray_serve_count_http_requested ray_scheduler_tasks ray_node_mem_used

one2n.io

4mo ago▸

Ray

Ray Telemetry High-Cardinality Cost Explosion

warning

Attaching Kubernetes pod-level attributes (k8s.pod.id, k8s.node.ip) to Ray metrics dramatically increases cardinality and observability costs, especially in autoscaling environments where pods are ephemeral.

ray_cluster_active_nodes ray_serve_count_http_requested

one2n.io

4mo ago▸

Ray

Ray Client Connection Timeout Stormcritical

Ray client connections repeatedly timeout when connecting to the cluster, indicating either network unreachability, gRPC configuration issues, or cluster head node unavailability.

ray_health_check_rpc_time ray_grpc_server_requested_process_time ray_cluster_active_nodes

github.com

docs.ray.io

1y ago▸