The dashboard agent on the head node becomes unavailable (connection refused on port 52365), which blocks Serve application status checks and causes cascading failures in RayService management.
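A quick way to confirm the symptom is to probe the port directly and see whether the connection is refused or merely slow. The sketch below is a minimal, standard-library check; the function name `probe_tcp` and the head-node hostname in the usage note are hypothetical, and the default port is the one quoted in the entry above.

```python
import socket


def probe_tcp(host: str, port: int = 52365, timeout: float = 2.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout", or "error" (any other OS-level failure,
    for example a name that does not resolve).
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: often a network or firewall problem.
        return "timeout"
    except OSError:
        return "error"
```

For example, `probe_tcp("raycluster-head-svc")` (a placeholder service name) returning "refused" would suggest the agent process is down while the pod is still reachable, whereas "timeout" points more toward a network-level problem.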
Ray Serve deployments reject requests with 503 errors when the max_queued_requests limit is exceeded. This is load shedding working as intended, but sustained rejections can signal insufficient capacity.
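The load-shedding behaviour can be illustrated with a toy model. This is not Ray Serve's implementation, only a sketch of a bounded queue under the assumption that a request either starts immediately, waits in a bounded queue, or is rejected; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyReplicaQueue:
    """Toy model of a deployment with a cap on running and queued requests."""

    max_ongoing: int  # requests that may run at the same time (assumed)
    max_queued: int   # requests that may wait before shedding starts (assumed)
    ongoing: int = 0
    queued: int = 0

    def submit(self) -> str:
        """Admit one request and report what happened to it."""
        if self.ongoing < self.max_ongoing:
            self.ongoing += 1
            return "running"
        if self.queued < self.max_queued:
            self.queued += 1
            return "queued"
        # Both limits are full: shed the request instead of queueing it.
        return "rejected_503"


# A burst of 10 requests against 2 running slots and 3 queue slots.
queue = ToyReplicaQueue(max_ongoing=2, max_queued=3)
outcomes = [queue.submit() for _ in range(10)]
print(outcomes.count("rejected_503"))  # prints 5
```

In this model an occasional rejection during a burst is harmless, while a rejection rate that stays high over time suggests the running and queue limits (or the replica count) are too small for the offered load.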
The OpenTelemetry collector processing Ray telemetry is OOMKilled and restarts during traffic spikes when the memory_limiter processor is not configured or is placed incorrectly in the pipeline (it should be the first processor).
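The placement rule can be checked mechanically. The sketch below assumes the collector configuration has already been parsed into a Python dict (for example from its YAML file) with the usual `service.pipelines.<name>.processors` layout; the function name `check_memory_limiter` is hypothetical.

```python
def check_memory_limiter(config: dict) -> list[str]:
    """Return one message per pipeline where memory_limiter is missing or not first."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for name, pipeline in pipelines.items():
        processors = pipeline.get("processors", [])
        # Component IDs may be "memory_limiter" or "memory_limiter/<instance>".
        positions = [
            i for i, proc in enumerate(processors)
            if proc.split("/")[0] == "memory_limiter"
        ]
        if not positions:
            problems.append(f"{name}: memory_limiter missing")
        elif positions[0] != 0:
            problems.append(
                f"{name}: memory_limiter at position {positions[0]}, expected first"
            )
    return problems


example = {
    "service": {
        "pipelines": {
            "metrics": {"processors": ["batch", "memory_limiter"]},
            "traces": {"processors": ["memory_limiter", "batch"]},
            "logs": {"processors": ["batch"]},
        }
    }
}
for problem in check_memory_limiter(example):
    print(problem)
```

With the example config this flags the metrics pipeline (limiter after batch) and the logs pipeline (no limiter) and leaves the traces pipeline alone.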
Attaching Kubernetes pod-level attributes (k8s.pod.id, k8s.node.ip) to Ray metrics dramatically increases cardinality and observability costs, especially in autoscaling environments where pods are ephemeral and every replacement pod introduces a new set of time series.
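A rough back-of-the-envelope estimate shows why pod churn matters. The sketch below is an upper-bound approximation that assumes every pod emits every metric and label combination; all numbers and function names are hypothetical and only illustrate the arithmetic.

```python
def distinct_pods_over_window(
    steady_pods: int, replacements_per_hour: float, window_hours: float
) -> int:
    """Pods alive at any one time plus pods replaced during the retention window."""
    return steady_pods + int(replacements_per_hour * window_hours)


def estimated_series(
    metric_names: int, label_combos_per_metric: int, distinct_pods: int = 1
) -> int:
    """Upper bound on time series: each pod identity multiplies the base count."""
    return metric_names * label_combos_per_metric * distinct_pods


# Assumed workload: 50 metrics, 20 label combinations each,
# 10 pods at steady state, 5 pod replacements per hour, 24 h retention.
base = estimated_series(50, 20)
pods = distinct_pods_over_window(10, 5, 24)
with_pod_labels = estimated_series(50, 20, pods)
print(base, pods, with_pod_labels)  # prints 1000 130 130000
```

Under these assumptions the pod-level attributes turn 1,000 series into as many as 130,000 over a day, most of them belonging to pods that no longer exist.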
Ray client connections repeatedly time out when connecting to the cluster, indicating network unreachability, gRPC configuration issues, or an unavailable head node.