Ray

Ray Client Connection Timeout Storm

critical
Connection ManagementUpdated Jun 12, 2025

Ray client connections repeatedly timeout when connecting to the cluster, indicating either network unreachability, gRPC configuration issues, or cluster head node unavailability.

How to detect:

Monitor for repeated ConnectionError exceptions with 'ray client connection timeout' messages. Track ray_health_check_rpc_time and ray_grpc_server_requested_process_time metrics showing elevated latency or failures. Check for warnings about Ray Client port unreachability.

Recommended action:

First verify network connectivity and firewall rules allow access to Ray Client port (default 10001) and GCS port (6379). Check that ray_cluster_active_nodes shows head node as active. If using Kubernetes, verify NetworkPolicy allows traffic to dashboard agent port 52365. Implement exponential backoff with jitter in client retry logic, with max 5 retries and backoff_factor of 1. Verify Ray head node dashboard and GCS services are running via ray_component_cpu_percentage and ray_component_rss metrics.