Ray Dashboard Agent Unavailability Cascade
critical
The dashboard agent on the head node becomes unavailable (connection refused on port 52365), which prevents Serve application status checks and causes cascading failures in RayService management.
Monitor for 'connection refused' errors when querying http://{HEAD_IP}:52365/api/serve/applications/. Track ray_component_cpu_percentage and ray_process_resident_memory and alert when the series for the dashboard agent process disappears. Check for gaps in the ray_grpc_server_requested_finished time series, which indicate a service interruption.
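The endpoint check above can be sketched as a small probe that separates a refused connection (agent process not listening) from other failures such as timeouts or HTTP errors. This is a minimal sketch using only the Python standard library; the function names and the returned status strings are hypothetical and chosen for illustration, not part of Ray or KubeRay.

```python
import urllib.error
import urllib.request


def serve_status_url(head_ip, port=52365):
    """Build the Serve applications URL on the dashboard agent port."""
    return f"http://{head_ip}:{port}/api/serve/applications/"


def classify_probe_error(exc):
    """Map an exception raised by the probe to a coarse status string.

    The status strings are illustrative labels, not Ray-defined values.
    """
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http_{exc.code}"
    if isinstance(exc, urllib.error.URLError):
        # urlopen wraps the underlying socket error in URLError.reason.
        if isinstance(exc.reason, ConnectionRefusedError):
            return "connection_refused"
        return "unreachable"
    return "unreachable"


def probe_dashboard_agent(head_ip, port=52365, timeout=2.0):
    """Query the agent once and return 'ok' or a failure label."""
    url = serve_status_url(head_ip, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else f"http_{resp.status}"
    except OSError as exc:  # URLError, HTTPError, and timeouts derive from OSError
        return classify_probe_error(exc)
```

A result of "connection_refused" points at the agent process itself being down, while "unreachable" is more consistent with a network path problem such as a blocking NetworkPolicy.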
If the dashboard agent process has crashed, check /tmp/ray/session_latest/logs/dashboard_agent.log for the root cause. Verify that Kubernetes NetworkPolicy rules allow traffic to port 52365. Implement process-level health checks and automatic restart for the dashboard agent. Configure the KubeRay operator to retry status checks with exponential backoff (at least 5 attempts). If the dashboard agent crashes repeatedly, check ray_component_rss and ray_component_uss for memory issues, and validate that runtime_env dependencies are correctly specified.
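The retry pattern recommended above can be illustrated with a generic exponential-backoff wrapper. This is a sketch of the technique for an external status checker, not KubeRay operator configuration; the function name, the default delays, and the injectable sleep parameter are assumptions made for the example.

```python
import time


def retry_with_backoff(check, attempts=5, base=1.0, factor=2.0, cap=30.0,
                       sleep=time.sleep):
    """Call check() until it succeeds or attempts are exhausted.

    Waits base * factor**n seconds (capped at cap) after the n-th failure.
    Only OSError is retried, which covers refused connections and timeouts.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return check()
        except OSError as exc:
            last_error = exc
            # No sleep after the final attempt; the error is re-raised instead.
            if attempt < attempts - 1:
                sleep(min(cap, base * factor ** attempt))
    raise last_error
```

With the defaults, five attempts wait 1, 2, 4, and 8 seconds between tries, so a brief agent restart is tolerated before the check is reported as failed.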