NVIDIA Networking

RAS Network Partition Indicates Node Failure

reliability

The NCCL RAS (Reliability, Availability, Serviceability) subsystem detects when processes become unreachable over the RAS keep-alive network. In a distributed training job, this indicates a node crash, a hang, or a network partition.
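
To illustrate the general idea of keep-alive based detection, here is a minimal sketch in Python. It is not NCCL's implementation: the `KeepAliveMonitor` class, the peer names, and the 5-second timeout are all hypothetical, chosen only to show how a peer that stops sending keep-alives ends up flagged as unreachable.

```python
from dataclasses import dataclass, field


@dataclass
class KeepAliveMonitor:
    """Tracks the last keep-alive seen from each peer (hypothetical helper)."""

    timeout_s: float  # assumed threshold for illustration, not an NCCL default
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, peer: str, now: float) -> None:
        """Note that a keep-alive from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def unreachable(self, now: float) -> list[str]:
        """Return peers whose last keep-alive is older than the timeout."""
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Two hypothetical peers start out healthy; only node-a keeps reporting.
monitor = KeepAliveMonitor(timeout_s=5.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=4.0)

# At t=6.0, node-b has been silent for longer than the timeout.
silent = monitor.unreachable(now=6.0)
```

A silent peer alone does not say which failure occurred: a crashed node, a hung process, and a partitioned network all look the same from the observer's side, which is why the condition is reported as an indicator rather than a diagnosis.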
