Network Partition Detection via TCP Metrics
critical
Elevated Net_TCP_Lost together with high Net_TCP_RxQ and Net_TCP_TxQ indicates network congestion or a partial partition between nodes. This destabilizes the cluster and, when combined with cluster manager election failures, can lead to split-brain scenarios.
Watch for Net_TCP_Lost increasing alongside queue buildup in Net_TCP_TxQ and Net_TCP_RxQ. Correlate with the LeaderCheck_Failure and FollowerCheck_Failure metrics to detect cluster manager communication issues. Monitor Net_PacketDropRate4 and Net_PacketDropRate6 for general packet loss over IPv4 and IPv6.
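The correlation described above can be sketched as a small classifier over a window of metric samples. This is a minimal illustration, not a production rule: the Sample type, its field names, the "rising" heuristic, and the returned labels are all assumptions made for the example, and real thresholds would need tuning per cluster.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One scrape of the metrics for a single node (hypothetical field names)."""
    tcp_lost: float                 # Net_TCP_Lost
    tcp_txq: float                  # Net_TCP_TxQ
    tcp_rxq: float                  # Net_TCP_RxQ
    leader_check_failures: float    # LeaderCheck_Failure
    follower_check_failures: float  # FollowerCheck_Failure


def is_rising(values: Sequence[float]) -> bool:
    """True if the series never decreases and ends higher than it starts."""
    return (
        len(values) >= 2
        and all(b >= a for a, b in zip(values, values[1:]))
        and values[-1] > values[0]
    )


def classify(window: Sequence[Sample]) -> str:
    """Return 'ok', 'congestion', or 'suspected_partition' (assumed labels)."""
    loss_rising = is_rising([s.tcp_lost for s in window])
    queues_rising = (
        is_rising([s.tcp_txq for s in window])
        or is_rising([s.tcp_rxq for s in window])
    )
    check_failures = any(
        s.leader_check_failures > 0 or s.follower_check_failures > 0
        for s in window
    )

    # Loss plus queue buildup alone suggests congestion; adding leader or
    # follower check failures suggests nodes can no longer reach each other.
    if loss_rising and queues_rising:
        return "suspected_partition" if check_failures else "congestion"
    return "ok"
```

Requiring both rising loss and rising queues before raising anything keeps a single noisy metric from triggering an alert; the check failures then separate plain congestion from a likely partition.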
Investigate the network infrastructure for congestion, check for a misconfigured MTU causing fragmentation, verify node-to-node connectivity, and review firewall rules. Consider network segmentation or dedicated cluster manager nodes. Enable detailed transport layer logging by raising the log level of the org.opensearch.transport logger.
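One way to raise the transport logger level at runtime is through the cluster settings API, using a logger.<name> setting. The sketch below only builds the HTTP request so it can be inspected before sending; the helper name, the base URL, the DEBUG level, and the choice of a transient setting are assumptions for illustration, and authentication and TLS are omitted.

```python
import json
import urllib.request
from typing import Optional


def build_transport_logging_request(
    base_url: str, level: Optional[str] = "DEBUG"
) -> urllib.request.Request:
    """Build a PUT _cluster/settings request for the transport logger.

    Hypothetical helper: passing level=None sends a JSON null, which asks
    the cluster to reset the setting to its default.
    """
    body = {"transient": {"logger.org.opensearch.transport": level}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

The request can be sent with urllib.request.urlopen(req) against a reachable node. Transport debug logging is verbose, so it is worth resetting the level once the investigation is finished.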