CockroachDB insights
Source Available. Versions: [current]. 44 metrics.
When a cluster loses quorum, ranges become unavailable and queries fail, yet the DB Console and the Prometheus endpoint may remain accessible (served from an unavailable node's cache). Operators can be misled by monitoring that is reachable but showing stale data while the cluster is actually down, which delays incident response.
A node is critical when its failure would make one or more replicas unavailable. Detecting critical nodes before terminating them prevents data loss and service disruption. The condition is indicated by the /_status/critical_nodes endpoint returning a non-empty criticalNodes array.
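The critical-node check described above can be sketched as a small pre-termination gate. This is a minimal illustration, not CockroachDB's own tooling: the helper names (critical_node_ids, safe_to_terminate, fetch_critical_nodes) are hypothetical, the use of POST and the nodeId field inside each criticalNodes entry are assumptions about the response shape, and authentication for secure clusters is not handled.

```python
import json
import urllib.request


def critical_node_ids(payload: dict) -> list:
    """Return the IDs listed under `criticalNodes`, or [] if there are none."""
    nodes = payload.get("criticalNodes") or []
    # Assumption: each entry is an object carrying a `nodeId` field.
    # If the shape differs, the raw entry is returned unchanged.
    return [n.get("nodeId", n) if isinstance(n, dict) else n for n in nodes]


def safe_to_terminate(payload: dict) -> bool:
    """True only when the endpoint reports no critical nodes."""
    return len(critical_node_ids(payload)) == 0


def fetch_critical_nodes(base_url: str, timeout: float = 5.0) -> dict:
    """Call <base_url>/_status/critical_nodes and return the parsed JSON.

    Assumption: the endpoint accepts an empty POST. Secure clusters will
    also need a session cookie or similar credentials, omitted here.
    """
    url = base_url.rstrip("/") + "/_status/critical_nodes"
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

A drain or decommission script could call safe_to_terminate(fetch_critical_nodes(base_url)) and refuse to proceed when it returns False.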
CockroachDB ranges with fewer live replicas than quorum requires (reported through cockroachdb.ranges_replication_problem as unavailable ranges) indicate impending data unavailability. This is the critical pre-failure signal: it appears before queries start failing because of lost quorum.
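One way to act on this signal is to scan a Prometheus text-format scrape for a non-zero unavailable-range count. The sketch below is an assumption-laden illustration: sum_metric and has_unavailable_ranges are hypothetical helpers, and the default metric name ranges_unavailable is assumed to be how the node-level scrape exposes the count. Substitute whatever name your monitoring pipeline actually uses.

```python
def sum_metric(exposition_text: str, metric_name: str) -> float:
    """Sum all samples of `metric_name` in Prometheus text-format output."""
    total = 0.0
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            # Labelled sample: name{label="v"} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Unlabelled sample: name value [timestamp]
            name, _, rest = line.partition(" ")
        if name != metric_name:
            continue
        fields = rest.split()
        if fields:
            total += float(fields[0])
    return total


def has_unavailable_ranges(exposition_text: str,
                           metric_name: str = "ranges_unavailable") -> bool:
    """True when the summed metric is above zero across all stores."""
    return sum_metric(exposition_text, metric_name) > 0
```

Because a scrape can be served while the cluster is already down, a check like this is best paired with a freshness test on the scrape itself rather than trusted on its own.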