Prometheus Metric

crdb_cluster_node

Number of stores in the cluster
Dimensions: None
Knowledge Base (8 documents, 0 chunks)
documentation — Monitoring and Alerting (5624 words, score: 0.95)
The official CockroachDB documentation page covering monitoring and alerting capabilities. It describes built-in monitoring tools including the DB Console, Metrics dashboards, SQL Activity pages, Cluster API, health endpoints, and the Prometheus metrics endpoint. The page also discusses integration with third-party monitoring tools such as Prometheus, Grafana, Alertmanager, Datadog, and DBmarlin for external metric collection and alerting.

reference — CockroachDB | Learn Netdata (2738 words, score: 0.95)
Official Netdata documentation for monitoring CockroachDB servers via Prometheus metrics and SQL statement statistics. Covers comprehensive metric collection from the /_status/vars endpoint, including performance, storage, SQL operations, replication, and real-time query analysis functions.

reference — CockroachDB monitoring and integration with Zabbix (17964 words, score: 0.95)
A comprehensive Zabbix monitoring template for CockroachDB that collects metrics via HTTP agent from Prometheus endpoints. It provides detailed configuration instructions, macro definitions, and an extensive list of monitored items including CPU, memory, disk I/O, GC performance, KV transactions, and cluster health metrics.

blog post — SQL Prober: Black-box monitoring in Managed CockroachDB (3251 words, score: 0.75)
This blog post describes building SQL Prober, an internal black-box monitoring system for Managed CockroachDB. It explains the distinction between white-box metrics (internal CockroachDB metrics) and black-box metrics (customer-facing behavior), and details how SQL Prober uses geo-partitioning and replication zones to monitor cluster health by ensuring queries reach all nodes.

reference — Distributed Dashboard (1311 words, score: 0.95)
Official CockroachDB documentation for the Distributed Dashboard in the DB Console. Provides comprehensive details about monitoring distribution-layer health and performance metrics, including batches, RPCs, KV transactions, and node heartbeat latency with percentile breakdowns.

troubleshooting — Document / implement a procedure for *safely* removing a single problematic CRDB node with MTTR < 1 hour · Issue #73763 · cockroachdb/cockroach · GitHub (949 words, score: 0.72)
GitHub issue discussing operational procedures for safely removing a problematic CockroachDB node with minimal downtime (MTTR < 1 hour). The issue focuses on scenarios where a single node has local storage problems (such as an inverted LSM) that impact SQL reliability and cluster performance, and proposes a drain-stop-decommission procedure to handle such situations.

other — Gossip implementation and node liveness (753 words, score: 0.72)
A Google Groups discussion thread about CockroachDB's gossip protocol implementation and node liveness mechanism. The discussion explains why CockroachDB maintains persistent connections for gossip (to optimize into a spanning tree), and clarifies that while node liveness is authoritatively stored in the KV store with Raft-backed persistence, the information is disseminated through gossip for expediency, with critical operations consulting the KV store directly.

troubleshooting — Understand Hotspots (3879 words, score: 0.75)
Comprehensive documentation on understanding and troubleshooting hotspots in CockroachDB clusters. It defines various types of hotspots (hot nodes, hot ranges, read/write hotspots) and common patterns such as index hotspots caused by monotonically increasing keys. The content focuses on identifying performance bottlenecks that limit horizontal scalability.
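Several of the documents above revolve around scraping CockroachDB's /_status/vars endpoint, which serves metrics in the Prometheus text exposition format. As a rough illustration of what a scraper consumes, the sketch below hand-parses a small sample of that format; the sample payload and the `parse_exposition` helper are illustrative, not taken from any of these documents.

```python
# Minimal, hand-rolled parser for the Prometheus text exposition format,
# the format served by CockroachDB's /_status/vars endpoint.
# The sample payload below is made up for illustration, not real cluster output.

def parse_exposition(text):
    """Return {metric_name: value} for simple gauge/counter lines,
    skipping comments (# HELP / # TYPE) and blank lines."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        # Strip any label set, e.g. metric{store="1"} -> metric
        base = name.split("{", 1)[0]
        metrics[base] = float(value)
    return metrics

sample = """\
# HELP liveness_livenodes Number of live nodes in the cluster
# TYPE liveness_livenodes gauge
liveness_livenodes 3
# HELP ranges_unavailable Number of ranges with fewer live replicas than needed for quorum
# TYPE ranges_unavailable gauge
ranges_unavailable 0
"""

print(parse_exposition(sample))
```

A production scraper (Prometheus, Netdata, the Zabbix HTTP agent) does this parsing itself; the point here is only the shape of the data each of these tools reads.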

Technical Annotations (16)

Configuration Parameters (1)
num_replicas — recommended: 3
Default replication factor; the minimum for zone survival.
Error Signatures (1)
out-of-memory panics (exception)
CLI Commands (3)
cockroach node status --certs-dir=certs --host=node1.example.com:26257 (diagnostic)
allocsim (diagnostic)
zerosum (diagnostic)
Technical References (11)
liveness_livenodes (component)
replica checksum comparisons (concept)
multi-region distribution (component)
automated failover (component)
multi-cloud deployment (concept)
Raft-based replication (protocol)
US-East-1 (component)
ranges.unavailable (component)
quorum (concept)
liveness range (component)
node liveness (concept)
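Several of the annotations above (num_replicas, quorum, ranges.unavailable) are linked by simple majority arithmetic: a range replicated r ways needs a strict majority of live replicas to serve writes, so it tolerates the loss of the rest. A minimal sketch of that arithmetic (the function names are ours, not CockroachDB API):

```python
# Majority-quorum arithmetic behind Raft-based replication: a range with
# `num_replicas` replicas needs a strict majority of them alive to make progress.

def quorum(num_replicas: int) -> int:
    """Live replicas required for a range to remain available."""
    return num_replicas // 2 + 1

def failures_tolerated(num_replicas: int) -> int:
    """Replica losses a range can survive while keeping quorum."""
    return num_replicas - quorum(num_replicas)

for r in (3, 5):
    print(f"num_replicas={r}: quorum={quorum(r)}, tolerates {failures_tolerated(r)} failure(s)")
```

With the recommended num_replicas of 3, quorum is 2: losing one zone's replica leaves the range available, which is why 3 is the minimum for zone survival; losing two replicas makes the range count toward ranges.unavailable.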
Related Insights (12)
Under-replicated ranges risking data availability (critical)

Ranges fall below target replication factor, creating availability risk. Often caused by node failures, decommissioning, or insufficient cluster capacity during rebalancing.

Cluster liveness heartbeat degradation (warning)

Elevated crdb_cluster_liveness_heartbeat_time indicates network issues, CPU starvation, or node health problems that could trigger false node failure detection and unnecessary rebalancing.
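One way to act on this insight is to compare heartbeat latencies against the window within which a liveness record must be renewed. The sketch below is illustrative only: the 4.5 s expiration window and 50% threshold are assumptions for the example, not documented CockroachDB defaults.

```python
# Illustrative check for liveness heartbeat degradation: flag samples of
# crdb_cluster_liveness_heartbeat_time (in seconds) that consume too much of a
# hypothetical liveness-record expiration window. The 4.5 s window and 50%
# threshold are assumptions for this sketch, not CockroachDB defaults.

def degraded_heartbeats(samples_s, expiration_s=4.5, fraction=0.5):
    """Return the latency samples that exceed `fraction` of the window."""
    limit = expiration_s * fraction
    return [s for s in samples_s if s > limit]

recent = [0.05, 0.08, 2.6, 0.04, 3.1]   # made-up latencies in seconds
print(degraded_heartbeats(recent))       # samples above the 2.25 s limit
```

Heartbeats that regularly approach the expiration window are the precursor to the false failure detection and unnecessary rebalancing described above.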

Cluster Unavailability Despite DB Console Accessibility (critical)

When the cluster loses quorum, ranges become unavailable and queries fail, yet the DB Console and Prometheus endpoint may remain accessible (served from an unavailable node's cache). Operators can be misled by reachable monitoring that shows stale data while the cluster is actually down, delaying incident response.
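One mitigation for this blind spot is to judge health from the availability metrics an endpoint reports rather than from its reachability. A sketch under stated assumptions: the metric names follow the annotations above (liveness_livenodes, ranges.unavailable in Prometheus form), while the decision rule and thresholds are ours.

```python
# Cross-check sketch for the "console up, cluster down" failure mode: a
# reachable /_status/vars endpoint is trusted only if the metrics it reports
# are consistent with an available cluster. The decision rule is illustrative.

def cluster_looks_healthy(metrics, expected_nodes):
    """Decide health from scraped gauges rather than endpoint reachability."""
    live = metrics.get("liveness_livenodes", 0)
    unavailable = metrics.get("ranges_unavailable", 0)
    # Require a majority of expected nodes live and no unavailable ranges.
    return live >= (expected_nodes // 2 + 1) and unavailable == 0

# A reachable endpoint can still report an unhealthy cluster:
scraped = {"liveness_livenodes": 1, "ranges_unavailable": 12}
print(cluster_looks_healthy(scraped, expected_nodes=3))  # False
```

Pairing a reachability probe with a metric-content check like this (or with black-box probing in the SQL Prober style summarized earlier) avoids paging on "endpoint up" while ranges are dark.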

Insufficient live nodes risk cluster quorum loss (critical)
Cluster cannot stabilize when sustained at 10+ nodes for two weeks (critical)
Cascading failures propagate across the system when a single component fails (critical)
Single cloud provider outage causes multi-service application downtime (critical)
Live node count mismatch indicates liveness check failure (critical)
Ranges unavailable due to insufficient quorum (critical)
Node liveness heartbeat failures cause cluster instability (critical)
Node count drop indicates dead or unresponsive nodes (critical)
Gossip network health degradation (warning)

A declining crdb_cluster_gossip_infos_received rate indicates gossip network issues that can affect cluster coordination, liveness detection, and metadata propagation. Such a decline often precedes more serious coordination failures.
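Since crdb_cluster_gossip_infos_received is a counter, "declining rate" means its per-second increase over a window is falling, which is what PromQL's rate() computes. A minimal sketch of that calculation (sample values are made up):

```python
# Sketch of detecting a declining gossip rate from counter samples: compute a
# per-second rate over a window, as PromQL's rate() does for
# crdb_cluster_gossip_infos_received. The sample series below are made up.

def per_second_rate(samples):
    """samples: [(unix_ts, counter_value), ...] in time order."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

healthy  = [(0, 0), (60, 1200)]   # 20 infos/s over the window
degraded = [(0, 0), (60, 120)]    # 2 infos/s over the same window
print(per_second_rate(healthy), per_second_rate(degraded))
```

PromQL's rate() additionally corrects for counter resets (e.g. after a node restart); this sketch ignores them and only shows the arithmetic an alert on a falling gossip rate rests on.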