Cluster Unavailability Despite DB Console Accessibility
critical
When a cluster loses quorum, ranges become unavailable and queries fail, yet the DB Console and the Prometheus endpoint may remain accessible (served from the unavailable node's cache). Operators can be misled by reachable monitoring that shows stale data while the cluster is actually down, which delays incident response.
Monitor for ranges.unavailable > 0 or for liveness.livenodes dropping below the quorum threshold, that is, below floor(total_nodes/2)+1. Check whether the /health?ready=1 endpoint returns 503. If the DB Console is accessible but crdb_internal queries fail or timeseries data stops updating, the cluster may be down even though monitoring appears functional.
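The detection rules above can be sketched as a small pure function. This is an illustrative sketch only: the function names, the argument names, and the wording of the returned signals are hypothetical, and the metric values are assumed to have been collected elsewhere.

```python
def quorum_threshold(total_nodes: int) -> int:
    """Smallest live-node count that still forms a majority: floor(n/2) + 1."""
    return total_nodes // 2 + 1


def unavailability_signals(ranges_unavailable: int, live_nodes: int,
                           total_nodes: int, ready_status: int) -> list:
    """Return the list of detection rules that currently fire.

    ready_status is the HTTP status code returned by /health?ready=1.
    All names here are hypothetical helpers, not part of any product API.
    """
    signals = []
    if ranges_unavailable > 0:
        signals.append("ranges.unavailable > 0")
    if live_nodes < quorum_threshold(total_nodes):
        signals.append("liveness.livenodes below quorum")
    if ready_status == 503:
        signals.append("/health?ready=1 returned 503")
    return signals
```

For a five-node cluster the threshold is 3, so two live nodes would trigger the liveness rule. An empty list means none of the three rules fired, which on its own does not prove the cluster is healthy if the inputs were stale.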
Configure external monitoring (Prometheus, Datadog) right away to scrape the metrics endpoint periodically, so that historical data is retained during outages. Set up alerting on liveness.livenodes dropping or on ranges.unavailable > 0. When the cluster becomes unavailable, consult the externally stored metrics and the cluster logs to investigate the root cause. Do not rely solely on the DB Console during outages.
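As a rough illustration of keeping metric history outside the cluster, the sketch below scrapes a Prometheus-format endpoint, extracts the two metrics, and appends a timestamped snapshot to a local file. It is not a substitute for Prometheus or Datadog. The helper names and the output file are hypothetical, and the underscore metric names (`ranges_unavailable`, `liveness_livenodes`) are an assumption about how the dotted names appear in the exported text; verify them against your own endpoint.

```python
import json
import time
import urllib.request

# Assumed Prometheus-style spellings of ranges.unavailable and liveness.livenodes.
WANTED = {"ranges_unavailable", "liveness_livenodes"}


def parse_metrics(text, wanted=WANTED):
    """Sum the samples of each wanted metric in Prometheus text format."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # name{labels} value [timestamp]
            name, _, rest = line.partition("{")
            fields = rest.rpartition("}")[2].split()
        else:
            # name value [timestamp]
            parts = line.split()
            name, fields = parts[0], parts[1:]
        if name in wanted and fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


def scrape(url, timeout=5.0):
    """Fetch the raw metrics text; raises on connection errors or timeouts."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def record_snapshot(path, totals, now=None):
    """Append one JSON line so earlier samples survive a later outage."""
    entry = {"ts": time.time() if now is None else now, **totals}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

A scheduler (cron, a systemd timer, or a simple loop) could call `record_snapshot("crdb_metrics.jsonl", parse_metrics(scrape(url)))` at a fixed interval, where `url` is your node's metrics endpoint. Summing across label sets suits per-store gauges such as unavailable ranges; if your endpoint reports live-node counts from several nodes in one scrape, summing would overcount and a maximum may be more appropriate.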