Node NotReady Status and Kubelet Failures
Critical · Incident Response
Diagnose why nodes enter NotReady state, causing pod rescheduling and reduced cluster capacity.
Prompt: “One of my Kubernetes nodes just went NotReady and pods are being rescheduled — how do I figure out if it's a kubelet issue, resource exhaustion, or network problem?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When a Kubernetes node goes NotReady, start by checking for resource saturation (memory and CPU >90%), which is the most common cause. Then verify disk space and I/O aren't causing kubelet timeouts, check kubelet service health directly, and finally investigate network connectivity and clock skew issues that prevent heartbeats to the control plane.
1. Check for memory and CPU resource saturation
Compare `kubernetes_memory_usage` to `kubernetes_memory_capacity` — if the working set is above 90%, or `kubernetes_cpu_usage` exceeds 90% of `kubernetes_cpu_capacity`, you've found your culprit. The `node-notready-from-memory-cpu-saturation` insight tells us that sustained resource pressure causes kubelet and containerd to become unresponsive, typically after 20+ minutes of saturation. This is the single most common reason nodes go NotReady in production clusters, and it's the easiest to spot in your metrics.
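The saturation check above can be sketched as a small shell helper. This is a minimal sketch: it assumes you have already pulled `kubernetes_memory_usage`/`kubernetes_memory_capacity` (or the CPU equivalents) from your metrics backend as plain numbers in the same units; the example values are hypothetical.

```shell
# Flag resource saturation using the >90% threshold from the step above.
# $1 = usage, $2 = capacity (same units, e.g. bytes or millicores).
saturated() {
  local pct=$(( $1 * 100 / $2 ))
  if [ "$pct" -gt 90 ]; then
    echo "saturated (${pct}%)"   # sustained pressure here can knock kubelet over
  else
    echo "ok (${pct}%)"
  fi
}

# Hypothetical working-set vs. capacity values from your metrics backend:
saturated 15200 16000
```

Per the insight, a single spike is rarely fatal; it is sustained saturation (20+ minutes) that makes kubelet and containerd unresponsive, so evaluate this over a window rather than a point-in-time sample.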
2. Verify filesystem usage and disk I/O aren't causing kubelet stalls
Check `kubernetes_filesystem_usage` to ensure the node isn't running out of disk space (under default thresholds, kubelet reports disk pressure at roughly 85% usage), and look at `kubernetes_diskio_io_service_size_stats` for I/O contention. The `disk-i-o-bottleneck-masquerading-as-application-slowness` insight shows that even with moderate disk usage (<80%), severe I/O bottlenecks can cause kubelet operations to time out, which surfaces as NotReady status. If you see disk stalls or operations taking >20 seconds in the logs, this is your issue.
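On the node itself, the same check can be run against the local filesystem. A minimal sketch, assuming `df -P` output and the ~85% disk-pressure threshold from the step above; the mount point to check on a real node would typically be `/` and kubelet's data directory:

```shell
# Print a warning when a mount's used% crosses the ~85% disk-pressure mark.
disk_pressure() {
  local mount="$1"
  local pct
  # Column 5 of `df -P` is used%, e.g. "42%"; strip the % sign.
  pct=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$pct" -ge 85 ]; then
    echo "disk-pressure risk on $mount (${pct}% used)"
  else
    echo "$mount ok (${pct}% used)"
  fi
}

disk_pressure /    # on a node, also check kubelet's data directory
```

Remember that this only catches capacity exhaustion; for the I/O-latency variant of the problem, you still need the `kubernetes_diskio_io_service_size_stats` metric or kubelet log timings.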
3. Check kubelet and containerd service health directly
SSH to the node and run `systemctl status kubelet` and `systemctl status containerd` (or docker, depending on your runtime). Check `journalctl -u kubelet -n 200` for crash loops, OOM kills, certificate errors, or API server connection failures. If kubelet is dead or crash-looping, that's your immediate answer — but you still need to trace back to the root cause, which is usually resource pressure (step 1), disk issues (step 2), or one of the problems below.
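A quick way to triage 200 lines of kubelet journal is to count the failure signatures named above. This is a sketch, not an exhaustive pattern list; the `journalctl` invocation is shown as a comment because it only makes sense on the node:

```shell
# On the node (requires SSH):
#   systemctl status kubelet containerd
#   journalctl -u kubelet -n 200 --no-pager | scan_kubelet_log

# Count log lines matching the usual kubelet failure signatures:
# OOM kills, certificate problems, and API server connection failures.
scan_kubelet_log() {
  grep -ciE 'oom-kill|certificate|x509|connection refused' || true
}

# Illustrative sample input:
printf 'E0101 kubelet: connection refused\nI0101 kubelet: started\n' | scan_kubelet_log
```

A non-zero count tells you *that* kubelet is failing, not *why*; as the step notes, trace the signature back to resource pressure, disk, network, or clock issues.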
4. Investigate network connectivity to the API server
Look at `kubernetes_network_errors` for sustained spikes — network errors prevent kubelet from sending heartbeats to the control plane, causing NotReady after missing the heartbeat timeout (default 40 seconds). From the node, test connectivity directly: `curl -k https://<api-server>:6443/healthz` should return 'ok'. Network partitions or CNI plugin failures are less common but catastrophic when they happen, and they'll leave kubelet healthy but unable to communicate.
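The heartbeat-path check can be scripted with bash's `/dev/tcp` so it needs no extra tooling; on a real node, the `curl .../healthz` probe above remains the more direct test since it exercises TLS and the API server itself, not just the TCP path. Host and port here are placeholders:

```shell
# TCP-level reachability probe toward the API server endpoint.
probe_apiserver() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable -- kubelet heartbeats should get through"
  else
    echo "unreachable -- expect NotReady after the ~40s heartbeat timeout"
  fi
}

probe_apiserver 127.0.0.1 6443   # substitute your API server's address
```

If TCP succeeds but `curl` to `/healthz` fails, suspect TLS or authentication (see the clock-skew step below rather than the network layer).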
5. Check for clock skew breaking TLS certificate validation
If the node's system time differs from the control plane by more than 5 minutes, TLS certificate validation will fail and kubelet can't authenticate to the API server. The `clock-skew-breaking-tls-certificate-validation` insight warns this breaks certificate validity checking silently. Run `date` on the node and compare to the control plane — if there's drift >5 minutes, sync with NTP using `timedatectl set-ntp true` or `ntpdate`. This is rare in cloud environments with managed time sync, but common in on-prem clusters.
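The comparison can be made explicit by diffing epoch timestamps against the 5-minute (300-second) tolerance from the insight. A sketch, assuming you fetch the control-plane time separately (e.g. `ssh control-plane date +%s`):

```shell
# Compare node time to a reference epoch; 300s = the 5-minute TLS tolerance.
clock_skew_ok() {
  local node_epoch="$1" ref_epoch="$2"
  local diff=$(( node_epoch - ref_epoch ))
  if [ "${diff#-}" -le 300 ]; then          # ${diff#-} strips the sign (abs value)
    echo "in sync (${diff}s)"
  else
    echo "skewed (${diff}s) -- run: timedatectl set-ntp true"
  fi
}

clock_skew_ok "$(date +%s)" "$(date +%s)"   # replace 2nd arg with control-plane time
```

Because the failure is silent (kubelet just reports certificate errors), it is worth running this even when the journal points at TLS rather than time.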
6. Look for kernel-level issues and OOM killer activity
Run `dmesg | grep -i 'out of memory\|oom'` and `dmesg | grep -i error` to check if the kernel OOM killer has been terminating processes, including kubelet or containerd. The `memory-pressure-from-working-set-growth` insight notes that OOM kills disrupt node availability even when aggregate metrics look okay. Also check for hardware failures, bad memory modules, or storage device errors in kernel logs — these are rare but can cause intermittent NotReady flapping that's hard to diagnose from metrics alone.
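The two `dmesg` greps above can be folded into one helper. A sketch; on the node you would feed it `dmesg` (or `journalctl -k`, which usually needs no extra privileges), shown as a comment since it only applies there:

```shell
# Count kernel-log lines indicating OOM kills or hardware/storage errors.
scan_kernel_log() {
  grep -icE 'out of memory|oom|i/o error|hardware error' || true
}

# On the node:
#   dmesg | scan_kernel_log          # may require root
#   journalctl -k | scan_kernel_log

# Illustrative sample line:
printf 'Out of memory: Killed process 1234 (kubelet)\n' | scan_kernel_log
```

Correlate any hits with the NotReady timestamps: OOM kills of kubelet or containerd explain flapping that aggregate metrics miss, per the `memory-pressure-from-working-set-growth` insight.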
Related Insights
Node NotReady from Memory/CPU Saturation (critical)
AKS nodes enter NotReady state when memory or CPU saturation causes kubelet and containerd to become unresponsive. This occurs when resource limits are exceeded or PSI metrics indicate sustained pressure.
Clock Skew Breaking TLS Certificate Validation (critical)
Time differences exceeding 5 minutes between control plane and cluster nodes cause TLS validation failures, as nodes may incorrectly determine certificates are expired or not yet valid.
Disk I/O Bottleneck Masquerading as Application Slowness (critical)
Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
Memory Pressure from Working Set Growth (warning)
Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.