Node NotReady Status and Kubelet Failures
Critical · Incident Response
Diagnose why nodes enter NotReady state, causing pod rescheduling and reduced cluster capacity.
Prompt: “One of my Kubernetes nodes just went NotReady and pods are being rescheduled — how do I figure out if it's a kubelet issue, resource exhaustion, or network problem?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When a Kubernetes node goes NotReady, start by checking for resource saturation (memory and CPU >90%), which is the most common cause. Then verify disk space and I/O aren't causing kubelet timeouts, check kubelet service health directly, and finally investigate network connectivity and clock skew issues that prevent heartbeats to the control plane.
1. Check for memory and CPU resource saturation
Compare `kubernetes_memory_usage` to `kubernetes_memory_capacity` — if the working set is above 90%, or `kubernetes_cpu_usage` exceeds 90% of `kubernetes_cpu_capacity`, you've found your culprit. The `node-notready-from-memory-cpu-saturation` insight tells us that sustained resource pressure causes kubelet and containerd to become unresponsive, typically after 20+ minutes of saturation. This is the single most common reason nodes go NotReady in production clusters, and it's the easiest to spot in your metrics.
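The saturation check above can be sketched as a small shell helper. This is a minimal sketch: it assumes you have already pulled `kubernetes_memory_usage`/`kubernetes_memory_capacity` (or the CPU equivalents) from your metrics backend as plain numbers in the same units; the example values are hypothetical.

```shell
# Flag resource saturation using the >90% threshold from the step above.
# $1 = usage, $2 = capacity (same units, e.g. bytes or millicores).
saturated() {
  local pct=$(( $1 * 100 / $2 ))
  if [ "$pct" -gt 90 ]; then
    echo "saturated (${pct}%)"   # sustained pressure here can knock kubelet over
  else
    echo "ok (${pct}%)"
  fi
}

# Hypothetical working-set vs. capacity values from your metrics backend:
saturated 15200 16000
```

Per the insight, a single spike is rarely fatal; it is sustained saturation (20+ minutes) that makes kubelet and containerd unresponsive, so evaluate this over a window rather than a point-in-time sample.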
2. Verify filesystem usage and disk I/O aren't causing kubelet stalls
Check `kubernetes_filesystem_usage` to ensure the node isn't running out of disk space (under default thresholds, kubelet reports disk pressure at roughly 85% usage), and look at `kubernetes_diskio_io_service_size_stats` for I/O contention. The `disk-i-o-bottleneck-masquerading-as-application-slowness` insight shows that even with moderate disk usage (<80%), severe I/O bottlenecks can cause kubelet operations to time out, which surfaces as NotReady status. If you see disk stalls or operations taking >20 seconds in the logs, this is your issue.
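On the node itself, the same check can be run against the local filesystem. A minimal sketch, assuming `df -P` output and the ~85% disk-pressure threshold from the step above; the mount point to check on a real node would typically be `/` and kubelet's data directory:

```shell
# Print a warning when a mount's used% crosses the ~85% disk-pressure mark.
disk_pressure() {
  local mount="$1"
  local pct
  # Column 5 of `df -P` is used%, e.g. "42%"; strip the % sign.
  pct=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$pct" -ge 85 ]; then
    echo "disk-pressure risk on $mount (${pct}% used)"
  else
    echo "$mount ok (${pct}% used)"
  fi
}

disk_pressure /    # on a node, also check kubelet's data directory
```

Remember that this only catches capacity exhaustion; for the I/O-latency variant of the problem, you still need the `kubernetes_diskio_io_service_size_stats` metric or kubelet log timings.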
3. Check kubelet and containerd service health directly
SSH to the node and run `systemctl status kubelet` and `systemctl status containerd` (or docker, depending on your runtime). Check `journalctl -u kubelet -n 200` for crash loops, OOM kills, certificate errors, or API server connection failures. If kubelet is dead or crash-looping, that's your immediate answer — but you still need to trace back to the root cause, which is usually resource pressure (step 1), disk issues (step 2), or one of the problems below.
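A quick way to triage 200 lines of kubelet journal is to count the failure signatures named above. This is a sketch, not an exhaustive pattern list; the `journalctl` invocation is shown as a comment because it only makes sense on the node:

```shell
# On the node (requires SSH):
#   systemctl status kubelet containerd
#   journalctl -u kubelet -n 200 --no-pager | scan_kubelet_log

# Count log lines matching the usual kubelet failure signatures:
# OOM kills, certificate problems, and API server connection failures.
scan_kubelet_log() {
  grep -ciE 'oom-kill|certificate|x509|connection refused' || true
}

# Illustrative sample input:
printf 'E0101 kubelet: connection refused\nI0101 kubelet: started\n' | scan_kubelet_log
```

A non-zero count tells you *that* kubelet is failing, not *why*; as the step notes, trace the signature back to resource pressure, disk, network, or clock issues.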
4. Investigate network connectivity to the API server
Look at `kubernetes_network_errors` for sustained spikes — network errors prevent kubelet from sending heartbeats to the control plane, causing NotReady after missing the heartbeat timeout (default 40 seconds). From the node, test connectivity directly: `curl -k https://<api-server>:6443/healthz` should return 'ok'. Network partitions or CNI plugin failures are less common but catastrophic when they happen, and they'll leave kubelet healthy but unable to communicate.
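The heartbeat-path check can be scripted with bash's `/dev/tcp` so it needs no extra tooling; on a real node, the `curl .../healthz` probe above remains the more direct test since it exercises TLS and the API server itself, not just the TCP path. Host and port here are placeholders:

```shell
# TCP-level reachability probe toward the API server endpoint.
probe_apiserver() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "reachable -- kubelet heartbeats should get through"
  else
    echo "unreachable -- expect NotReady after the ~40s heartbeat timeout"
  fi
}

probe_apiserver 127.0.0.1 6443   # substitute your API server's address
```

If TCP succeeds but `curl` to `/healthz` fails, suspect TLS or authentication (see the clock-skew step below rather than the network layer).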
5. Check for clock skew breaking TLS certificate validation
If the node's system time differs from the control plane by more than 5 minutes, TLS certificate validation will fail and kubelet can't authenticate to the API server. The `clock-skew-breaking-tls-certificate-validation` insight warns this breaks certificate validity checking silently. Run `date` on the node and compare to the control plane — if there's drift >5 minutes, sync with NTP using `timedatectl set-ntp true` or `ntpdate`. This is rare in cloud environments with managed time sync, but common in on-prem clusters.
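The comparison can be made explicit by diffing epoch timestamps against the 5-minute (300-second) tolerance from the insight. A sketch, assuming you fetch the control-plane time separately (e.g. `ssh control-plane date +%s`):

```shell
# Compare node time to a reference epoch; 300s = the 5-minute TLS tolerance.
clock_skew_ok() {
  local node_epoch="$1" ref_epoch="$2"
  local diff=$(( node_epoch - ref_epoch ))
  if [ "${diff#-}" -le 300 ]; then          # ${diff#-} strips the sign (abs value)
    echo "in sync (${diff}s)"
  else
    echo "skewed (${diff}s) -- run: timedatectl set-ntp true"
  fi
}

clock_skew_ok "$(date +%s)" "$(date +%s)"   # replace 2nd arg with control-plane time
```

Because the failure is silent (kubelet just reports certificate errors), it is worth running this even when the journal points at TLS rather than time.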
6. Look for kernel-level issues and OOM killer activity
Run `dmesg | grep -i 'out of memory\|oom'` and `dmesg | grep -i error` to check if the kernel OOM killer has been terminating processes, including kubelet or containerd. The `memory-pressure-from-working-set-growth` insight notes that OOM kills disrupt node availability even when aggregate metrics look okay. Also check for hardware failures, bad memory modules, or storage device errors in kernel logs — these are rare but can cause intermittent NotReady flapping that's hard to diagnose from metrics alone.
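The two `dmesg` greps above can be folded into one helper. A sketch; on the node you would feed it `dmesg` (or `journalctl -k`, which usually needs no extra privileges), shown as a comment since it only applies there:

```shell
# Count kernel-log lines indicating OOM kills or hardware/storage errors.
scan_kernel_log() {
  grep -icE 'out of memory|oom|i/o error|hardware error' || true
}

# On the node:
#   dmesg | scan_kernel_log          # may require root
#   journalctl -k | scan_kernel_log

# Illustrative sample line:
printf 'Out of memory: Killed process 1234 (kubelet)\n' | scan_kernel_log
```

Correlate any hits with the NotReady timestamps: OOM kills of kubelet or containerd explain flapping that aggregate metrics miss, per the `memory-pressure-from-working-set-growth` insight.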
Related Insights
Node NotReady from Memory/CPU Saturation (critical)
AKS nodes enter NotReady state when memory or CPU saturation causes kubelet and containerd to become unresponsive. This occurs when resource limits are exceeded or PSI metrics indicate sustained pressure.
Clock Skew Breaking TLS Certificate Validation (critical)
Time differences exceeding 5 minutes between control plane and cluster nodes cause TLS validation failures, as nodes may incorrectly determine certificates are expired or not yet valid.
Disk I/O Bottleneck Masquerading as Application Slowness (critical)
Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
Memory Pressure from Working Set Growth (warning)
Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.