StatefulSet Scaling and Volume Attachment Delays
Incident Response
Diagnose delays in StatefulSet scaling caused by persistent volume attachment and mount issues.
Prompt: “My StatefulSet is stuck during scale-up with pods in Pending state showing 'FailedAttachVolume' or 'FailedMount' — how do I figure out if this is a CSI driver issue, volume attachment limit, or something else?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When StatefulSet pods are stuck in Pending with volume attachment failures, start by examining pod events and verifying the CSI driver is present, then check cloud provider volume attachment limits, which are frequently hit during scale-up. Only after ruling out these common causes should you investigate PVC/PV state issues, storage class configuration, and I/O performance bottlenecks.
1. Examine pod events and verify CSI driver presence
Run `kubectl describe pod <pod-name>` to see the exact error message — it usually points directly to the root cause. If you're on Kubernetes 1.23+ using AWS EBS, GCP PD, or Azure Disk, confirm the CSI driver is installed and running (`kubectl get pods -n kube-system | grep csi`). The `kafka-k8s-123-ebs-csi-driver-required` insight shows that missing CSI drivers are a common blocker on newer K8s versions. Without the CSI driver, volume attachment operations silently fail or time out.
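To triage quickly, warning events can be filtered down to the volume-related ones. A minimal sketch — the helper name and the sample event text below are illustrative; on a live cluster you would pipe in `kubectl get events --field-selector type=Warning -n <namespace>` instead:

```shell
# Filter event output down to volume attach/mount failures.
volume_failures() {
  grep -E 'FailedAttachVolume|FailedMount'
}

# Fabricated sample standing in for real `kubectl get events` output:
sample='2m  Warning  FailedAttachVolume  pod/kafka-2  AttachVolume.Attach failed for volume "pvc-abc"
1m  Normal   Scheduled           pod/kafka-2  Successfully assigned default/kafka-2 to node-1'

printf '%s\n' "$sample" | volume_failures
```

If this returns nothing while pods are still Pending, the blocker is likely scheduling rather than storage, which changes where you look next.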
2. Check volume attachment limits on target nodes
Cloud providers impose per-node volume attachment limits (AWS: typically 39 EBS volumes per instance, varies by instance type; GCP: 128; Azure: varies by VM size). Run `kubectl describe node <node-name>` and look at the Capacity and Allocatable sections for `attachable-volumes-*` entries to see the node's limit. If pods are pending on nodes that are at or near their attachment limit, scale-up will stall regardless of CSI driver health. This is one of the most common causes of FailedAttachVolume during StatefulSet scaling.
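Current usage can be tallied from the cluster's VolumeAttachment objects. A sketch, assuming `kubectl get volumeattachments` output in its default column layout; the per-node limit of 25 is an assumed example, not a real quota for any instance type:

```shell
LIMIT=25   # assumed example limit; check your instance type's actual limit

# Count attached volumes per node and flag nodes at or over the limit.
per_node_attachments() {
  awk -v limit="$LIMIT" 'NR > 1 { count[$4]++ }   # column 4 is NODE
    END {
      for (n in count)
        printf "%s %d%s\n", n, count[n], (count[n] >= limit ? " AT-LIMIT" : "")
    }'
}

# Fabricated sample of `kubectl get volumeattachments` output:
sample='NAME      ATTACHER         PV       NODE     ATTACHED   AGE
csi-aaa   ebs.csi.aws.com  pvc-001  node-1   true       10m
csi-bbb   ebs.csi.aws.com  pvc-002  node-1   true       9m
csi-ccc   ebs.csi.aws.com  pvc-003  node-2   true       8m'

printf '%s\n' "$sample" | per_node_attachments
```

Comparing these counts against the node's `attachable-volumes-*` value tells you whether scale-up is simply waiting for a free attachment slot.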
3. Inspect PVC and PV binding status for stale state
Run `kubectl get pvc -n <namespace>` and `kubectl get pv` to verify all PVCs are Bound to PVs. Check for PVCs stuck in Pending or PVs in Released state that weren't cleaned up. The `postgresql-pvc-stale-credentials` insight shows that when StatefulSets are deleted and recreated, PVCs can retain old state causing authentication or mount failures. If you see a PVC that won't bind, check if a matching PV exists with `claimRef` pointing to a deleted pod.
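Released PVs are easy to spot mechanically. A minimal sketch over `kubectl get pv` output; the sample rows are fabricated:

```shell
# Print PVs stuck in Released state together with the stale claim they
# still reference (column 5 is STATUS, column 6 is CLAIM).
released_pvs() {
  awk '$5 == "Released" { print $1, "->", $6 }'
}

# Fabricated sample of `kubectl get pv` output:
sample='NAME      CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS    CLAIM          STORAGECLASS
pvc-001   10Gi      RWO           Retain          Released  db/data-db-0   gp3
pvc-002   10Gi      RWO           Delete          Bound     db/data-db-1   gp3'

printf '%s\n' "$sample" | released_pvs
```

Each line it prints is a PV whose `claimRef` still points at a deleted claim; clearing that `claimRef` (or deleting the PV, if the reclaim policy allows) lets a new PVC bind.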
4. Validate storage class and dynamic provisioner configuration
Check if the StorageClass referenced by your PVC exists (`kubectl get storageclass`) and that its provisioner matches your CSI driver (e.g., `ebs.csi.aws.com` for AWS EBS CSI). Review the storage class parameters for appropriate volume type and IOPS settings. The `disk-i-o-bottleneck-masquerading-as-application-slowness` insight reminds us that using low-performance storage types (like AWS gp2 instead of gp3) can cause timeouts during mount operations that manifest as FailedMount errors.
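Looking up a StorageClass's provisioner is a one-line filter. A sketch over `kubectl get storageclass` output; the class names and rows below are fabricated samples:

```shell
# Print the provisioner for a given StorageClass name.
sc_provisioner() {
  awk -v sc="$1" '$1 == sc { print $2 }'
}

# Fabricated sample of `kubectl get storageclass` output:
sample='NAME   PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE
gp3    ebs.csi.aws.com        Delete          WaitForFirstConsumer
gp2    kubernetes.io/aws-ebs  Delete          Immediate'

printf '%s\n' "$sample" | sc_provisioner gp2
```

A `kubernetes.io/*` provisioner indicates the legacy in-tree plugin, which on 1.23+ depends on CSI migration being in place; a CSI name like `ebs.csi.aws.com` should match an installed driver.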
5. Check for volume mount configuration conflicts
If you're using Helm charts, verify that ConfigMap or emptyDir volumes aren't overriding your persistent volume mounts. The `helm-configmap-volume-overrides-persistence` insight shows how explicit volume mounts to paths like `/data` can prevent PersistentVolumes from being used, causing pods to fail when they expect persistent storage. Review your Helm values.yaml for any `volumes` entries that might conflict with `persistence.enabled` settings.
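The conflict typically looks like the following hypothetical `values.yaml` excerpt — the `extraVolumes`/`extraVolumeMounts` field names vary by chart, so treat them as assumptions:

```yaml
# Hypothetical values.yaml: persistence is enabled, but an explicit emptyDir
# mount shadows the same path, so data written to /data never reaches the
# PersistentVolume.
persistence:
  enabled: true
  mountPath: /data

extraVolumes:
  - name: scratch
    emptyDir: {}
extraVolumeMounts:
  - name: scratch
    mountPath: /data   # conflicts with the persistence mount above
```

Rendering the chart with `helm template` and inspecting the pod spec's `volumeMounts` shows which volume actually wins the path.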
6. Look for I/O performance bottlenecks causing mount timeouts
If volumes are attaching but mount operations are timing out, check `kubernetes_diskio_io_service_size_stats` for abnormally high I/O wait times and `kubernetes_filesystem_usage` to ensure you're not at capacity. Even with moderate filesystem usage (<80%), poor I/O performance can cause mount operations to take >20 seconds and timeout. This is especially common when using under-provisioned storage (insufficient IOPS) or during snapshot recovery operations that block shard activation.
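This pattern — healthy-looking capacity but pathological latency — can be flagged mechanically. A sketch with assumed columns (mount point, usage %, average I/O wait in ms) and illustrative thresholds; the sample values are fabricated, not output from the metrics named above:

```shell
# Flag mounts under the 80% usage threshold whose I/O wait is still high
# enough to stall mount operations. Thresholds are illustrative assumptions.
slow_but_not_full() {
  awk '$2 < 80 && $3 > 200 { print $1, "usage=" $2 "%", "io_wait=" $3 "ms" }'
}

# Fabricated sample: mount point, usage %, avg I/O wait (ms)
sample='/data     62  450
/var/log  71  12'

printf '%s\n' "$sample" | slow_but_not_full
```

Any mount this prints is a candidate for an under-provisioned volume type (e.g. gp2 baseline IOPS) rather than a capacity problem.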
7. Verify node-level scheduling constraints and capacity
Check if pods can schedule to nodes by reviewing taints, tolerations, and node selectors on both the StatefulSet and nodes. Run `kubectl describe node` to check for resource pressure conditions (MemoryPressure, DiskPressure, PIDPressure) that might prevent new pods from scheduling. Also monitor `kubernetes_cpu_usage` and `kubernetes_memory_usage` to ensure nodes have capacity for the new pods — sometimes volume attachment fails because the kubelet is under resource pressure and can't complete the mount operation.
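The pressure conditions can be extracted from the Conditions table of `kubectl describe node`. A minimal sketch; the sample lines below are fabricated:

```shell
# Print any node pressure condition whose Status column is True.
pressure_conditions() {
  grep -E '(Memory|Disk|PID)Pressure' | awk '$2 == "True" { print $1 }'
}

# Fabricated sample of the Conditions rows from `kubectl describe node`:
sample='MemoryPressure   False
DiskPressure     True
PIDPressure      False'

printf '%s\n' "$sample" | pressure_conditions
```

A node under DiskPressure is a double hazard here: it both repels new pods via taints and can slow the kubelet's own mount operations.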
Related Insights
- Kafka on Kubernetes 1.23+ with AWS EBS requires CSI driver (critical)
- Disk I/O Bottleneck Masquerading as Application Slowness (critical): Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but they manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
- Helm chart configMap volume mount prevents certificate persistence (critical)
- Snapshot Recovery Delays Cluster Startup (info): Snapshot recovery operations (snapshot_recovery_running) block shard activation during pod restarts, extending downtime and reducing cluster availability during deployments.
- Network restrictions prevent backup restore data transfer while allowing schema operations (critical)
- Stale PVCs retain outdated credentials after PostgreSQL redeployment (warning)