StatefulSet Scaling and Volume Attachment Delays
Incident Response
Diagnose delays in StatefulSet scaling caused by persistent volume attachment and mount issues.
Prompt: “My StatefulSet is stuck during scale-up with pods in Pending state showing 'FailedAttachVolume' or 'FailedMount' — how do I figure out if this is a CSI driver issue, volume attachment limit, or something else?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When StatefulSet pods are stuck in Pending with volume attachment failures, start by examining pod events and verifying the CSI driver is present, then check cloud provider volume attachment limits, which are frequently hit during scale-up. Only after ruling out these common causes should you investigate PVC/PV state issues, storage class configuration, and I/O performance bottlenecks.
1. Examine pod events and verify CSI driver presence
Run `kubectl describe pod <pod-name>` to see the exact error message — it usually points directly to the root cause. If you're on Kubernetes 1.23+ using AWS EBS, GCP PD, or Azure Disk, confirm the CSI driver is installed and running (`kubectl get pods -n kube-system | grep csi`). The `kafka-k8s-123-ebs-csi-driver-required` insight shows that missing CSI drivers are a common blocker on newer K8s versions. Without the CSI driver, volume attachment operations silently fail or time out.
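To triage quickly, warning events can be filtered down to the volume-related ones. A minimal sketch — the helper name and the sample event text below are illustrative; on a live cluster you would pipe in `kubectl get events --field-selector type=Warning -n <namespace>` instead:

```shell
# Filter event output down to volume attach/mount failures.
volume_failures() {
  grep -E 'FailedAttachVolume|FailedMount'
}

# Fabricated sample standing in for real `kubectl get events` output:
sample='2m  Warning  FailedAttachVolume  pod/kafka-2  AttachVolume.Attach failed for volume "pvc-abc"
1m  Normal   Scheduled           pod/kafka-2  Successfully assigned default/kafka-2 to node-1'

printf '%s\n' "$sample" | volume_failures
```

If this returns nothing while pods are still Pending, the blocker is likely scheduling rather than storage, which changes where you look next.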
2. Check volume attachment limits on target nodes
Cloud providers impose per-node volume attachment limits (AWS: typically 39 EBS volumes per instance, varies by instance type; GCP: 128; Azure: varies by VM size). Run `kubectl describe node <node-name>` and look at the Capacity and Allocatable sections for `attachable-volumes-*` entries to see the node's limit. If pods are pending on nodes that are at or near their attachment limit, scale-up will stall regardless of CSI driver health. This is one of the most common causes of FailedAttachVolume during StatefulSet scaling.
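Current usage can be tallied from the cluster's VolumeAttachment objects. A sketch, assuming `kubectl get volumeattachments` output in its default column layout; the per-node limit of 25 is an assumed example, not a real quota for any instance type:

```shell
LIMIT=25   # assumed example limit; check your instance type's actual limit

# Count attached volumes per node and flag nodes at or over the limit.
per_node_attachments() {
  awk -v limit="$LIMIT" 'NR > 1 { count[$4]++ }   # column 4 is NODE
    END {
      for (n in count)
        printf "%s %d%s\n", n, count[n], (count[n] >= limit ? " AT-LIMIT" : "")
    }'
}

# Fabricated sample of `kubectl get volumeattachments` output:
sample='NAME      ATTACHER         PV       NODE     ATTACHED   AGE
csi-aaa   ebs.csi.aws.com  pvc-001  node-1   true       10m
csi-bbb   ebs.csi.aws.com  pvc-002  node-1   true       9m
csi-ccc   ebs.csi.aws.com  pvc-003  node-2   true       8m'

printf '%s\n' "$sample" | per_node_attachments
```

Comparing these counts against the node's `attachable-volumes-*` value tells you whether scale-up is simply waiting for a free attachment slot.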
3. Inspect PVC and PV binding status for stale state
Run `kubectl get pvc -n <namespace>` and `kubectl get pv` to verify all PVCs are Bound to PVs. Check for PVCs stuck in Pending or PVs in Released state that weren't cleaned up. The `postgresql-pvc-stale-credentials` insight shows that when StatefulSets are deleted and recreated, PVCs can retain old state causing authentication or mount failures. If you see a PVC that won't bind, check if a matching PV exists with `claimRef` pointing to a deleted pod.
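Released PVs are easy to spot mechanically. A minimal sketch over `kubectl get pv` output; the sample rows are fabricated:

```shell
# Print PVs stuck in Released state together with the stale claim they
# still reference (column 5 is STATUS, column 6 is CLAIM).
released_pvs() {
  awk '$5 == "Released" { print $1, "->", $6 }'
}

# Fabricated sample of `kubectl get pv` output:
sample='NAME      CAPACITY  ACCESS MODES  RECLAIM POLICY  STATUS    CLAIM          STORAGECLASS
pvc-001   10Gi      RWO           Retain          Released  db/data-db-0   gp3
pvc-002   10Gi      RWO           Delete          Bound     db/data-db-1   gp3'

printf '%s\n' "$sample" | released_pvs
```

Each line it prints is a PV whose `claimRef` still points at a deleted claim; clearing that `claimRef` (or deleting the PV, if the reclaim policy allows) lets a new PVC bind.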
4. Validate storage class and dynamic provisioner configuration
Check if the StorageClass referenced by your PVC exists (`kubectl get storageclass`) and that its provisioner matches your CSI driver (e.g., `ebs.csi.aws.com` for AWS EBS CSI). Review the storage class parameters for appropriate volume type and IOPS settings. The `disk-i-o-bottleneck-masquerading-as-application-slowness` insight reminds us that using low-performance storage types (like AWS gp2 instead of gp3) can cause timeouts during mount operations that manifest as FailedMount errors.
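Looking up a StorageClass's provisioner is a one-line filter. A sketch over `kubectl get storageclass` output; the class names and rows below are fabricated samples:

```shell
# Print the provisioner for a given StorageClass name.
sc_provisioner() {
  awk -v sc="$1" '$1 == sc { print $2 }'
}

# Fabricated sample of `kubectl get storageclass` output:
sample='NAME   PROVISIONER            RECLAIMPOLICY   VOLUMEBINDINGMODE
gp3    ebs.csi.aws.com        Delete          WaitForFirstConsumer
gp2    kubernetes.io/aws-ebs  Delete          Immediate'

printf '%s\n' "$sample" | sc_provisioner gp2
```

A `kubernetes.io/*` provisioner indicates the legacy in-tree plugin, which on 1.23+ depends on CSI migration being in place; a CSI name like `ebs.csi.aws.com` should match an installed driver.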
5. Check for volume mount configuration conflicts
If you're using Helm charts, verify that ConfigMap or emptyDir volumes aren't overriding your persistent volume mounts. The `helm-configmap-volume-overrides-persistence` insight shows how explicit volume mounts to paths like `/data` can prevent PersistentVolumes from being used, causing pods to fail when they expect persistent storage. Review your Helm values.yaml for any `volumes` entries that might conflict with `persistence.enabled` settings.
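The conflict typically looks like the following hypothetical `values.yaml` excerpt — the `extraVolumes`/`extraVolumeMounts` field names vary by chart, so treat them as assumptions:

```yaml
# Hypothetical values.yaml: persistence is enabled, but an explicit emptyDir
# mount shadows the same path, so data written to /data never reaches the
# PersistentVolume.
persistence:
  enabled: true
  mountPath: /data

extraVolumes:
  - name: scratch
    emptyDir: {}
extraVolumeMounts:
  - name: scratch
    mountPath: /data   # conflicts with the persistence mount above
```

Rendering the chart with `helm template` and inspecting the pod spec's `volumeMounts` shows which volume actually wins the path.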
6. Look for I/O performance bottlenecks causing mount timeouts
If volumes are attaching but mount operations are timing out, check `kubernetes_diskio_io_service_size_stats` for abnormally high I/O wait times and `kubernetes_filesystem_usage` to ensure you're not at capacity. Even with moderate filesystem usage (<80%), poor I/O performance can cause mount operations to take >20 seconds and timeout. This is especially common when using under-provisioned storage (insufficient IOPS) or during snapshot recovery operations that block shard activation.
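This pattern — healthy-looking capacity but pathological latency — can be flagged mechanically. A sketch with assumed columns (mount point, usage %, average I/O wait in ms) and illustrative thresholds; the sample values are fabricated, not output from the metrics named above:

```shell
# Flag mounts under the 80% usage threshold whose I/O wait is still high
# enough to stall mount operations. Thresholds are illustrative assumptions.
slow_but_not_full() {
  awk '$2 < 80 && $3 > 200 { print $1, "usage=" $2 "%", "io_wait=" $3 "ms" }'
}

# Fabricated sample: mount point, usage %, avg I/O wait (ms)
sample='/data     62  450
/var/log  71  12'

printf '%s\n' "$sample" | slow_but_not_full
```

Any mount this prints is a candidate for an under-provisioned volume type (e.g. gp2 baseline IOPS) rather than a capacity problem.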
7. Verify node-level scheduling constraints and capacity
Check if pods can schedule to nodes by reviewing taints, tolerations, and node selectors on both the StatefulSet and nodes. Run `kubectl describe node` to check for resource pressure conditions (MemoryPressure, DiskPressure, PIDPressure) that might prevent new pods from scheduling. Also monitor `kubernetes_cpu_usage` and `kubernetes_memory_usage` to ensure nodes have capacity for the new pods — sometimes volume attachment fails because the kubelet is under resource pressure and can't complete the mount operation.
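The pressure conditions can be extracted from the Conditions table of `kubectl describe node`. A minimal sketch; the sample lines below are fabricated:

```shell
# Print any node pressure condition whose Status column is True.
pressure_conditions() {
  grep -E '(Memory|Disk|PID)Pressure' | awk '$2 == "True" { print $1 }'
}

# Fabricated sample of the Conditions rows from `kubectl describe node`:
sample='MemoryPressure   False
DiskPressure     True
PIDPressure      False'

printf '%s\n' "$sample" | pressure_conditions
```

A node under DiskPressure is a double hazard here: it both repels new pods via taints and can slow the kubelet's own mount operations.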
Related Insights
- Kafka on Kubernetes 1.23+ with AWS EBS requires CSI driver (critical)
- Disk I/O Bottleneck Masquerading as Application Slowness (critical): Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but they manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
- Helm chart configMap volume mount prevents certificate persistence (critical)
- Snapshot Recovery Delays Cluster Startup (info): Snapshot recovery operations (snapshot_recovery_running) block shard activation during pod restarts, extending downtime and reducing cluster availability during deployments.
- Network restrictions prevent backup restore data transfer while allowing schema operations (critical)
- Stale PVCs retain outdated credentials after PostgreSQL redeployment (warning)