Persistent Volume Performance and I/O Latency
Proactive Health
Diagnose slow disk I/O and storage performance issues affecting stateful workloads.
Prompt: “My database pods on Kubernetes are experiencing slow query times and I suspect it's the persistent volume storage — how can I determine if the disk I/O is the bottleneck and what storage class or configuration would perform better?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating persistent volume performance issues affecting database pods, first rule out simple capacity problems, then look for I/O contention patterns in the actual disk service metrics. Cross-reference with memory pressure and storage class configuration before diving into infrastructure-specific issues like CSI drivers.
1. Rule out disk capacity as the primary issue
Check `kubernetes_filesystem_usage_pct` first — if it's under 80% but you're still seeing slow queries, you're likely dealing with I/O contention, not a full disk. This is a critical distinction because the `disk-i-o-bottleneck-masquerading-as-application-slowness` insight shows that moderate filesystem usage combined with application latency almost always means inadequate IOPS or bandwidth, not lack of space. If the disk is >90% full, address capacity first before continuing.
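The capacity-versus-contention triage above can be sketched as a small decision helper. The thresholds (80% and 90%) come from the text; the function name and return labels are illustrative, not part of any monitoring API:

```python
def classify_disk_symptom(fs_usage_pct: float, slow_queries: bool) -> str:
    """Triage a slow-database symptom: full disk vs. I/O contention.

    fs_usage_pct is the value of kubernetes_filesystem_usage_pct (0-100).
    """
    if fs_usage_pct > 90:
        return "capacity"        # near-full disk: address space first
    if fs_usage_pct < 80 and slow_queries:
        return "io-contention"   # likely inadequate IOPS/bandwidth, not space
    return "inconclusive"        # gather more data before concluding
```

A 60%-full volume with slow queries classifies as `io-contention`, which is the masquerading pattern the insight describes.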
2. Look for disk I/O stalls and operation latency
Examine `kubernetes_diskio_io_service_size_stats` for elevated I/O operation times — anything consistently over 20 seconds indicates severe disk stalls. Correlate these spikes with your database query slowdowns to confirm the I/O bottleneck hypothesis. Container restarts without clear CPU or memory causes are often triggered by I/O timeouts that manifest as generic application failures.
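The stall detection and correlation described above can be sketched over exported metric samples. The 20-second threshold is the one given in the text; the 60-second correlation window and the `(timestamp, value)` sample shape are assumptions for illustration:

```python
def find_io_stalls(samples, threshold_s=20.0):
    """Return timestamps whose I/O service time exceeds the stall threshold.

    `samples` is a list of (timestamp_s, service_time_s) pairs, e.g. as
    exported from kubernetes_diskio_io_service_size_stats.
    """
    return [ts for ts, svc in samples if svc > threshold_s]

def correlate_stalls(stall_ts, slow_query_ts, window_s=60):
    """Map each disk stall to slow queries occurring within window_s of it.

    Non-empty lists support the hypothesis that I/O stalls are driving
    the query slowdowns.
    """
    return {s: [q for q in slow_query_ts if abs(q - s) <= window_s]
            for s in stall_ts}
```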
3. Check for memory pressure causing swap-to-disk
Review `kubernetes_memory_usage` against pod memory limits — if memory is near capacity, the OS may be swapping to disk, which will severely degrade database performance. Look for major page faults in your monitoring; they show the system is using disk as overflow for memory. If memory usage is consistently above 85% of limits, you have a memory problem disguised as a storage problem.
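The swap-pressure check can be expressed as a simple predicate. The 85%-of-limit threshold comes from the text; treating any sustained major-page-fault rate as a swap signal is an assumption — tune `fault_threshold` to your environment:

```python
def memory_pressure(usage_bytes: int, limit_bytes: int,
                    major_faults_per_min: float,
                    fault_threshold: float = 0.0) -> bool:
    """True when the pod looks like a memory problem disguised as storage.

    Combines usage >85% of the limit with a rising major-page-fault rate
    (the OS paging to disk), per the heuristic in the step above.
    """
    usage_pct = 100.0 * usage_bytes / limit_bytes
    return usage_pct > 85 and major_faults_per_min > fault_threshold
```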
4. Audit your storage class and IOPS provisioning
Verify that your PersistentVolumes are backed by storage classes with sufficient IOPS and throughput for database workloads. In AWS, for example, gp2 volumes have IOPS that scale with size (3 IOPS per GB), while gp3 lets you provision IOPS independently — a 100GB gp2 only gets 300 IOPS, which is woefully inadequate for most databases. Check your cloud provider's storage class definitions and ensure they match your workload's I/O requirements (typically 3000+ IOPS for transactional databases).
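The gp2 sizing math above is simple enough to sanity-check programmatically. This sketch encodes the published gp2 baseline formula (3 IOPS/GB, floored at 100, capped at 16,000); note that real gp2 volumes under 1 TiB can also burst to 3,000 IOPS for limited periods, which this baseline figure deliberately ignores:

```python
def gp2_baseline_iops(size_gb: int) -> int:
    """Baseline IOPS for an AWS gp2 volume: 3 IOPS/GB, min 100, max 16,000.

    Burst credits (up to 3,000 IOPS for volumes < 1 TiB) are not modeled.
    """
    return min(max(size_gb * 3, 100), 16000)

def meets_db_target(size_gb: int, target_iops: int = 3000) -> bool:
    """Does the gp2 baseline meet a transactional-database IOPS target?"""
    return gp2_baseline_iops(size_gb) >= target_iops
```

A 100 GB gp2 volume yields a 300 IOPS baseline, confirming the "woefully inadequate" figure in the step; you would need roughly 1 TB of gp2 (or a gp3 volume with independently provisioned IOPS) to sustain 3,000 IOPS.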
5. Analyze disk usage growth trends
Plot `kubernetes_filesystem_usage` over time to see if you're on a trajectory toward capacity exhaustion. Progressive growth means you'll need to expand PVCs or implement data retention policies soon, even if capacity isn't the immediate bottleneck. Rapid growth combined with high I/O latency suggests your storage backend can't keep up with write volume, not just size.
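Projecting the usage trajectory can be done with a least-squares fit over daily samples. This is an illustrative sketch (the `(day, used_bytes)` sample shape is an assumption), useful for turning the `kubernetes_filesystem_usage` plot into a "days until full" estimate:

```python
def days_until_full(samples, capacity_bytes):
    """Estimate days until the volume fills, from (day, used_bytes) samples.

    Fits a least-squares slope (bytes/day) and extrapolates from the last
    sample. Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    num = sum((d - mean_x) * (u - mean_y) for d, u in samples)
    den = sum((d - mean_x) ** 2 for d, _ in samples)
    slope = num / den  # bytes per day
    if slope <= 0:
        return None
    last_day, last_used = samples[-1]
    return (capacity_bytes - last_used) / slope
```

If the estimate is short (days, not months), plan a PVC expansion or retention policy now rather than waiting for the capacity alert.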
6. Verify CSI driver installation (AWS EBS users on K8s 1.23+)
If you're running Kubernetes 1.23 or higher on AWS with EBS-backed volumes, confirm the EBS CSI driver add-on is installed and properly configured. Without it, persistent volume operations will fail or perform unpredictably. This is a common gotcha after cluster upgrades that can cause intermittent storage issues that look like I/O problems but are actually provisioning failures.
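A quick way to script the check above is to look for the driver's pods in `kube-system`. The default names (`ebs-csi-controller-*` from the controller Deployment, `ebs-csi-node-*` from the node DaemonSet) match the official aws-ebs-csi-driver add-on, but verify them against your install; this heuristic only inspects a list of pod names you supply (e.g. from `kubectl get pods -n kube-system`):

```python
def ebs_csi_present(kube_system_pod_names) -> bool:
    """Heuristic: is the AWS EBS CSI driver running?

    Requires both the controller and the per-node pods, under their
    default add-on names.
    """
    has_controller = any(n.startswith("ebs-csi-controller")
                         for n in kube_system_pod_names)
    has_node = any(n.startswith("ebs-csi-node")
                   for n in kube_system_pod_names)
    return has_controller and has_node
```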
Related Insights
- Disk I/O Bottleneck Masquerading as Application Slowness (critical): Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
- Disk Saturation from Vector Growth (critical): Persistent volume usage (kubelet_volume_stats_used_bytes) approaches capacity while collections continue growing, risking write failures and cluster instability.
- Kafka on Kubernetes 1.23+ with AWS EBS requires CSI driver (critical)
- Page Faults Indicate Memory Swapping (critical): Major page faults (process_major_page_faults_total) increase, indicating the OS is swapping vector data to disk and severely degrading query performance.