Persistent Volume Performance and I/O Latency
Proactive Health
Diagnose slow disk I/O and storage performance issues affecting stateful workloads.
Prompt: “My database pods on Kubernetes are experiencing slow query times and I suspect it's the persistent volume storage — how can I determine if the disk I/O is the bottleneck and what storage class or configuration would perform better?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When investigating persistent volume performance issues affecting database pods, first rule out simple capacity problems, then look for I/O contention patterns in the actual disk service metrics. Cross-reference with memory pressure and storage class configuration before diving into infrastructure-specific issues like CSI drivers.
1. Rule out disk capacity as the primary issue
Check `kubernetes_filesystem_usage_pct` first — if it's under 80% but you're still seeing slow queries, you're likely dealing with I/O contention, not a full disk. This is a critical distinction because the `disk-i-o-bottleneck-masquerading-as-application-slowness` insight shows that moderate filesystem usage combined with application latency almost always means inadequate IOPS or bandwidth, not lack of space. If the disk is >90% full, address capacity first before continuing.
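The capacity-versus-contention triage above can be sketched as a small decision helper. The thresholds (80% and 90%) come from the text; the function name and return labels are illustrative, not part of any monitoring API:

```python
def classify_disk_symptom(fs_usage_pct: float, slow_queries: bool) -> str:
    """Triage a slow-database symptom: full disk vs. I/O contention.

    fs_usage_pct is the value of kubernetes_filesystem_usage_pct (0-100).
    """
    if fs_usage_pct > 90:
        return "capacity"        # near-full disk: address space first
    if fs_usage_pct < 80 and slow_queries:
        return "io-contention"   # likely inadequate IOPS/bandwidth, not space
    return "inconclusive"        # gather more data before concluding
```

A 60%-full volume with slow queries classifies as `io-contention`, which is the masquerading pattern the insight describes.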
2. Look for disk I/O stalls and operation latency
Examine `kubernetes_diskio_io_service_size_stats` for elevated I/O operation times — anything consistently over 20 seconds indicates severe disk stalls. Correlate these spikes with your database query slowdowns to confirm the I/O bottleneck hypothesis. Container restarts without clear CPU or memory causes are often triggered by I/O timeouts that manifest as generic application failures.
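The stall detection and correlation described above can be sketched over exported metric samples. The 20-second threshold is the one given in the text; the 60-second correlation window and the `(timestamp, value)` sample shape are assumptions for illustration:

```python
def find_io_stalls(samples, threshold_s=20.0):
    """Return timestamps whose I/O service time exceeds the stall threshold.

    `samples` is a list of (timestamp_s, service_time_s) pairs, e.g. as
    exported from kubernetes_diskio_io_service_size_stats.
    """
    return [ts for ts, svc in samples if svc > threshold_s]

def correlate_stalls(stall_ts, slow_query_ts, window_s=60):
    """Map each disk stall to slow queries occurring within window_s of it.

    Non-empty lists support the hypothesis that I/O stalls are driving
    the query slowdowns.
    """
    return {s: [q for q in slow_query_ts if abs(q - s) <= window_s]
            for s in stall_ts}
```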
3. Check for memory pressure causing swap-to-disk
Review `kubernetes_memory_usage` against pod memory limits — if memory is near capacity, the OS may be swapping to disk, which will severely degrade database performance. Look for major page faults in your monitoring; they show the system is using disk as overflow for memory. If memory usage is consistently above 85% of limits, you have a memory problem disguised as a storage problem.
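The swap-pressure check can be expressed as a simple predicate. The 85%-of-limit threshold comes from the text; treating any sustained major-page-fault rate as a swap signal is an assumption — tune `fault_threshold` to your environment:

```python
def memory_pressure(usage_bytes: int, limit_bytes: int,
                    major_faults_per_min: float,
                    fault_threshold: float = 0.0) -> bool:
    """True when the pod looks like a memory problem disguised as storage.

    Combines usage >85% of the limit with a rising major-page-fault rate
    (the OS paging to disk), per the heuristic in the step above.
    """
    usage_pct = 100.0 * usage_bytes / limit_bytes
    return usage_pct > 85 and major_faults_per_min > fault_threshold
```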
4. Audit your storage class and IOPS provisioning
Verify that your PersistentVolumes are backed by storage classes with sufficient IOPS and throughput for database workloads. In AWS, for example, gp2 volumes have IOPS that scale with size (3 IOPS per GB), while gp3 lets you provision IOPS independently — a 100GB gp2 only gets 300 IOPS, which is woefully inadequate for most databases. Check your cloud provider's storage class definitions and ensure they match your workload's I/O requirements (typically 3000+ IOPS for transactional databases).
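The gp2 sizing math above is simple enough to sanity-check programmatically. This sketch encodes the published gp2 baseline formula (3 IOPS/GB, floored at 100, capped at 16,000); note that real gp2 volumes under 1 TiB can also burst to 3,000 IOPS for limited periods, which this baseline figure deliberately ignores:

```python
def gp2_baseline_iops(size_gb: int) -> int:
    """Baseline IOPS for an AWS gp2 volume: 3 IOPS/GB, min 100, max 16,000.

    Burst credits (up to 3,000 IOPS for volumes < 1 TiB) are not modeled.
    """
    return min(max(size_gb * 3, 100), 16000)

def meets_db_target(size_gb: int, target_iops: int = 3000) -> bool:
    """Does the gp2 baseline meet a transactional-database IOPS target?"""
    return gp2_baseline_iops(size_gb) >= target_iops
```

A 100 GB gp2 volume yields a 300 IOPS baseline, confirming the "woefully inadequate" figure in the step; you would need roughly 1 TB of gp2 (or a gp3 volume with independently provisioned IOPS) to sustain 3,000 IOPS.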
5. Analyze disk usage growth trends
Plot `kubernetes_filesystem_usage` over time to see if you're on a trajectory toward capacity exhaustion. Progressive growth means you'll need to expand PVCs or implement data retention policies soon, even if capacity isn't the immediate bottleneck. Rapid growth combined with high I/O latency suggests your storage backend can't keep up with write volume, not just size.
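Projecting the usage trajectory can be done with a least-squares fit over daily samples. This is an illustrative sketch (the `(day, used_bytes)` sample shape is an assumption), useful for turning the `kubernetes_filesystem_usage` plot into a "days until full" estimate:

```python
def days_until_full(samples, capacity_bytes):
    """Estimate days until the volume fills, from (day, used_bytes) samples.

    Fits a least-squares slope (bytes/day) and extrapolates from the last
    sample. Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    num = sum((d - mean_x) * (u - mean_y) for d, u in samples)
    den = sum((d - mean_x) ** 2 for d, _ in samples)
    slope = num / den  # bytes per day
    if slope <= 0:
        return None
    last_day, last_used = samples[-1]
    return (capacity_bytes - last_used) / slope
```

If the estimate is short (days, not months), plan a PVC expansion or retention policy now rather than waiting for the capacity alert.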
6. Verify CSI driver installation (AWS EBS users on K8s 1.23+)
If you're running Kubernetes 1.23 or higher on AWS with EBS-backed volumes, confirm the EBS CSI driver add-on is installed and properly configured. Without it, persistent volume operations will fail or perform unpredictably. This is a common gotcha after cluster upgrades that can cause intermittent storage issues that look like I/O problems but are actually provisioning failures.
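A quick way to script the check above is to look for the driver's pods in `kube-system`. The default names (`ebs-csi-controller-*` from the controller Deployment, `ebs-csi-node-*` from the node DaemonSet) match the official aws-ebs-csi-driver add-on, but verify them against your install; this heuristic only inspects a list of pod names you supply (e.g. from `kubectl get pods -n kube-system`):

```python
def ebs_csi_present(kube_system_pod_names) -> bool:
    """Heuristic: is the AWS EBS CSI driver running?

    Requires both the controller and the per-node pods, under their
    default add-on names.
    """
    has_controller = any(n.startswith("ebs-csi-controller")
                         for n in kube_system_pod_names)
    has_node = any(n.startswith("ebs-csi-node")
                   for n in kube_system_pod_names)
    return has_controller and has_node
```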
Related Insights
- Disk I/O Bottleneck Masquerading as Application Slowness (critical): Application latency increases and container restarts occur due to disk stalls or slow persistent volume performance, but manifest as generic timeouts or OOM kills. The underlying storage bottleneck is hidden by higher-level symptoms.
- Disk Saturation from Vector Growth (critical): Persistent volume usage (kubelet_volume_stats_used_bytes) approaches capacity while collections continue growing, risking write failures and cluster instability.
- Kafka on Kubernetes 1.23+ with AWS EBS requires CSI driver (critical)
- Page Faults Indicate Memory Swapping (critical): Major page faults (process_major_page_faults_total) increase, indicating the OS is swapping vector data to disk and severely degrading query performance.