HPA Autoscaling Decision and Right-Sizing

Capacity Planning

Determine optimal HPA configuration and whether pods are properly right-sized for effective horizontal autoscaling.

Prompt: I'm setting up HPA for my deployment but not sure if my pods are right-sized or if my target CPU/memory thresholds make sense — can you help me analyze whether HPA will scale effectively or if I need to adjust resource requests first?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When evaluating HPA effectiveness and pod right-sizing, start by confirming resource requests are defined (HPA's foundation), then analyze actual utilization vs requests to identify over/under-provisioning. Next review your HPA target thresholds and workload characteristics before checking for oscillation patterns and stabilization settings.

1. Verify all pods have resource requests defined
Before HPA can work at all, you need explicit CPU and memory requests — HPA computes utilization as a percentage of requests, so without them the metric reads as unknown and no scaling decision can be made. Check `kubernetes_cpu_requested` and `kubernetes_memory_requested` for your deployment. If any pods show zero or undefined requests, HPA cannot scale that workload, and the scheduler loses the information it needs for placement, so you'll also see unpredictable pod distribution across nodes. Start with modest values like 100m CPU and 128Mi memory, then refine based on actual usage.
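As a starting point, a deployment fragment with explicit requests might look like the sketch below (the `my-app` name, image, and values are illustrative assumptions, not taken from your cluster):

```yaml
# Hypothetical deployment fragment: explicit requests let HPA compute
# utilization percentages and let the scheduler reserve capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: app
        image: my-app:latest  # hypothetical image
        resources:
          requests:
            cpu: 100m         # modest starting point; refine from observed usage
            memory: 128Mi
          limits:
            memory: 256Mi     # cap memory to contain leaks; leave CPU uncapped if throttling is a concern
```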
2. Analyze current utilization vs requested resources
Compare `kubernetes_cpu_usage` to `kubernetes_cpu_requested` (and the memory equivalents) over a representative traffic period, ideally 7-14 days. If utilization consistently stays below 30-40% of requested resources, your pods are over-provisioned and wasting capacity — HPA will struggle to trigger because you'll never hit reasonable thresholds. If utilization regularly exceeds 80-90%, pods are under-provisioned and you're risking OOM kills or CPU throttling before HPA can react. Aim for 50-70% average utilization under normal load to give HPA room to work.
3. Review HPA target CPU/memory thresholds
Check your HPA configuration's target utilization percentage — 60-70% CPU target (65% is typical) provides headroom for traffic bursts while avoiding wasteful over-provisioning. If you've set targets too high (80-90%), HPA won't scale until pods are already struggling. Too low (30-40%) and you'll scale excessively, wasting money. Also verify min replicas ≥3 for high availability and that max replicas won't exceed cluster capacity. Set scaleUp behavior to 100% per 60s for rapid response to spikes.
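The recommendations above can be sketched as an `autoscaling/v2` manifest — the deployment name and `maxReplicas` are assumptions you'd adjust to your cluster:

```yaml
# Sketch of an HPA targeting 65% CPU with fast scale-up, per the guidance above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical deployment
  minReplicas: 3              # >= 3 for high availability
  maxReplicas: 20             # keep within cluster capacity
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65   # headroom for bursts without over-provisioning
  behavior:
    scaleUp:
      policies:
      - type: Percent
        value: 100            # allow doubling replicas...
        periodSeconds: 60     # ...every 60s during a spike
```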
4. Identify if workload is I/O-bound vs CPU-bound
For async workloads (FastAPI, Node.js, Go with heavy database/API calls), CPU-only HPA often fails because `kubernetes_cpu_usage` stays low (~40-50%) even when request queues are growing and latency is spiking. If you see low CPU utilization but degraded performance under load, you're likely I/O-bound and need custom metrics like request rate (QPS) or queue depth from Prometheus Adapter. Target 70-80% of your max sustainable QPS as the HPA metric instead of CPU.
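A custom-metric HPA for this case might look like the fragment below, assuming a Prometheus Adapter already exposes a per-pod `http_requests_per_second` metric (the metric name and target value are illustrative):

```yaml
# Sketch: scale on request rate instead of CPU for I/O-bound services.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # assumed to be served by Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "750"   # ~75% of an assumed 1000 QPS sustainable per pod
```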
5. Check for rapid scale-up/down oscillation
Look at your deployment's replica count history over the past few hours or days. If you see rapid cycling — scaling from 5→10→5→8→6 pods within minutes — you have oscillation from missing stabilization windows or inappropriate thresholds. This causes pod churn, cold starts, cache invalidation, and latency spikes. Monitor Kubernetes events for frequent pod creation/deletion patterns and check if traffic patterns are spiky vs steady to determine if this is configuration or workload-driven.
6. Configure stabilization windows to prevent thrashing
Set HPA v2's `behavior.scaleDown.stabilizationWindowSeconds` to 300 seconds (5 minutes) to prevent rapid scale-down during brief traffic dips. This keeps caches warm and reduces churn during spiky traffic patterns. For scale-up, use `stabilizationWindowSeconds: 0` to react quickly to load increases. The asymmetry makes sense: you want to scale up fast when load hits, but scale down slowly to avoid yo-yoing during variable traffic.
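The asymmetric behavior described above fits into the HPA spec as a `behavior` block; the percentage policies here are illustrative defaults, not values from your configuration:

```yaml
# Behavior fragment for an autoscaling/v2 HPA spec: slow, damped scale-down
# with immediate scale-up.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # require 5 min of sustained low load before removing pods
    policies:
    - type: Percent
      value: 50                      # remove at most half the current pods...
      periodSeconds: 60              # ...per minute
  scaleUp:
    stabilizationWindowSeconds: 0    # react immediately to load increases
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
```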

Related Insights

HPA baseline misconfiguration causes pod thrashing (warning)
CPU-only HPA fails to scale I/O-bound async FastAPI under load (warning)
Missing HPA stabilization window causes scale-down thrashing (warning)
Kubecost Over-Requested Container Waste (warning): Containers requesting significantly more CPU or memory than they actually use lead to node overprovisioning and wasted cloud spend. Kubecost identifies these inefficiencies through usage vs. request analysis.
Pod CPU and Memory Underutilization Driving Cost Waste (info): Consistently low CPU utilization and memory usage in pods indicates over-provisioned resource requests, leading to wasted node capacity and unnecessary infrastructure costs that can be optimized through right-sizing.
Autoscaler misconfiguration causes rapid pod churn in Kubernetes (warning)
Missing Resource Requests Cause Unpredictable Pod Placement (critical): Pods deployed without CPU/memory requests lead to scheduler misplacement, resource contention, and OOMKilled containers. The scheduler cannot reserve appropriate resources, resulting in too many pods on single nodes and performance degradation.

Monitoring Interfaces

Kubernetes, Datadog