HPA Autoscaling Decision and Right-Sizing

Capacity Planning

Determine optimal HPA configuration and whether pods are properly right-sized for effective horizontal autoscaling.

Prompt: I'm setting up HPA for my deployment but not sure if my pods are right-sized or if my target CPU/memory thresholds make sense — can you help me analyze whether HPA will scale effectively or if I need to adjust resource requests first?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When evaluating HPA effectiveness and pod right-sizing, start by confirming resource requests are defined (HPA's foundation), then analyze actual utilization vs requests to identify over/under-provisioning. Next review your HPA target thresholds and workload characteristics before checking for oscillation patterns and stabilization settings.

1. Verify all pods have resource requests defined
Before HPA can work at all, you need explicit CPU and memory requests — HPA computes utilization as a percentage of requests, so without them the metric reads as unknown and no scaling decision can be made. Check `kubernetes_cpu_requested` and `kubernetes_memory_requested` for your deployment. If any pods show zero or undefined requests, HPA cannot scale that workload, and the scheduler loses the information it needs for placement, so you'll also see unpredictable pod distribution across nodes. Start with modest values like 100m CPU and 128Mi memory, then refine based on actual usage.
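As a starting point, a deployment fragment with explicit requests might look like the sketch below (the `my-app` name, image, and values are illustrative assumptions, not taken from your cluster):

```yaml
# Hypothetical deployment fragment: explicit requests let HPA compute
# utilization percentages and let the scheduler reserve capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name
spec:
  template:
    spec:
      containers:
      - name: app
        image: my-app:latest  # hypothetical image
        resources:
          requests:
            cpu: 100m         # modest starting point; refine from observed usage
            memory: 128Mi
          limits:
            memory: 256Mi     # cap memory to contain leaks; leave CPU uncapped if throttling is a concern
```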
2. Analyze current utilization vs requested resources
Compare `kubernetes_cpu_usage` to `kubernetes_cpu_requested` (and the memory equivalents) over a representative traffic period, ideally 7-14 days. If utilization consistently stays below 30-40% of requested resources, your pods are over-provisioned and wasting capacity — HPA will struggle to trigger because you'll never hit reasonable thresholds. If utilization regularly exceeds 80-90%, pods are under-provisioned and you're risking OOM kills or CPU throttling before HPA can react. Aim for 50-70% average utilization under normal load to give HPA room to work.
3. Review HPA target CPU/memory thresholds
Check your HPA configuration's target utilization percentage — 60-70% CPU target (65% is typical) provides headroom for traffic bursts while avoiding wasteful over-provisioning. If you've set targets too high (80-90%), HPA won't scale until pods are already struggling. Too low (30-40%) and you'll scale excessively, wasting money. Also verify min replicas ≥3 for high availability and that max replicas won't exceed cluster capacity. Set scaleUp behavior to 100% per 60s for rapid response to spikes.
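The recommendations above can be sketched as an `autoscaling/v2` manifest — the deployment name and `maxReplicas` are assumptions you'd adjust to your cluster:

```yaml
# Sketch of an HPA targeting 65% CPU with fast scale-up, per the guidance above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical deployment
  minReplicas: 3              # >= 3 for high availability
  maxReplicas: 20             # keep within cluster capacity
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 65   # headroom for bursts without over-provisioning
  behavior:
    scaleUp:
      policies:
      - type: Percent
        value: 100            # allow doubling replicas...
        periodSeconds: 60     # ...every 60s during a spike
```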
4. Identify if workload is I/O-bound vs CPU-bound
For async workloads (FastAPI, Node.js, Go with heavy database/API calls), CPU-only HPA often fails because `kubernetes_cpu_usage` stays low (~40-50%) even when request queues are growing and latency is spiking. If you see low CPU utilization but degraded performance under load, you're likely I/O-bound and need custom metrics like request rate (QPS) or queue depth from Prometheus Adapter. Target 70-80% of your max sustainable QPS as the HPA metric instead of CPU.
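A custom-metric HPA for this case might look like the fragment below, assuming a Prometheus Adapter already exposes a per-pod `http_requests_per_second` metric (the metric name and target value are illustrative):

```yaml
# Sketch: scale on request rate instead of CPU for I/O-bound services.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app              # hypothetical deployment
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # assumed to be served by Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "750"   # ~75% of an assumed 1000 QPS sustainable per pod
```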
5. Check for rapid scale-up/down oscillation
Look at your deployment's replica count history over the past few hours or days. If you see rapid cycling — scaling from 5→10→5→8→6 pods within minutes — you have oscillation from missing stabilization windows or inappropriate thresholds. This causes pod churn, cold starts, cache invalidation, and latency spikes. Monitor Kubernetes events for frequent pod creation/deletion patterns and check if traffic patterns are spiky vs steady to determine if this is configuration or workload-driven.
6. Configure stabilization windows to prevent thrashing
Set HPA v2's `behavior.scaleDown.stabilizationWindowSeconds` to 300 seconds (5 minutes) to prevent rapid scale-down during brief traffic dips. This keeps caches warm and reduces churn during spiky traffic patterns. For scale-up, use `stabilizationWindowSeconds: 0` to react quickly to load increases. The asymmetry makes sense: you want to scale up fast when load hits, but scale down slowly to avoid yo-yoing during variable traffic.
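The asymmetric behavior described above fits into the HPA spec as a `behavior` block; the percentage policies here are illustrative defaults, not values from your configuration:

```yaml
# Behavior fragment for an autoscaling/v2 HPA spec: slow, damped scale-down
# with immediate scale-up.
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # require 5 min of sustained low load before removing pods
    policies:
    - type: Percent
      value: 50                      # remove at most half the current pods...
      periodSeconds: 60              # ...per minute
  scaleUp:
    stabilizationWindowSeconds: 0    # react immediately to load increases
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
```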

Related Insights

HPA baseline misconfiguration causes pod thrashing (warning)
CPU-only HPA fails to scale I/O-bound async FastAPI under load (warning)
Missing HPA stabilization window causes scale-down thrashing (warning)
Kubecost Over-Requested Container Waste (warning): Containers requesting significantly more CPU or memory than they actually use lead to node overprovisioning and wasted cloud spend. Kubecost identifies these inefficiencies through usage vs. request analysis.
Pod CPU and Memory Underutilization Driving Cost Waste (info): Consistently low CPU utilization and memory usage in pods indicates over-provisioned resource requests, leading to wasted node capacity and unnecessary infrastructure costs that can be optimized through right-sizing.
Autoscaler misconfiguration causes rapid pod churn in Kubernetes (warning)
Missing Resource Requests Cause Unpredictable Pod Placement (critical): Pods deployed without CPU/memory requests lead to scheduler misplacement, resource contention, and OOMKilled containers. The scheduler cannot reserve appropriate resources, resulting in too many pods on single nodes and performance degradation.

Monitoring Interfaces

Kubernetes, Datadog