Which resources are significantly over-provisioned relative to peak utilization?
Identify resources where peak utilization consistently stays well below provisioned capacity; include instance type, observed utilization range, and right-sizing recommendation.
I'll help you analyze resource utilization and identify cost optimization opportunities across your GKE, Cloud SQL, and ElastiCache infrastructure. Here's a systematic approach:
1. Kubernetes Resource Analysis
Check Current Resource Requests vs Usage
# Sum actual CPU/memory usage across all containers (requires metrics-server)
kubectl top pods --all-namespaces --containers --no-headers | \
  awk '{cpu+=$4; mem+=$5} END {print "Total CPU usage:", cpu"m"; print "Total Memory usage:", mem"Mi"}'
# Compare with resource requests
kubectl get pods --all-namespaces -o custom-columns=\
"NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory"
Install and Use Vertical Pod Autoscaler (VPA) for Recommendations
# vpa-recommender.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: your-app
  updatePolicy:
    updateMode: "Off"  # Just recommendations, no auto-scaling
  resourcePolicy:
    containerPolicies:
    - containerName: '*'
      controlledResources: ["cpu", "memory"]
# Get VPA recommendations
kubectl get vpa app-vpa -o yaml | grep -A 10 recommendation
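For a terser view than the grep above, the recommendation can also be read straight from the VPA status with jsonpath; a sketch assuming the recommender has already populated targets:
# Pull just the recommended CPU/memory targets per container
kubectl get vpa app-vpa -o jsonpath='{.status.recommendation.containerRecommendations[*].target}'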
Node Utilization Analysis
# Check node resource utilization
kubectl top nodes
# Get detailed node capacity vs allocation
kubectl describe nodes | grep -A 5 "Allocated resources"
# Check for node over-provisioning
kubectl get nodes -o custom-columns="NAME:.metadata.name,CPU_ALLOCATABLE:.status.allocatable.cpu,MEMORY_ALLOCATABLE:.status.allocatable.memory"
2. GKE Cluster Optimization
Enable Cluster Autoscaler and Check Configuration
# Check current autoscaler settings
gcloud container clusters describe CLUSTER_NAME --zone=ZONE | grep -A 10 autoscaling
# Update autoscaler if needed
gcloud container clusters update CLUSTER_NAME \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--zone=ZONE
Node Pool Analysis
# List node pools and their configurations
gcloud container node-pools list --cluster=CLUSTER_NAME --zone=ZONE
# Check machine types in use (older clusters may still expose the beta.kubernetes.io label instead)
kubectl get nodes -o custom-columns="NAME:.metadata.name,INSTANCE_TYPE:.metadata.labels.node\.kubernetes\.io/instance-type"
3. Cloud SQL PostgreSQL Analysis
Check Current Usage Metrics
# Check the provisioned tier (compare it against the utilization metrics below)
gcloud sql instances describe INSTANCE_NAME --format="value(settings.tier)"
# Review recent operations (maintenance, restarts, resizes)
gcloud sql operations list --instance=INSTANCE_NAME --limit=10
Monitor Key Metrics
-- Connect to your PostgreSQL instance and run these queries:
-- Check connection usage
SELECT count(*) as active_connections
FROM pg_stat_activity
WHERE state = 'active';
-- Check database sizes
SELECT datname, pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Check slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are total_exec_time / mean_exec_time)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
Cloud SQL Recommendations Script
#!/bin/bash
# cloudsql-analysis.sh
INSTANCE_NAME="your-instance"
PROJECT_ID="your-project"
echo "=== Cloud SQL Instance Analysis ==="
echo "Current tier:"
gcloud sql instances describe $INSTANCE_NAME --format="value(settings.tier)"
echo -e "\nCPU utilization (avg over 7 days):"
gcloud monitoring timeseries list \
--filter="resource.type=\"gce_instance\" AND metric.type=\"cloudsql.googleapis.com/database/cpu/utilization\"" \
--interval-end-time=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
--interval-start-time=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
--project=$PROJECT_ID
echo -e "\nMemory utilization:"
gcloud monitoring timeseries list \
--filter="resource.type=\"gce_instance\" AND metric.type=\"cloudsql.googleapis.com/database/memory/utilization\"" \
--interval-end-time=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
--interval-start-time=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ) \
--project=$PROJECT_ID
4. ElastiCache Redis Analysis
ElastiCache is an AWS service while your cluster runs on GKE (Google Cloud), so here are checks for both ElastiCache and its GCP equivalent, Memorystore:
For AWS ElastiCache:
# Check cache cluster utilization
aws elasticache describe-cache-clusters --show-cache-node-info
# Get CloudWatch metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ElastiCache \
--metric-name CPUUtilization \
--dimensions Name=CacheClusterId,Value=your-cluster-id \
--statistics Average \
--start-time $(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 3600
For Google Cloud Memorystore (Redis):
# Check Redis instance details
gcloud redis instances describe INSTANCE_NAME --region=REGION
# Monitor the memory usage ratio (same approach: query the Monitoring API directly)
curl -s --get -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries" \
  --data-urlencode 'filter=resource.type="redis_instance" AND metric.type="redis.googleapis.com/stats/memory/usage_ratio"' \
  --data-urlencode "interval.startTime=$(date -u -d '7 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
5. Comprehensive Cost Analysis Script
#!/bin/bash
# resource-optimization.sh
echo "=== Kubernetes Resource Optimization Report ==="
echo "Date: $(date)"
echo
echo "1. Node Utilization:"
kubectl top nodes
echo
echo "2. Pod Resource Requests vs Actual Usage:"
kubectl get pods --all-namespaces -o custom-columns=\
"NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_ACTUAL:.status.containerStatuses[*].usage.cpu,MEM_ACTUAL:.status.containerStatuses[*].usage.memory"
echo
echo "3. Pods without resource requests (potential waste):"
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.spec.containers[].resources.requests == null) | "\(.metadata.namespace)/\(.metadata.name)"'
echo
echo "4. High CPU/Memory consumers:"
kubectl top pods --all-namespaces --sort-by=cpu | head -10
echo
kubectl top pods --all-namespaces --sort-by=memory | head -10
6. Key Optimization Recommendations
Immediate Actions:
- Set appropriate resource requests/limits:
resources:
  requests:
    cpu: "100m"       # Based on actual usage (e.g. P75)
    memory: "128Mi"
  limits:
    cpu: "500m"       # Hard cap; leave headroom above observed peaks
    memory: "256Mi"
- Enable Horizontal Pod Autoscaler:
kubectl autoscale deployment your-app --cpu-percent=70 --min=2 --max=10
- Use spot/preemptible instances:
gcloud container node-pools create spot-pool \
--cluster=CLUSTER_NAME \
--preemptible \
--machine-type=e2-medium \
--zone=ZONE
Monitoring Setup:
# prometheus-rules.yaml (kube-state-metrics v2 metric names; v1 exposed *_cpu_cores instead)
groups:
- name: resource-usage
  rules:
  - alert: HighResourceRequest
    expr: |
      (
        sum(kube_pod_container_resource_requests{resource="cpu"}) /
        sum(kube_node_status_allocatable{resource="cpu"})
      ) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU requests above 80% of allocatable cluster CPU"
Run this analysis weekly and adjust resources based on actual usage patterns. Target 70-80% resource utilization for optimal cost-performance balance.
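To keep that weekly cadence, the report script from section 5 can simply be scheduled; a minimal cron sketch in which the script path and log location are assumptions to adapt:
# m h dom mon dow  command — run every Monday at 07:00 and keep the output
0 7 * * 1 /opt/scripts/resource-optimization.sh > /var/log/resource-report-$(date +\%F).log 2>&1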
With the data-gathering commands above in hand, here's a systematic way to interpret the results, identify over-provisioning, and reduce costs across all three components of your stack:
Kubernetes Cost Optimization Analysis
🎯 Step 1: Measure Overall Cluster Efficiency
Start by calculating your cluster efficiency ratio to quantify the scope of waste:
# Get cluster-wide efficiency metrics
kubectl top nodes
kubectl top pods --all-namespaces
# Calculate usage-to-request ratios
# CPU efficiency = kubernetes_cpu_usage ÷ kubernetes_cpu_requested
# Memory efficiency = kubernetes_memory_usage ÷ kubernetes_memory_requested
🚨 Red Flag: If your usage-to-request ratio is below 50%, you have significant overprovisioning driving up costs.
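To put an actual number on that ratio, a rough cluster-wide CPU calculation can be stitched together from the commands above; this is a sketch that assumes metrics-server and jq are installed and that kubectl top reports CPU in millicores:
# Cluster-wide CPU usage vs. requests (memory can be computed the same way)
used_m=$(kubectl top pods --all-namespaces --no-headers | awk '{gsub("m","",$3); used+=$3} END {print used}')
req_m=$(kubectl get pods --all-namespaces -o json | jq '[.items[].spec.containers[].resources.requests.cpu // "0"
  | if test("m$") then (sub("m$";"") | tonumber) else (tonumber * 1000) end] | add')
awk -v u="$used_m" -v r="$req_m" 'BEGIN {printf "CPU usage-to-request ratio: %.0f%%\n", 100*u/r}'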
Step 2: Quick Wins - Find Abandoned Resources
These represent pure waste (typically 20-30% of cloud spend):
- Deployments with zero replicas
- Unattached PVCs
- Idle LoadBalancer services
- Unused persistent volumes
# Find zero-replica deployments
kubectl get deployments --all-namespaces -o json | jq '.items[] | select(.spec.replicas == 0) | .metadata.name'
# Find PVCs that never bound (Pending) or lost their volume
kubectl get pvc --all-namespaces --no-headers | awk '$3 != "Bound"'
# Find persistent volumes no longer backing any claim
kubectl get pv --no-headers | awk '$5 == "Released" || $5 == "Available"'
# List LoadBalancer services to review for idleness (see the sketch below)
kubectl get svc --all-namespaces -o wide | grep LoadBalancer
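For the idle LoadBalancer check specifically, here is a sketch that flags services with no ready endpoints behind them (assumes jq; confirm against traffic metrics before deleting anything):
kubectl get svc --all-namespaces -o json | jq -r '.items[] | select(.spec.type=="LoadBalancer") | "\(.metadata.namespace) \(.metadata.name)"' |
while read -r ns name; do
  eps=$(kubectl get endpoints "$name" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  [ -z "$eps" ] && echo "Idle LoadBalancer candidate: $ns/$name"
done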
Step 3: Identify Worst Container Offenders
Focus on containers using <30% of their requested resources, prioritizing high-request containers:
# Monitor resource utilization
kubectl top pods --containers --all-namespaces
# Look for patterns in resource requests vs usage
kubectl describe nodes | grep -A 5 "Allocated resources"
Step 4: Check for Resource Reservation Issues
Critical Pattern: Nodes showing:
- requested capacity (kubernetes_cpu_requested + kubernetes_memory_requested) above 80% of allocatable
- but actual usage (kubernetes_cpu_usage + kubernetes_memory_usage) below 50%
This causes new pods to fail scheduling despite plenty of actual capacity, forcing unnecessary cluster scaling.
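To confirm the pattern, compare each node's reserved ("Allocated resources") figures against live usage side by side:
# High allocation percentages plus low top-node utilization = inflated requests, not real demand
kubectl describe nodes | grep -A 8 "Allocated resources"
kubectl top nodes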
Cloud SQL PostgreSQL Optimization
Step 1: Connection Overhead Analysis
⚠️ Major Cost Driver: Each PostgreSQL connection consumes 5-10MB RAM before any queries execute.
-- Check current connections
SELECT count(*) FROM pg_stat_activity;
-- Analyze connection states
SELECT state, COUNT(*) AS connection_count
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
GROUP BY state;
Key Insight: If you're running >200 connections, that's 1-2GB RAM consumed just for connection overhead. You need PgBouncer connection pooling, not a bigger instance.
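A minimal PgBouncer sketch for transaction-level pooling; the host, pool sizes, and paths below are illustrative assumptions to adapt (on GKE this typically runs as a sidecar or alongside the Cloud SQL Auth Proxy):
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction        ; reuse server connections between transactions
default_pool_size = 20         ; server connections per database/user pair
max_client_conn = 500          ; app-side connections multiplexed onto the pool
EOF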
Step 2: Cache Hit Ratio Baseline
-- Calculate cache hit ratio (should be >99%)
SELECT
round(100.0 * sum(blks_hit) / (sum(blks_hit) + sum(blks_read)), 2) as cache_hit_ratio
FROM pg_stat_database;
If 95-98%: Your instance is under-provisioned on memory, not over-provisioned. The 85% CPU spikes are from I/O waits.
Step 3: Check for Memory Tuning Issues
-- Check for temp file usage (indicates work_mem too low)
SELECT datname, temp_files, temp_bytes
FROM pg_stat_database
WHERE temp_files > 0;
Non-zero temp files = queries spilling to disk = CPU spikes from unnecessary I/O work.
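Where temp files show up, raising work_mem through a Cloud SQL database flag is usually cheaper than resizing the instance. The value below is an illustrative assumption (Cloud SQL takes work_mem in kB), and note that --database-flags replaces the entire flag set, so include any flags already configured:
# Inspect current flags, then raise work_mem (16384 kB = 16MB; adjust to your workload)
gcloud sql instances describe INSTANCE_NAME --format="value(settings.databaseFlags)"
gcloud sql instances patch INSTANCE_NAME --database-flags=work_mem=16384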
Step 4: Downsize Safety Criteria
You can safely downsize one instance tier if:
- Cache hit ratio >99%
- temp_files near zero over 7 days
- Active connections <100
- CPU spikes don't exceed 85% for >5 consecutive minutes
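If all four criteria hold, the downsize itself is one patch command; the target tier below is only an example, and changing tiers restarts the instance, so schedule it in a low-traffic window:
gcloud sql instances patch INSTANCE_NAME --tier=db-custom-4-16384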
ElastiCache Redis Optimization
Step 1: Memory Fragmentation Check
# Check fragmentation ratio via Redis CLI
redis-cli INFO memory | grep mem_fragmentation_ratio
Critical: If ratio >1.5, your 45% memory usage might actually represent only 30% effective memory due to fragmentation.
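Active defragmentation can claw back that lost headroom without resizing. A sketch for Redis you can reach directly (4.0+ built with jemalloc); on ElastiCache the equivalent setting goes through a parameter group, and on Memorystore it may not be user-tunable:
redis-cli CONFIG SET activedefrag yes
redis-cli INFO memory | grep active_defrag_running   # 1 while a defrag cycle is in progress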
Step 2: Workload Pattern Analysis
# Check ops/sec patterns
redis-cli INFO stats | grep instantaneous_ops_per_sec
# Monitor command distribution
redis-cli INFO commandstats
Decision Matrix:
- <50K ops/sec + mostly GET/SET: You're probably overpaying for memory-optimized instances (r6g) when balanced instances (m6g/m7g) would work
- >100K ops/sec + complex commands: CPU becomes the bottleneck; might need compute-optimized
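If the matrix points to a family change, ElastiCache can apply it in place on a replication group. The IDs and node type below are placeholders, and --apply-immediately triggers rolling node replacements:
aws elasticache modify-replication-group \
  --replication-group-id your-redis-group \
  --cache-node-type cache.m6g.large \
  --apply-immediately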
Step 3: Memory Headroom Assessment
Rule of thumb: Don't downsize unless your 95th percentile memory usage stays below 65% of maxmemory with 20% headroom for growth.
# Check peak memory patterns
redis-cli INFO memory | grep used_memory_peak
redis-cli CONFIG GET maxmemory
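For the 95th-percentile check on ElastiCache, CloudWatch extended statistics give the number directly (cluster ID is a placeholder; on Memorystore, look at the usage_ratio metric queried earlier instead):
aws cloudwatch get-metric-statistics \
  --namespace AWS/ElastiCache \
  --metric-name DatabaseMemoryUsagePercentage \
  --dimensions Name=CacheClusterId,Value=your-cluster-id \
  --extended-statistics p95 \
  --start-time $(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%S) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
  --period 3600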
💰 Cost Optimization Action Plan
Immediate Actions (0-1 week)
- Clean up abandoned resources (20-30% cost reduction potential)
- Deploy PgBouncer for PostgreSQL connection pooling
- Enable Redis active defragmentation if fragmentation >1.5
Short-term (1-4 weeks)
- Right-size Kubernetes resource requests to match P75 usage
- Test PostgreSQL instance downsize during low-traffic hours
- Evaluate Redis instance family change (r6g → m6g if workload allows)
Monitoring Setup
# Key metrics to track cost optimization impact
# Kubernetes: usage-to-request ratios, node utilization
# PostgreSQL: connection count, cache hit ratio, temp files
# Redis: fragmentation ratio, memory utilization, ops/sec
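A minimal snapshot sketch for those metrics, assuming direct kubectl, psql, and redis-cli access (adapt connection details to your environment):
kubectl top nodes --no-headers | awk '{print $1": cpu "$3", mem "$5}'
psql -Atc "SELECT 'connections: '||count(*) FROM pg_stat_activity;"
psql -Atc "SELECT 'cache_hit_ratio: '||round(100.0*sum(blks_hit)/(sum(blks_hit)+sum(blks_read)),2) FROM pg_stat_database;"
redis-cli INFO memory | grep -E 'mem_fragmentation_ratio|used_memory_human'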
Expected Savings
- Kubernetes: 30-50% reduction in compute costs through right-sizing
- PostgreSQL: 20-40% savings through instance downsizing + connection pooling
- Redis: 15-25% through instance type optimization
The key insight is that most over-provisioning stems from configuration issues (too many connections, poor memory settings, fragmented memory) rather than actual resource needs. Fix the configuration first, then right-size the infrastructure.