Pod Startup Delays and ImagePullBackOff
Warning · Incident Response
Diagnose why pods are slow to start or stuck in ImagePullBackOff, delaying application availability.
Prompt: “My pods are stuck in ImagePullBackOff or taking forever to start up — is this an image registry issue, network problem, image size problem, or something with the node's image cache?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing ImagePullBackOff or slow pod startup, start by examining the actual pod events to identify the specific failure mode (auth, network, timeout). Then verify image pull secrets are correctly configured in the right namespace, check network connectivity to the registry, and finally investigate node resource constraints that might prevent image caching or pod scheduling.
1. Check pod events for the specific error message
Run `kubectl describe pod <pod-name>` and look at the Events section for the exact failure reason. ImagePullBackOff is often a symptom of different root causes: 'unauthorized' or 'authentication required' points to secret issues, 'connection refused' or 'timeout' indicates network problems, and 'image not found' suggests registry or image name problems. The error message will guide which path to investigate next.
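The triage logic above can be sketched as a small shell helper that maps an event message to a likely root cause. The function name and output strings are illustrative, not part of any tool; feed it the Message column from pod events.

```shell
# Hypothetical helper: classify the Events text from `kubectl describe pod`
# into a likely root cause, following the patterns described above.
classify_pull_error() {
  case "$1" in
    *unauthorized*|*"authentication required"*)      echo "auth: check imagePullSecrets" ;;
    *"connection refused"*|*timeout*)                echo "network: check registry reachability" ;;
    *"not found"*|*"manifest unknown"*)              echo "image: check name/tag and registry" ;;
    *)                                               echo "unknown: inspect full events" ;;
  esac
}

classify_pull_error "rpc error: authentication required"   # → auth: check imagePullSecrets

# Run it against live events for a pod, e.g.:
#   kubectl get events --field-selector involvedObject.name=<pod-name> \
#     -o custom-columns=MSG:.message
```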
2. Verify image pull secrets exist and match exactly
This is the most common cause I've seen. Check that the secret exists in the correct namespace with `kubectl get secret <secret-name> -n <namespace>`. Even a single character typo in the secret name referenced in your deployment spec will cause silent authentication failures. Compare the imagePullSecrets name in your pod spec to the actual secret name—Kubernetes won't warn you about mismatches.
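A minimal sketch of that comparison, assuming placeholder names (`my-pod`, `my-namespace`): pull both lists with jsonpath, then check each referenced secret actually exists. The comparison function is pure shell and runnable as-is.

```shell
# Gather the names (requires cluster access; names are placeholders):
#   want=$(kubectl get pod my-pod -n my-namespace \
#     -o jsonpath='{.spec.imagePullSecrets[*].name}')
#   have=$(kubectl get secrets -n my-namespace \
#     -o jsonpath='{.items[*].metadata.name}')

# Compare a referenced secret name against the existing names:
check_secret() {  # check_secret <referenced-name> <space-separated existing names>
  case " $2 " in
    *" $1 "*) echo "OK: $1 exists" ;;
    *)        echo "MISSING: pod references $1 but no such secret" ;;
  esac
}

check_secret regcred "default-token regcred"   # → OK: regcred exists
check_secret regcerd "default-token regcred"   # one-character typo → MISSING
```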
3. Check network connectivity between nodes and the registry
Monitor `kubernetes_network_errors` for spikes during image pull attempts—any non-zero values indicate connectivity issues. If you're on Azure AKS (especially versions 1.29.7, 1.31.5, or 1.31.6), you may be hitting known network stability issues that cause API calls to hang. Check `kubernetes_network_rx_size` to see if bytes are actually being received; if it's zero or very low during a pull, the network path to the registry is blocked or severely degraded.
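Beyond the metrics, a direct probe from inside the cluster settles the question quickly. A hedged sketch: the registry host and curl image below are assumptions, so substitute your own.

```shell
# Probe the registry's API endpoint from a throwaway pod (cluster required):
#   kubectl run regcheck --rm -it --restart=Never --image=curlimages/curl -- \
#     curl -sS -o /dev/null -w '%{http_code}\n' https://registry.example.com/v2/
# curl prints 000 when it cannot connect at all.

# Interpreting the status code it prints:
interpret_probe() {
  case "$1" in
    200|401) echo "reachable (401 only means auth is required)" ;;
    000)     echo "unreachable: network path to registry is blocked" ;;
    *)       echo "unexpected status $1: check registry health" ;;
  esac
}

interpret_probe 401
```

Note that a 401 here is good news: the network path works and only authentication remains to be fixed (step 2).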
4. Investigate node resource exhaustion preventing scheduling
If pods stay in Pending status for 60+ seconds before failing, you're likely hitting resource constraints. Check `kubernetes_cpu_usage` and `kubernetes_memory_usage` across nodes—if nodes are at 90%+ capacity, the scheduler can't place pods even if images are cached. Also verify `kubernetes_diskio_io_service_size_stats` to ensure nodes have sufficient disk space for image layers; a full disk prevents both image pulls and pod starts.
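The live checks need a cluster, but the 90%-capacity threshold test itself is trivial to express. A sketch, with the threshold logic runnable anywhere:

```shell
# Live checks (require a cluster and metrics-server for `kubectl top`):
#   kubectl top nodes
#   kubectl describe node <node> | grep -E 'MemoryPressure|DiskPressure|PIDPressure'

# Flag a node whose utilization is at or above a threshold percentage:
over_threshold() {  # over_threshold <used-percent> <threshold-percent>
  if [ "$1" -ge "$2" ]; then
    echo "saturated: scheduler likely cannot place pods"
  else
    echo "ok"
  fi
}

over_threshold 93 90   # → saturated: scheduler likely cannot place pods
```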
5. Assess image size and network throughput for pull timeouts
Large container images (multiple GB) can timeout during pulls if network throughput is low. Watch `kubernetes_network_rx_size` during a pull—if you're seeing <10MB/s on a multi-GB image, the pull will take minutes and may timeout. Check `kubernetes_diskio_io_service_size_stats` to verify disk I/O isn't bottlenecking the image layer extraction. Consider using smaller base images or implementing a local registry mirror if pulls consistently timeout.
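A back-of-envelope estimate makes the timeout risk concrete: compressed image size divided by observed throughput. Integer MB math is enough for triage; the manifest-inspection command at the end is an assumption about your tooling.

```shell
# Estimate pull duration from image size and measured throughput:
estimate_pull_seconds() {  # estimate_pull_seconds <size-mb> <throughput-mb-per-s>
  echo $(( $1 / $2 ))
}

estimate_pull_seconds 4096 8   # a 4 GB image at 8 MB/s → 512 s, likely past pull deadlines

# To find the compressed size (registry path and jq usage are assumptions):
#   docker manifest inspect registry.example.com/app:latest \
#     | jq '[.layers[].size] | add'
```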
6. Verify registry-specific configuration and authentication
For internal or insecure registries, check if you need special configuration in your container runtime. BuildKit users need to ensure registry authentication is properly configured and that BuildKit pods aren't restarting due to resource exhaustion (which loses cached credentials). For private registries, verify the docker-registry secret contains valid, non-expired credentials. Some registries also enforce rate limits that manifest as intermittent pull failures.
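To confirm what a docker-registry secret actually contains, decode its `.dockerconfigjson` payload and check that your registry host appears under `auths`. A hedged sketch; the secret name, namespace, and registry host are placeholders.

```shell
# Decode the secret's payload (requires cluster access):
#   kubectl get secret regcred -n my-namespace \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# Check the decoded JSON for an auth entry for a given registry host
# (pure shell, runnable on the decoded output):
has_registry_auth() {  # has_registry_auth <decoded-json> <registry-host>
  case "$1" in
    *"\"$2\""*) echo "configured" ;;
    *)          echo "missing: no auth entry for $2" ;;
  esac
}

has_registry_auth '{"auths":{"registry.example.com":{"auth":"..."}}}' registry.example.com
# → configured
```

This does not prove the credentials are still valid (expired tokens also cause pull failures), but it rules out the missing-entry case quickly.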
Related Insights
BuildKit Registry Authentication Failures
critical
CrewAI crew builds fail when BuildKit cannot authenticate to container registries, causing silent build failures and preventing crew deployment updates. Network connectivity issues or missing registry secrets compound the problem.
Azure AKS network connectivity issues cause daemon API call hangs
critical
Kubernetes image pull secrets fail on namespace or name mismatch
warning
Kubernetes pods remain Pending and never start after 60 seconds
critical