Pod Startup Delays and ImagePullBackOff
Warning · Incident Response
Diagnose why pods are slow to start or stuck in ImagePullBackOff, delaying application availability.
Prompt: “My pods are stuck in ImagePullBackOff or taking forever to start up — is this an image registry issue, network problem, image size problem, or something with the node's image cache?”
Agent Playbook
When an agent encounters this scenario, Schema provides these diagnostic steps automatically.
When diagnosing ImagePullBackOff or slow pod startup, start by examining the actual pod events to identify the specific failure mode (auth, network, timeout). Then verify image pull secrets are correctly configured in the right namespace, check network connectivity to the registry, and finally investigate node resource constraints that might prevent image caching or pod scheduling.
1. Check pod events for the specific error message
Run `kubectl describe pod <pod-name>` and look at the Events section for the exact failure reason. ImagePullBackOff is often a symptom of different root causes: 'unauthorized' or 'authentication required' points to secret issues, 'connection refused' or 'timeout' indicates network problems, and 'image not found' suggests registry or image name problems. The error message will guide which path to investigate next.
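The triage logic above can be sketched as a small shell helper that maps an event message to a likely root cause. The function name and output strings are illustrative, not part of any tool; feed it the Message column from pod events.

```shell
# Hypothetical helper: classify the Events text from `kubectl describe pod`
# into a likely root cause, following the patterns described above.
classify_pull_error() {
  case "$1" in
    *unauthorized*|*"authentication required"*)      echo "auth: check imagePullSecrets" ;;
    *"connection refused"*|*timeout*)                echo "network: check registry reachability" ;;
    *"not found"*|*"manifest unknown"*)              echo "image: check name/tag and registry" ;;
    *)                                               echo "unknown: inspect full events" ;;
  esac
}

classify_pull_error "rpc error: authentication required"   # → auth: check imagePullSecrets

# Run it against live events for a pod, e.g.:
#   kubectl get events --field-selector involvedObject.name=<pod-name> \
#     -o custom-columns=MSG:.message
```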
2. Verify image pull secrets exist and match exactly
This is the most common cause I've seen. Check that the secret exists in the correct namespace with `kubectl get secret <secret-name> -n <namespace>`. Even a single character typo in the secret name referenced in your deployment spec will cause silent authentication failures. Compare the imagePullSecrets name in your pod spec to the actual secret name—Kubernetes won't warn you about mismatches.
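A minimal sketch of that comparison, assuming placeholder names (`my-pod`, `my-namespace`): pull both lists with jsonpath, then check each referenced secret actually exists. The comparison function is pure shell and runnable as-is.

```shell
# Gather the names (requires cluster access; names are placeholders):
#   want=$(kubectl get pod my-pod -n my-namespace \
#     -o jsonpath='{.spec.imagePullSecrets[*].name}')
#   have=$(kubectl get secrets -n my-namespace \
#     -o jsonpath='{.items[*].metadata.name}')

# Compare a referenced secret name against the existing names:
check_secret() {  # check_secret <referenced-name> <space-separated existing names>
  case " $2 " in
    *" $1 "*) echo "OK: $1 exists" ;;
    *)        echo "MISSING: pod references $1 but no such secret" ;;
  esac
}

check_secret regcred "default-token regcred"   # → OK: regcred exists
check_secret regcerd "default-token regcred"   # one-character typo → MISSING
```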
3. Check network connectivity between nodes and the registry
Monitor `kubernetes_network_errors` for spikes during image pull attempts—any non-zero values indicate connectivity issues. If you're on Azure AKS (especially versions 1.29.7, 1.31.5, or 1.31.6), you may be hitting known network stability issues that cause API calls to hang. Check `kubernetes_network_rx_size` to see if bytes are actually being received; if it's zero or very low during a pull, the network path to the registry is blocked or severely degraded.
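Beyond the metrics, a direct probe from inside the cluster settles the question quickly. A hedged sketch: the registry host and curl image below are assumptions, so substitute your own.

```shell
# Probe the registry's API endpoint from a throwaway pod (cluster required):
#   kubectl run regcheck --rm -it --restart=Never --image=curlimages/curl -- \
#     curl -sS -o /dev/null -w '%{http_code}\n' https://registry.example.com/v2/
# curl prints 000 when it cannot connect at all.

# Interpreting the status code it prints:
interpret_probe() {
  case "$1" in
    200|401) echo "reachable (401 only means auth is required)" ;;
    000)     echo "unreachable: network path to registry is blocked" ;;
    *)       echo "unexpected status $1: check registry health" ;;
  esac
}

interpret_probe 401
```

Note that a 401 here is good news: the network path works and only authentication remains to be fixed (step 2).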
4. Investigate node resource exhaustion preventing scheduling
If pods stay in Pending status for 60+ seconds before failing, you're likely hitting resource constraints. Check `kubernetes_cpu_usage` and `kubernetes_memory_usage` across nodes—if nodes are at 90%+ capacity, the scheduler can't place pods even if images are cached. Also verify `kubernetes_diskio_io_service_size_stats` to ensure nodes have sufficient disk space for image layers; a full disk prevents both image pulls and pod starts.
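The live checks need a cluster, but the 90%-capacity threshold test itself is trivial to express. A sketch, with the threshold logic runnable anywhere:

```shell
# Live checks (require a cluster and metrics-server for `kubectl top`):
#   kubectl top nodes
#   kubectl describe node <node> | grep -E 'MemoryPressure|DiskPressure|PIDPressure'

# Flag a node whose utilization is at or above a threshold percentage:
over_threshold() {  # over_threshold <used-percent> <threshold-percent>
  if [ "$1" -ge "$2" ]; then
    echo "saturated: scheduler likely cannot place pods"
  else
    echo "ok"
  fi
}

over_threshold 93 90   # → saturated: scheduler likely cannot place pods
```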
5. Assess image size and network throughput for pull timeouts
Large container images (multiple GB) can timeout during pulls if network throughput is low. Watch `kubernetes_network_rx_size` during a pull—if you're seeing <10MB/s on a multi-GB image, the pull will take minutes and may timeout. Check `kubernetes_diskio_io_service_size_stats` to verify disk I/O isn't bottlenecking the image layer extraction. Consider using smaller base images or implementing a local registry mirror if pulls consistently timeout.
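A back-of-envelope estimate makes the timeout risk concrete: compressed image size divided by observed throughput. Integer MB math is enough for triage; the manifest-inspection command at the end is an assumption about your tooling.

```shell
# Estimate pull duration from image size and measured throughput:
estimate_pull_seconds() {  # estimate_pull_seconds <size-mb> <throughput-mb-per-s>
  echo $(( $1 / $2 ))
}

estimate_pull_seconds 4096 8   # a 4 GB image at 8 MB/s → 512 s, likely past pull deadlines

# To find the compressed size (registry path and jq usage are assumptions):
#   docker manifest inspect registry.example.com/app:latest \
#     | jq '[.layers[].size] | add'
```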
6. Verify registry-specific configuration and authentication
For internal or insecure registries, check if you need special configuration in your container runtime. BuildKit users need to ensure registry authentication is properly configured and that BuildKit pods aren't restarting due to resource exhaustion (which loses cached credentials). For private registries, verify the docker-registry secret contains valid, non-expired credentials. Some registries also enforce rate limits that manifest as intermittent pull failures.
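To confirm what a docker-registry secret actually contains, decode its `.dockerconfigjson` payload and check that your registry host appears under `auths`. A hedged sketch; the secret name, namespace, and registry host are placeholders.

```shell
# Decode the secret's payload (requires cluster access):
#   kubectl get secret regcred -n my-namespace \
#     -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d

# Check the decoded JSON for an auth entry for a given registry host
# (pure shell, runnable on the decoded output):
has_registry_auth() {  # has_registry_auth <decoded-json> <registry-host>
  case "$1" in
    *"\"$2\""*) echo "configured" ;;
    *)          echo "missing: no auth entry for $2" ;;
  esac
}

has_registry_auth '{"auths":{"registry.example.com":{"auth":"..."}}}' registry.example.com
# → configured
```

This does not prove the credentials are still valid (expired tokens also cause pull failures), but it rules out the missing-entry case quickly.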
Related Insights
BuildKit Registry Authentication Failures
critical
CrewAI crew builds fail when BuildKit cannot authenticate to container registries, causing silent build failures and preventing crew deployment updates. Network connectivity issues or missing registry secrets compound the problem.
Azure AKS network connectivity issues cause daemon API call hangs
critical
Kubernetes image pull secrets fail on namespace or name mismatch
warning
Kubernetes pods remain Pending and never start after 60 seconds
critical