CoreDNS Performance and DNS Resolution Timeouts

Critical · Incident Response

Troubleshoot DNS resolution failures and timeouts caused by CoreDNS performance issues.

Prompt: My pods are getting DNS timeout errors when trying to resolve service names or external domains — is CoreDNS overwhelmed, misconfigured, or hitting resource limits?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When troubleshooting DNS timeouts in Kubernetes, start by checking if CoreDNS itself is resource-starved or crashing, then verify network connectivity between pods and CoreDNS. After ruling out these critical issues, investigate query amplification from ndots configuration and pod distribution problems that can cause intermittent failures.

1. Check CoreDNS resource utilization and health
The most common cause of DNS timeouts is CoreDNS running out of resources. Check `kubernetes_cpu_usage` and `kubernetes_memory_usage` for CoreDNS pods—if memory usage exceeds 80% of limits or CPU is consistently maxed out, you've found your culprit. Look for OOMKilled events in pod status, which indicate the insight `coredns-resource-starvation-under-load` is affecting you. If CoreDNS is resource-starved, scale vertically (increase limits to 512Mi memory and 1000m CPU) or horizontally (3+ replicas for high-traffic clusters).
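If CoreDNS turns out to be resource-starved, the scaling suggested above can be applied as a patch to the standard `coredns` Deployment in `kube-system`. A minimal sketch (replica count and resource values taken from the step above; request values are illustrative):

```yaml
# Sketch: raise CoreDNS limits and replica count. Assumes the standard
# Deployment named "coredns" in kube-system; adjust names to your cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 3                # horizontal scaling for high-traffic clusters
  template:
    spec:
      containers:
        - name: coredns
          resources:
            requests:
              cpu: 250m      # illustrative request values
              memory: 256Mi
            limits:
              cpu: 1000m     # vertical scaling per the step above
              memory: 512Mi
```

Before patching, confirm the diagnosis: `kubectl get pods -n kube-system -l k8s-app=kube-dns` shows restart counts, and `kubectl describe pod <coredns-pod> -n kube-system` will show `OOMKilled` as the last termination reason if memory limits were hit.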
2. Verify network connectivity from pods to CoreDNS
Before blaming CoreDNS performance, confirm pods can actually reach it. Test connectivity from an application pod to the CoreDNS service IP (typically 10.96.0.10:53) using `nc -zv 10.96.0.10 53`. If you get 'connection refused' while CoreDNS pods are healthy, you've hit the `networkpolicy-blocking-dns-traffic` issue. Check for overly restrictive NetworkPolicies in both the application namespace and kube-system—you need to allow UDP/TCP port 53 egress to pods matching `k8s-app=kube-dns`. Monitor `kubernetes_network_errors` for packet drops or connection failures.
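If a restrictive egress policy is the culprit, one remedy is an explicit allow rule in the application namespace. A sketch, assuming the standard `k8s-app=kube-dns` labels and a hypothetical `my-app` namespace:

```yaml
# Sketch: permit DNS egress (UDP and TCP 53) from all pods in the
# application namespace to CoreDNS pods in kube-system.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app              # hypothetical application namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Note that TCP 53 matters too: responses larger than 512 bytes (common with many search-domain results) fall back to TCP, so allowing only UDP can produce confusingly intermittent failures.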
3. Look for DNS query amplification from ndots configuration
If DNS failures are intermittent and spike during pod scaling events, you're likely hitting the `intermittent-dns-failures-from-ndots-search-amplification` problem. The default ndots:5 setting causes each lookup to try multiple search domains before attempting the FQDN, creating a query stampede. Calculate your queries-per-application-request ratio—if it exceeds 5:1, reduce ndots to 2 in pod dnsConfig for applications making external calls, or use fully qualified domain names with trailing dots in your code (e.g., 'api.example.com.'). Watch for increased `kubernetes_network_rx_size` and `kubernetes_network_transaction_size` during these events.
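Lowering ndots for a specific workload can be done per pod without touching cluster-wide DNS settings. A sketch, with hypothetical pod and image names:

```yaml
# Sketch: reduce ndots from the default 5 to 2 for a workload that mostly
# resolves external FQDNs. dnsConfig merges with the default ClusterFirst
# policy, so in-cluster service names still resolve.
apiVersion: v1
kind: Pod
metadata:
  name: external-api-client     # hypothetical pod name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: my-app:1.0         # hypothetical image
```

With ndots:2, a name like `api.example.com` (two dots) is tried as an absolute name first instead of being appended to every search domain, eliminating the wasted search-list queries; trailing-dot FQDNs (`api.example.com.`) skip the search list entirely regardless of ndots.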
4. Check CoreDNS pod distribution across nodes
Multiple CoreDNS pods scheduled on the same node create single points of failure and uneven load distribution. Run `kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide` to check if you're hitting the `coredns-pod-anti-affinity-violations` issue. If multiple replicas are co-located, configure pod anti-affinity with `preferredDuringSchedulingIgnoredDuringExecution` and `topologyKey: kubernetes.io/hostname` to spread them across nodes. This improves reliability and prevents one node's problems from cascading into cluster-wide DNS failures.
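The anti-affinity rule described above goes under the CoreDNS Deployment's pod template spec. A minimal sketch:

```yaml
# Sketch: prefer spreading CoreDNS replicas across distinct nodes.
# "preferred" (soft) rather than "required" (hard) so pods still schedule
# if the cluster temporarily has fewer nodes than replicas.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              k8s-app: kube-dns
          topologyKey: kubernetes.io/hostname
```

The soft rule is a deliberate trade-off: `requiredDuringScheduling` would guarantee spread but can leave replicas Pending during node maintenance, which is worse for DNS availability than temporary co-location.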
5. Investigate CPU steal from shared infrastructure
If you see inconsistent DNS latency that's hard to attribute—fast sometimes, slow others—and you're running on shared CPU instances, you're experiencing the `shared-cpu-droplet-impact-on-coredns-stability` problem. CPU steal from noisy neighbors causes unpredictable performance that won't show up clearly in `kubernetes_cpu_usage` metrics. Migrate CoreDNS pods to dedicated CPU instances using node affinity rules with `node.kubernetes.io/instance-type: dedicated-cpu`. Compare DNS response times between shared and dedicated nodes to confirm the correlation.
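Pinning CoreDNS to dedicated-CPU nodes can be expressed as node affinity on the Deployment's pod template. A sketch; the `node.kubernetes.io/instance-type` label value is an assumption about how your provider labels node classes, so verify with `kubectl get nodes --show-labels` first:

```yaml
# Sketch: require CoreDNS to schedule only onto dedicated-CPU nodes.
# The label value "dedicated-cpu" is an assumption; use the actual
# instance-type values present on your nodes.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - dedicated-cpu
```

To confirm the correlation before migrating everything, you can taint one dedicated node, schedule a single CoreDNS replica there, and compare its response-time percentiles against replicas on shared nodes.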

Technologies

Related Insights

CoreDNS Resource Starvation Under Load
critical
Insufficient CPU or memory allocation causes CoreDNS to become unresponsive or crash under high query loads, manifesting as OOMKilled events and DNS resolution timeouts.
CoreDNS Pod Anti-Affinity Violations
warning
Multiple CoreDNS pods scheduled on the same node create single points of failure and uneven query load distribution, reducing reliability and causing localized performance issues.
Intermittent DNS Failures from ndots Search Amplification
warning
High ndots setting (default 5) causes excessive DNS queries as each name is tried with all search domains before FQDN lookup, leading to DNS cache stampede and intermittent failures during pod scaling events.
NetworkPolicy Blocking DNS Traffic
critical
Overly restrictive NetworkPolicies prevent pods from reaching CoreDNS service, causing 'connection refused' or timeout errors that appear as application-level DNS failures rather than network issues.
Shared CPU Droplet Impact on CoreDNS Stability
warning
CoreDNS running on shared CPU instances experiences intermittent slowness due to CPU steal, particularly during neighbor workload spikes, causing unpredictable DNS latency that's difficult to attribute.
DNS Resolution Delays Eat Into Exporter Timeout Budget
warning

Relevant Metrics

Monitoring Interfaces

Kubernetes Datadog