CoreDNS Performance and DNS Resolution Timeouts

Critical · Incident Response

Troubleshoot DNS resolution failures and timeouts caused by CoreDNS performance issues.

Prompt: My pods are getting DNS timeout errors when trying to resolve service names or external domains — is CoreDNS overwhelmed, misconfigured, or hitting resource limits?

Agent Playbook

When an agent encounters this scenario, Schema provides these diagnostic steps automatically.

When troubleshooting DNS timeouts in Kubernetes, start by checking if CoreDNS itself is resource-starved or crashing, then verify network connectivity between pods and CoreDNS. After ruling out these critical issues, investigate query amplification from ndots configuration and pod distribution problems that can cause intermittent failures.

1. Check CoreDNS resource utilization and health
The most common cause of DNS timeouts is CoreDNS running out of resources. Check `kubernetes_cpu_usage` and `kubernetes_memory_usage` for CoreDNS pods—if memory usage exceeds 80% of limits or CPU is consistently maxed out, you've found your culprit. Look for OOMKilled events in pod status, which indicate the insight `coredns-resource-starvation-under-load` is affecting you. If CoreDNS is resource-starved, scale vertically (increase limits to 512Mi memory and 1000m CPU) or horizontally (3+ replicas for high-traffic clusters).
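If CoreDNS turns out to be resource-starved, the scaling suggested above can be applied as a patch to the standard `coredns` Deployment in `kube-system`. A minimal sketch (replica count and resource values taken from the step above; request values are illustrative):

```yaml
# Sketch: raise CoreDNS limits and replica count. Assumes the standard
# Deployment named "coredns" in kube-system; adjust names to your cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 3                # horizontal scaling for high-traffic clusters
  template:
    spec:
      containers:
        - name: coredns
          resources:
            requests:
              cpu: 250m      # illustrative request values
              memory: 256Mi
            limits:
              cpu: 1000m     # vertical scaling per the step above
              memory: 512Mi
```

Before patching, confirm the diagnosis: `kubectl get pods -n kube-system -l k8s-app=kube-dns` shows restart counts, and `kubectl describe pod <coredns-pod> -n kube-system` will show `OOMKilled` as the last termination reason if memory limits were hit.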
2. Verify network connectivity from pods to CoreDNS
Before blaming CoreDNS performance, confirm pods can actually reach it. Test connectivity from an application pod to the CoreDNS service IP (typically 10.96.0.10:53) using `nc -zv 10.96.0.10 53`. If you get 'connection refused' while CoreDNS pods are healthy, you've hit the `networkpolicy-blocking-dns-traffic` issue. Check for overly restrictive NetworkPolicies in both the application namespace and kube-system—you need to allow UDP/TCP port 53 egress to pods matching `k8s-app=kube-dns`. Monitor `kubernetes_network_errors` for packet drops or connection failures.
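If a restrictive egress policy is the culprit, one remedy is an explicit allow rule in the application namespace. A sketch, assuming the standard `k8s-app=kube-dns` labels and a hypothetical `my-app` namespace:

```yaml
# Sketch: permit DNS egress (UDP and TCP 53) from all pods in the
# application namespace to CoreDNS pods in kube-system.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app              # hypothetical application namespace
spec:
  podSelector: {}                # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Note that TCP 53 matters too: responses larger than 512 bytes (common with many search-domain results) fall back to TCP, so allowing only UDP can produce confusingly intermittent failures.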
3. Look for DNS query amplification from ndots configuration
If DNS failures are intermittent and spike during pod scaling events, you're likely hitting the `intermittent-dns-failures-from-ndots-search-amplification` problem. The default ndots:5 setting causes each lookup to try multiple search domains before attempting the FQDN, creating a query stampede. Calculate your queries-per-application-request ratio—if it exceeds 5:1, reduce ndots to 2 in pod dnsConfig for applications making external calls, or use fully qualified domain names with trailing dots in your code (e.g., 'api.example.com.'). Watch for increased `kubernetes_network_rx_size` and `kubernetes_network_transaction_size` during these events.
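Lowering ndots for a specific workload can be done per pod without touching cluster-wide DNS settings. A sketch, with hypothetical pod and image names:

```yaml
# Sketch: reduce ndots from the default 5 to 2 for a workload that mostly
# resolves external FQDNs. dnsConfig merges with the default ClusterFirst
# policy, so in-cluster service names still resolve.
apiVersion: v1
kind: Pod
metadata:
  name: external-api-client     # hypothetical pod name
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
  containers:
    - name: app
      image: my-app:1.0         # hypothetical image
```

With ndots:2, a name like `api.example.com` (two dots) is tried as an absolute name first instead of being appended to every search domain, eliminating the wasted search-list queries; trailing-dot FQDNs (`api.example.com.`) skip the search list entirely regardless of ndots.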
4. Check CoreDNS pod distribution across nodes
Multiple CoreDNS pods scheduled on the same node create single points of failure and uneven load distribution. Run `kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide` to check if you're hitting the `coredns-pod-anti-affinity-violations` issue. If multiple replicas are co-located, configure pod anti-affinity with `preferredDuringSchedulingIgnoredDuringExecution` and `topologyKey: kubernetes.io/hostname` to spread them across nodes. This improves reliability and prevents one node's problems from cascading into cluster-wide DNS failures.
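The anti-affinity rule described above goes under the CoreDNS Deployment's pod template spec. A minimal sketch:

```yaml
# Sketch: prefer spreading CoreDNS replicas across distinct nodes.
# "preferred" (soft) rather than "required" (hard) so pods still schedule
# if the cluster temporarily has fewer nodes than replicas.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              k8s-app: kube-dns
          topologyKey: kubernetes.io/hostname
```

The soft rule is a deliberate trade-off: `requiredDuringScheduling` would guarantee spread but can leave replicas Pending during node maintenance, which is worse for DNS availability than temporary co-location.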
5. Investigate CPU steal from shared infrastructure
If you see inconsistent DNS latency that's hard to attribute—fast sometimes, slow others—and you're running on shared CPU instances, you're experiencing the `shared-cpu-droplet-impact-on-coredns-stability` problem. CPU steal from noisy neighbors causes unpredictable performance that won't show up clearly in `kubernetes_cpu_usage` metrics. Migrate CoreDNS pods to dedicated CPU instances using node affinity rules with `node.kubernetes.io/instance-type: dedicated-cpu`. Compare DNS response times between shared and dedicated nodes to confirm the correlation.
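Pinning CoreDNS to dedicated-CPU nodes can be expressed as node affinity on the Deployment's pod template. A sketch; the `node.kubernetes.io/instance-type` label value is an assumption about how your provider labels node classes, so verify with `kubectl get nodes --show-labels` first:

```yaml
# Sketch: require CoreDNS to schedule only onto dedicated-CPU nodes.
# The label value "dedicated-cpu" is an assumption; use the actual
# instance-type values present on your nodes.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - dedicated-cpu
```

To confirm the correlation before migrating everything, you can taint one dedicated node, schedule a single CoreDNS replica there, and compare its response-time percentiles against replicas on shared nodes.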

Technologies

Related Insights

CoreDNS Resource Starvation Under Load
critical
Insufficient CPU or memory allocation causes CoreDNS to become unresponsive or crash under high query loads, manifesting as OOMKilled events and DNS resolution timeouts.
CoreDNS Pod Anti-Affinity Violations
warning
Multiple CoreDNS pods scheduled on the same node create single points of failure and uneven query load distribution, reducing reliability and causing localized performance issues.
Intermittent DNS Failures from ndots Search Amplification
warning
High ndots setting (default 5) causes excessive DNS queries as each name is tried with all search domains before FQDN lookup, leading to DNS cache stampede and intermittent failures during pod scaling events.
NetworkPolicy Blocking DNS Traffic
critical
Overly restrictive NetworkPolicies prevent pods from reaching CoreDNS service, causing 'connection refused' or timeout errors that appear as application-level DNS failures rather than network issues.
Shared CPU Droplet Impact on CoreDNS Stability
warning
CoreDNS running on shared CPU instances experiences intermittent slowness due to CPU steal, particularly during neighbor workload spikes, causing unpredictable DNS latency that's difficult to attribute.
DNS Resolution Delays Eat Into Exporter Timeout Budget
warning

Relevant Metrics

Monitoring Interfaces

Kubernetes Datadog