Kubernetes, CoreDNS

DNS resolution failures for code location pods in Dagster+

critical
Connection Management
Updated Oct 7, 2024 (via Exa)
How to detect:

In Dagster+ deployments, sensors fail with DNS resolution errors for code location hostnames (e.g., 'datapipelines-prod-f48cdf.dagster-agent-acme-company-prod-0a5ef9e6f0a0.local:4000'). The error 'Domain name not found' suggests either that DNS records for code location pods are not propagating, or that pods are being destroyed before their DNS records are cleaned up, leaving stale entries. The reported failure frequency rose from once every few days to once every 2 hours, which is often enough to affect even daily batch jobs.
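To confirm whether the failures are intermittent, repeated lookups of the affected hostname can be timed and counted. The following is a minimal sketch using only the Python standard library; the function names (`resolve_once`, `probe`, `failure_rate`) are hypothetical helpers, not part of Dagster+, and the probe is only meaningful when run from the same network context as the agent (for example, from a pod in the agent's namespace).

```python
import socket
import time
from typing import Callable, List, Tuple

LookupResult = Tuple[bool, str]


def resolve_once(host: str, port: int,
                 resolver: Callable = socket.getaddrinfo) -> LookupResult:
    """Attempt one DNS lookup and return (ok, detail)."""
    try:
        infos = resolver(host, port, proto=socket.IPPROTO_TCP)
        # Each entry's sockaddr starts with the resolved IP address.
        addresses = sorted({info[4][0] for info in infos})
        return True, ",".join(addresses)
    except socket.gaierror as exc:
        # gaierror is what the stdlib raises when name resolution fails.
        return False, f"gaierror: {exc}"


def probe(host: str, port: int, attempts: int = 5, interval_s: float = 1.0,
          resolver: Callable = socket.getaddrinfo,
          sleep: Callable[[float], None] = time.sleep) -> List[LookupResult]:
    """Resolve the same host several times, pausing between attempts."""
    results = []
    for i in range(attempts):
        results.append(resolve_once(host, port, resolver))
        if i < attempts - 1:
            sleep(interval_s)
    return results


def failure_rate(results: List[LookupResult]) -> float:
    """Fraction of lookups that failed (0.0 when there are no results)."""
    if not results:
        return 0.0
    return sum(1 for ok, _ in results if not ok) / len(results)
```

Calling `probe(host, 4000, attempts=30, interval_s=10)` with the code location hostname from the error message, then passing the result to `failure_rate`, gives a rough failure ratio over five minutes. A ratio strictly between 0 and 1 points toward intermittent resolution rather than a permanently missing record.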

Recommended action:

Verify that the Kubernetes DNS service (CoreDNS) is healthy and properly configured. Check for pod churn or restart patterns that could leave stale DNS entries behind. Review network policies and any service mesh configuration. Investigate DNS caching settings and TTL values. For Dagster+ specifically, contact support, as the failures may relate to agent-to-code-location networking configuration.
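The CoreDNS health and pod churn checks can be partly automated by summarizing restart counts from pod listings. The sketch below parses the JSON that `kubectl get pods -o json` produces; the helper names (`summarize_restarts`, `unhealthy`) and the zero-restart threshold are illustrative assumptions, and the fields read (`status.containerStatuses[].restartCount` and `.ready`) are standard Kubernetes pod status fields.

```python
import json
from typing import Dict, List


def summarize_restarts(pod_list_json: str) -> List[Dict[str, object]]:
    """Summarize readiness and restart counts from a pod list in JSON form."""
    pods = json.loads(pod_list_json).get("items", [])
    summary = []
    for pod in pods:
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        summary.append({
            "name": pod["metadata"]["name"],
            "phase": status.get("phase", "Unknown"),
            # Total restarts across all containers in the pod.
            "restarts": sum(c.get("restartCount", 0) for c in containers),
            # A pod with no container statuses is treated as not ready.
            "ready": bool(containers) and all(c.get("ready", False) for c in containers),
        })
    # Most-restarted pods first, so churn is visible at the top.
    return sorted(summary, key=lambda p: p["restarts"], reverse=True)


def unhealthy(summary: List[Dict[str, object]], max_restarts: int = 0) -> List[str]:
    """Names of pods that are not ready or exceed the restart threshold."""
    return [p["name"] for p in summary
            if not p["ready"] or p["restarts"] > max_restarts]
```

On many clusters the CoreDNS pods can be listed with `kubectl get pods -n kube-system -l k8s-app=kube-dns -o json`, though the namespace and label may differ by distribution. Running the same summary against the namespace that hosts the code location pods shows whether they are restarting at a rate that matches the observed DNS failures.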