Tasks fail or change state through scheduler intervention (e.g., DAG parsing timeout, queued-task timeout, missed heartbeat) rather than executor action, producing confusing log trails and usually pointing to resource or configuration issues.
Tasks execute on different workers/containers without unified tracing, making it impossible to understand end-to-end latency, trace failures to root cause, or see downstream impact of errors across the pipeline.
A KeepAliveTimeout set too high lets idle connections hold worker threads/processes, reducing effective concurrency; set too low, it forces excessive TCP handshakes and increases latency. Optimal values depend on the traffic profile: APIs typically do well with 2-5s, static content with 5-15s.
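A minimal sketch of the relevant Apache directives for an API-heavy profile, following the 2-5s guidance above; the specific values are illustrative assumptions, not recommendations for any particular deployment:

```apache
# Illustrative API-profile keep-alive settings; tune against real traffic.
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 100
```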
Execution dates lag far behind current time, indicating scheduler cannot keep up with DAG parsing and task scheduling load, leading to delayed pipeline runs and stale data.
DAGs fail to appear in UI because Python syntax errors, import failures, or missing DAG object prevent successful parsing, blocking all task execution for that DAG.
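Syntax errors can be caught before deploy with a plain `compile()` pass, without importing Airflow at all. The `check_dag_file` helper below is a hypothetical pre-deploy lint sketch, not an Airflow API; import failures and a missing DAG object still only surface when Airflow actually imports the file.

```python
from typing import Optional


def check_dag_file(source: str, filename: str = "<dag>") -> Optional[str]:
    """Return a parse-error message, or None if the file at least compiles.

    Catches only the syntax errors that make a DAG silently vanish from
    the UI; import errors and missing DAG objects need a real import.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None
```

Running this in CI against every file in the DAGs folder turns a silently missing DAG into a failing build.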
Tasks remain queued indefinitely because pool slots are fully consumed by long-running or stuck tasks, blocking execution even when workers are available.
Tasks fail with exit code -9 (SIGKILL) or lose their connection to the scheduler when the worker process exceeds memory limits, often leaving no clear error in task logs.
DAGs run at unexpected times or miss scheduled windows due to implicit UTC vs. local timezone handling, leading to data freshness issues and business impact.
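A schedule written without an explicit timezone is interpreted in UTC, so the local run time drifts with DST. A stdlib-only sketch of the mismatch (the dates and the New York timezone are illustrative assumptions); in Airflow the fix is a timezone-aware `start_date`:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# A schedule written as "0 6 * * *" with no timezone runs at 06:00 UTC.
# If the business expectation is 06:00 in New York, the job actually
# fires at a different local hour, and that hour shifts with DST.
utc_run = datetime(2024, 1, 15, 6, 0, tzinfo=ZoneInfo("UTC"))
local_equivalent = utc_run.astimezone(ZoneInfo("America/New_York"))
# In winter (UTC-5), 06:00 UTC is 01:00 New York time.
```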
Heavy Python imports, dynamic code execution, or top-level database queries in DAG files cause parsing to take seconds per file, creating scheduler bottleneck and delaying task starts.
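The scheduler re-executes every DAG file's module-level code on each parse, so anything heavy at the top level is paid repeatedly. A sketch of the deferred-import pattern; the names are hypothetical, and `json` stands in for a genuinely heavy dependency like pandas:

```python
# Anti-pattern (commented out): module-level work runs on EVERY parse.
# import pandas as pd                   # slow import at parse time
# ROWS = query_warehouse("SELECT ...")  # database call at parse time


def transform(**context):
    # Pattern: defer heavy imports and queries into the task callable,
    # so they run only when the task executes, not when the file parses.
    import json  # stand-in for a heavy import such as pandas

    return json.dumps({"ok": True})
```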
StatsD metrics emitted by Airflow use flat naming (dag.task.duration) without structured labels, making it difficult to aggregate, filter, or correlate metrics across DAGs, tasks, and runs in Prometheus/Grafana dashboards.
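A common remedy is remapping the flat names into labeled metrics at the collection layer (e.g., statsd_exporter mapping rules). The `parse_task_duration` function below is a hypothetical pure-Python sketch of that parsing step, assuming the flat `dag.<dag_id>.<task_id>.duration` shape named above:

```python
from typing import Dict, Optional


def parse_task_duration(metric: str) -> Optional[Dict[str, str]]:
    """Split a flat 'dag.<dag_id>.<task_id>.duration' name into labels
    suitable for Prometheus aggregation; return None for other metrics."""
    parts = metric.split(".")
    if len(parts) != 4 or parts[0] != "dag" or parts[3] != "duration":
        return None
    return {"dag_id": parts[1], "task_id": parts[2], "metric": "task_duration"}
```

With labels extracted this way, dashboards can aggregate durations across all DAGs or filter to one task without regex gymnastics on metric names.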
High hit-for-pass rates in Varnish suggest requests are bypassing the cache and hitting the backend directly, potentially overloading Apache or other upstream servers. This commonly occurs when cache policies are misconfigured or when content is marked as uncacheable.
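A frequent culprit is a `Set-Cookie` header on responses that should be cacheable, which makes Varnish create hit-for-pass objects. A hedged VCL 4.x sketch; the `/static/` path and 10-minute TTL are assumptions about a site's policy, not defaults:

```vcl
sub vcl_backend_response {
    # If these paths are genuinely cacheable, strip the cookie rather
    # than letting the response become a hit-for-pass object.
    if (bereq.url ~ "^/static/") {
        unset beresp.http.Set-Cookie;
        set beresp.ttl = 10m;
    }
}
```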
Rising varnish_backend_fail and varnish_backend_toolate metrics indicate Varnish is timing out waiting for Apache responses, often due to misaligned timeout configurations between proxy and origin.
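One way to align the two layers is to keep Varnish's backend timeouts comfortably above Apache's own `Timeout`, so the proxy never gives up first. A sketch of a VCL backend definition with illustrative values:

```vcl
backend apache {
    .host = "127.0.0.1";
    .port = "8080";
    # Keep these above Apache's Timeout so Varnish does not abandon
    # responses the origin would still have delivered.
    .connect_timeout = 5s;
    .first_byte_timeout = 60s;
    .between_bytes_timeout = 30s;
}
```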
Elevated varnish_sess_drop signals that Varnish is rejecting connections due to worker thread exhaustion or queue overflow, often caused by slow backend responses keeping workers occupied.
Long KeepAliveTimeout values in Apache combined with high concurrent connections can starve worker threads, causing Varnish to see backends as busy or unresponsive even when Apache has available capacity.
Elevated varnish_esi_errors or varnish_esi_warnings indicate Edge Side Includes processing failures, which can trigger multiple backend requests per page load and overwhelm Apache with cascading sub-requests.
In Airflow deployments using CeleryExecutor with RabbitMQ as the broker, frequent DAG parsing combined with high Celery task rates can create CPU contention on the RabbitMQ broker, especially when handling task result messages.
Apache prefork MPM's process-per-connection model becomes memory-inefficient when fronted by Varnish, as cache misses can trigger sudden process spawn bursts that exhaust available RAM and force swapping.
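The usual guard is capping `MaxRequestWorkers` so a miss burst cannot outgrow RAM. A hedged prefork sizing sketch; the numbers assume roughly 4 GB available to Apache and ~40 MB resident per child, which must be measured on the actual host:

```apache
# Rough sizing rule:
#   MaxRequestWorkers ~= (RAM budget for Apache) / (avg RSS per child)
#   e.g. 4 GB / 40 MB ~= 100
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers      100
    MaxConnectionsPerChild 1000
</IfModule>
```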
Webserver repeatedly crashes and restarts (503 errors) when DAG loading exceeds timeout threshold, preventing UI access even while scheduler and workers continue operating.
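A stopgap is raising the webserver's gunicorn timeouts while the slow DAG loading itself is fixed. A sketch of the relevant `airflow.cfg` options, assuming Airflow 2.x option names; 300 s is an illustrative value:

```ini
[webserver]
web_server_master_timeout = 300
web_server_worker_timeout = 300
```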
Sensor tasks run in poke mode occupy worker slots indefinitely while waiting for conditions, reducing available concurrency for productive work and causing task backlog.
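The usual fix is switching long-waiting sensors to `mode="reschedule"`, which frees the worker slot between pokes. A back-of-envelope comparison of slot time consumed, using assumed numbers (a 2-hour wait, 5-minute poke interval, ~2 s of actual work per poke):

```python
wait_seconds = 2 * 3600   # total time until the condition is met (assumed)
poke_interval = 300       # seconds between pokes (assumed)
poke_duration = 2         # seconds of real work per poke (assumed)

# mode="poke" (default): the task holds its worker slot for the whole wait.
poke_slot_seconds = wait_seconds

# mode="reschedule": the slot is released between pokes and held only
# while each poke actually runs.
pokes = wait_seconds // poke_interval
reschedule_slot_seconds = pokes * poke_duration
```

Under these assumptions the slot cost drops from 7200 slot-seconds to 48, which is why reschedule mode is the default recommendation for sensors with long or unpredictable waits.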