Tasks fail or change state through scheduler intervention (e.g., DAG parsing timeout, queued-task timeout, missed heartbeat) rather than executor action, producing confusing log trails and usually pointing to resource or configuration issues.
Tasks execute on different workers/containers without unified tracing, making it impossible to understand end-to-end latency, trace failures to root cause, or see downstream impact of errors across the pipeline.
A KeepAliveTimeout set too high lets idle connections hold worker threads/processes, reducing effective concurrency; set too low, it forces excessive TCP handshakes and increases latency. Optimal values depend on the traffic profile: APIs typically do well with 2-5s, static content with 5-15s.
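A minimal sketch of the relevant Apache directives for an API-heavy profile, following the 2-5s guidance above; the specific values are illustrative assumptions, not recommendations for any particular deployment:

```apache
# Illustrative API-profile keep-alive settings; tune against real traffic.
KeepAlive On
KeepAliveTimeout 3
MaxKeepAliveRequests 100
```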
Execution dates lag far behind current time, indicating scheduler cannot keep up with DAG parsing and task scheduling load, leading to delayed pipeline runs and stale data.
DAGs fail to appear in UI because Python syntax errors, import failures, or missing DAG object prevent successful parsing, blocking all task execution for that DAG.
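Syntax errors can be caught before deploy with a plain `compile()` pass, without importing Airflow at all. The `check_dag_file` helper below is a hypothetical pre-deploy lint sketch, not an Airflow API; import failures and a missing DAG object still only surface when Airflow actually imports the file.

```python
from typing import Optional


def check_dag_file(source: str, filename: str = "<dag>") -> Optional[str]:
    """Return a parse-error message, or None if the file at least compiles.

    Catches only the syntax errors that make a DAG silently vanish from
    the UI; import errors and missing DAG objects need a real import.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None
```

Running this in CI against every file in the DAGs folder turns a silently missing DAG into a failing build.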
Tasks remain queued indefinitely because pool slots are fully consumed by long-running or stuck tasks, blocking execution even when workers are available.
Tasks fail with exit code -9 (SIGKILL) or lose their connection to the scheduler when the worker process exceeds memory limits, often leaving no clear error in task logs.
DAGs run at unexpected times or miss scheduled windows due to implicit UTC vs. local timezone handling, leading to data freshness issues and business impact.
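A schedule written without an explicit timezone is interpreted in UTC, so the local run time drifts with DST. A stdlib-only sketch of the mismatch (the dates and the New York timezone are illustrative assumptions); in Airflow the fix is a timezone-aware `start_date`:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

# A schedule written as "0 6 * * *" with no timezone runs at 06:00 UTC.
# If the business expectation is 06:00 in New York, the job actually
# fires at a different local hour, and that hour shifts with DST.
utc_run = datetime(2024, 1, 15, 6, 0, tzinfo=ZoneInfo("UTC"))
local_equivalent = utc_run.astimezone(ZoneInfo("America/New_York"))
# In winter (UTC-5), 06:00 UTC is 01:00 New York time.
```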
Heavy Python imports, dynamic code execution, or top-level database queries in DAG files cause parsing to take seconds per file, creating scheduler bottleneck and delaying task starts.
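The scheduler re-executes every DAG file's module-level code on each parse, so anything heavy at the top level is paid repeatedly. A sketch of the deferred-import pattern; the names are hypothetical, and `json` stands in for a genuinely heavy dependency like pandas:

```python
# Anti-pattern (commented out): module-level work runs on EVERY parse.
# import pandas as pd                   # slow import at parse time
# ROWS = query_warehouse("SELECT ...")  # database call at parse time


def transform(**context):
    # Pattern: defer heavy imports and queries into the task callable,
    # so they run only when the task executes, not when the file parses.
    import json  # stand-in for a heavy import such as pandas

    return json.dumps({"ok": True})
```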
StatsD metrics emitted by Airflow use flat naming (dag.task.duration) without structured labels, making it difficult to aggregate, filter, or correlate metrics across DAGs, tasks, and runs in Prometheus/Grafana dashboards.
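A common remedy is remapping the flat names into labeled metrics at the collection layer (e.g., statsd_exporter mapping rules). The `parse_task_duration` function below is a hypothetical pure-Python sketch of that parsing step, assuming the flat `dag.<dag_id>.<task_id>.duration` shape named above:

```python
from typing import Dict, Optional


def parse_task_duration(metric: str) -> Optional[Dict[str, str]]:
    """Split a flat 'dag.<dag_id>.<task_id>.duration' name into labels
    suitable for Prometheus aggregation; return None for other metrics."""
    parts = metric.split(".")
    if len(parts) != 4 or parts[0] != "dag" or parts[3] != "duration":
        return None
    return {"dag_id": parts[1], "task_id": parts[2], "metric": "task_duration"}
```

With labels extracted this way, dashboards can aggregate durations across all DAGs or filter to one task without regex gymnastics on metric names.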
High hit-for-pass rates in Varnish suggest requests are bypassing the cache and hitting the backend directly, potentially overloading Apache or other upstream servers. This commonly occurs when cache policies are misconfigured or when content is marked as uncacheable.
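A frequent culprit is a `Set-Cookie` header on responses that should be cacheable, which makes Varnish create hit-for-pass objects. A hedged VCL 4.x sketch; the `/static/` path and 10-minute TTL are assumptions about a site's policy, not defaults:

```vcl
sub vcl_backend_response {
    # If these paths are genuinely cacheable, strip the cookie rather
    # than letting the response become a hit-for-pass object.
    if (bereq.url ~ "^/static/") {
        unset beresp.http.Set-Cookie;
        set beresp.ttl = 10m;
    }
}
```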
Rising varnish_backend_fail and varnish_backend_toolate metrics indicate Varnish is timing out waiting for Apache responses, often due to misaligned timeout configurations between proxy and origin.
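One way to align the two layers is to keep Varnish's backend timeouts comfortably above Apache's own `Timeout`, so the proxy never gives up first. A sketch of a VCL backend definition with illustrative values:

```vcl
backend apache {
    .host = "127.0.0.1";
    .port = "8080";
    # Keep these above Apache's Timeout so Varnish does not abandon
    # responses the origin would still have delivered.
    .connect_timeout = 5s;
    .first_byte_timeout = 60s;
    .between_bytes_timeout = 30s;
}
```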
Elevated varnish_sess_drop signals that Varnish is rejecting connections due to worker thread exhaustion or queue overflow, often caused by slow backend responses keeping workers occupied.
Long KeepAliveTimeout values in Apache combined with high concurrent connections can starve worker threads, causing Varnish to see backends as busy or unresponsive even when Apache has available capacity.
Elevated varnish_esi_errors or varnish_esi_warnings indicate Edge Side Includes processing failures, which can trigger multiple backend requests per page load and overwhelm Apache with cascading sub-requests.
In Airflow deployments using CeleryExecutor with RabbitMQ as the broker, frequent DAG parsing combined with high Celery task rates can create CPU contention on the RabbitMQ broker, especially when handling task result messages.
Apache prefork MPM's process-per-connection model becomes memory-inefficient when fronted by Varnish, as cache misses can trigger sudden process spawn bursts that exhaust available RAM and force swapping.
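The usual guard is capping `MaxRequestWorkers` so a miss burst cannot outgrow RAM. A hedged prefork sizing sketch; the numbers assume roughly 4 GB available to Apache and ~40 MB resident per child, which must be measured on the actual host:

```apache
# Rough sizing rule:
#   MaxRequestWorkers ~= (RAM budget for Apache) / (avg RSS per child)
#   e.g. 4 GB / 40 MB ~= 100
<IfModule mpm_prefork_module>
    StartServers             5
    MinSpareServers          5
    MaxSpareServers         10
    MaxRequestWorkers      100
    MaxConnectionsPerChild 1000
</IfModule>
```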
Webserver repeatedly crashes and restarts (503 errors) when DAG loading exceeds timeout threshold, preventing UI access even while scheduler and workers continue operating.
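A stopgap is raising the webserver's gunicorn timeouts while the slow DAG loading itself is fixed. A sketch of the relevant `airflow.cfg` options, assuming Airflow 2.x option names; 300 s is an illustrative value:

```ini
[webserver]
web_server_master_timeout = 300
web_server_worker_timeout = 300
```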
Sensor tasks run in poke mode occupy worker slots indefinitely while waiting for conditions, reducing available concurrency for productive work and causing task backlog.
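The usual fix is switching long-waiting sensors to `mode="reschedule"`, which frees the worker slot between pokes. A back-of-envelope comparison of slot time consumed, using assumed numbers (a 2-hour wait, 5-minute poke interval, ~2 s of actual work per poke):

```python
wait_seconds = 2 * 3600   # total time until the condition is met (assumed)
poke_interval = 300       # seconds between pokes (assumed)
poke_duration = 2         # seconds of real work per poke (assumed)

# mode="poke" (default): the task holds its worker slot for the whole wait.
poke_slot_seconds = wait_seconds

# mode="reschedule": the slot is released between pokes and held only
# while each poke actually runs.
pokes = wait_seconds // poke_interval
reschedule_slot_seconds = pokes * poke_duration
```

Under these assumptions the slot cost drops from 7200 slot-seconds to 48, which is why reschedule mode is the default recommendation for sensors with long or unpredictable waits.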