Unacked mutex expiration triggers cascading worker failures every 5 minutes

critical

availability

Updated Nov 2, 2023 (via Exa)

Sources

Celery worker stuck with unack mutex and high network usage with Kombu & Redis · Issue #1816 · celery/kombu (github.com)

Technologies:

Celery (subject)

Redis: Redis metrics correlate with this issue and help confirm the diagnosis

How to detect:

When the unacked_mutex expires (default TTL: 5 minutes), another worker acquires it, attempts to restore the stuck unacked tasks, and becomes stuck itself. The result is a cascading failure in which one additional worker fails every 5 minutes. On the Redis side, the pattern shows up as a zrevrangebyscore command executed once every 5 minutes, together with unusually high hget command usage.
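The detection rule above can be sketched as a comparison of two Redis `INFO commandstats` snapshots taken some minutes apart. This is a minimal sketch, not a tested diagnostic: the function names, the sample snapshot text, and the `hget_per_restore` threshold are hypothetical and chosen for illustration. Only the 5-minute mutex TTL and the two command names come from the description above.

```python
import re
from typing import Dict


def parse_commandstats(info_text: str) -> Dict[str, int]:
    """Extract {command: total calls} from the text of `INFO commandstats`."""
    calls = {}
    for line in info_text.splitlines():
        match = re.match(r"cmdstat_([^:]+):calls=(\d+)", line.strip())
        if match:
            calls[match.group(1).lower()] = int(match.group(2))
    return calls


def call_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Calls made between the two snapshots for the commands of interest."""
    commands = ("zrevrangebyscore", "hget")
    return {c: after.get(c, 0) - before.get(c, 0) for c in commands}


def looks_like_mutex_cycle(
    deltas: Dict[str, int],
    window_seconds: float,
    mutex_ttl_seconds: float = 300,  # default 5-minute mutex TTL
    hget_per_restore: int = 1000,    # hypothetical threshold, tune per workload
) -> bool:
    """True if restores ran about once per mutex TTL with heavy hget usage."""
    restores = deltas.get("zrevrangebyscore", 0)
    if restores <= 0:
        return False
    expected = window_seconds / mutex_ttl_seconds
    periodic = abs(restores - expected) <= 1
    heavy_hget = deltas.get("hget", 0) / restores >= hget_per_restore
    return periodic and heavy_hget


# Illustrative snapshots taken 10 minutes apart (made-up numbers).
snapshot_before = """# Commandstats
cmdstat_hget:calls=5000,usec=9000,usec_per_call=1.80
cmdstat_zrevrangebyscore:calls=10,usec=400,usec_per_call=40.00
"""
snapshot_after = """# Commandstats
cmdstat_hget:calls=65000,usec=120000,usec_per_call=1.85
cmdstat_zrevrangebyscore:calls=12,usec=480,usec_per_call=40.00
"""

deltas = call_deltas(parse_commandstats(snapshot_before),
                     parse_commandstats(snapshot_after))
suspicious = looks_like_mutex_cycle(deltas, window_seconds=600)
```

In practice the snapshot text would come from running `redis-cli INFO commandstats` against the broker. A positive result only indicates that the command pattern matches and should be confirmed against worker behavior.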

Recommended action:

As a temporary mitigation, increase the unacked_mutex TTL from the default 5 minutes to several hours; this slows failure propagation but does not fix the underlying problem. Monitor Redis commands: zrevrangebyscore executing once every 5 minutes indicates the mutex expiration cycle, and excessive hget usage indicates stuck restoration attempts. Set the visibility timeout high enough to prevent premature task restoration. Finally, identify the tasks that cause restoration to hang and remove them from the unacked queue.
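The TTL and visibility timeout changes might look like the following Celery configuration fragment. This is a sketch under assumptions: it assumes the Kombu Redis transport reads `unacked_mutex_expire` and `visibility_timeout` (both in seconds) from `broker_transport_options`, which should be checked against the installed Kombu version. The specific durations are illustrative, not recommendations from the source.

```python
# Celery settings sketch, e.g. for a celeryconfig.py module.
# Assumption: the Kombu Redis transport accepts these keys as transport options.

HOUR = 60 * 60  # seconds

broker_transport_options = {
    # Hold the restore mutex for hours instead of the default 5 minutes,
    # so a stuck restore attempt is handed to another worker far less often.
    "unacked_mutex_expire": 4 * HOUR,  # illustrative value
    # Keep unacked tasks invisible longer than the longest expected task,
    # so they are not restored prematurely.
    "visibility_timeout": 12 * HOUR,   # illustrative value
}
```

Because the longer TTL only slows the cascade, it is best paired with the investigation of the stuck tasks rather than treated as a fix.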