Unacked mutex expiration triggers cascading worker failures every 5 minutes
critical, availability. Updated Nov 2, 2023 (via Exa)
How to detect:
When the unacked_mutex expires (default TTL: 5 minutes), another worker acquires it, attempts to restore the stuck unacked tasks, and becomes stuck itself. The result is a cascading failure in which one additional worker fails roughly every 5 minutes. The pattern correlates with a zrevrangebyscore command executing once every 5 minutes and with a high volume of hget commands.
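The correlation described above can be checked by comparing two snapshots of Redis command statistics. The following is a minimal sketch, not a definitive detector: it assumes the snapshots are dicts shaped like the output of redis-py's `info("commandstats")` (keys such as `cmdstat_hget`, each with a `calls` counter). The function names and the `hget_threshold` value are hypothetical and should be tuned to your normal traffic.

```python
def command_call_deltas(before, after, commands=("zrevrangebyscore", "hget")):
    """Return how many times each command ran between two commandstats snapshots."""
    deltas = {}
    for name in commands:
        key = f"cmdstat_{name}"
        calls_before = before.get(key, {}).get("calls", 0)
        calls_after = after.get(key, {}).get("calls", 0)
        deltas[name] = calls_after - calls_before
    return deltas


def looks_like_mutex_cycle(deltas, window_seconds,
                           mutex_ttl_seconds=300, hget_threshold=1000):
    """Heuristic: at least one restore scan per mutex TTL period, plus heavy hget use.

    hget_threshold is an illustrative assumption, not a documented value.
    """
    expected_scans = window_seconds // mutex_ttl_seconds
    scans = deltas.get("zrevrangebyscore", 0)
    heavy_hget = deltas.get("hget", 0) >= hget_threshold
    return expected_scans >= 1 and scans >= expected_scans and heavy_hget


# Example snapshots taken 10 minutes apart (values are made up for illustration).
# With a live server they could come from: redis.Redis().info("commandstats")
snapshot_before = {
    "cmdstat_hget": {"calls": 100},
    "cmdstat_zrevrangebyscore": {"calls": 4},
}
snapshot_after = {
    "cmdstat_hget": {"calls": 5100},
    "cmdstat_zrevrangebyscore": {"calls": 6},
}

deltas = command_call_deltas(snapshot_before, snapshot_after)
suspicious = looks_like_mutex_cycle(deltas, window_seconds=600)
```

A positive result only indicates that the command pattern matches. Confirming the failure still requires checking whether workers are actually stalling.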
Recommended action:
Increase the unacked_mutex TTL from the default 5 minutes to several hours as a temporary mitigation; this slows failure propagation but does not remove the cause. Monitor Redis commands: zrevrangebyscore executing once every 5 minutes indicates the mutex expiration cycle, and excessive hget usage indicates stuck restoration attempts. Set the visibility timeout high enough to prevent premature task restoration. Then investigate and remove the root-cause tasks from the unacked queue.
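The TTL and visibility timeout changes might be expressed as follows. This sketch assumes the workers are Celery workers using the Kombu Redis transport, which the card does not state explicitly. It also assumes the option names `unacked_mutex_expire` and `visibility_timeout` are accepted as transport options by your installed Kombu version, so verify both before relying on them. The helper name and the hour values are illustrative only.

```python
HOUR = 3600  # seconds


def build_transport_options(mutex_ttl_hours=6, visibility_timeout_hours=12):
    """Build broker transport options that slow the restore cycle.

    Both durations are illustrative assumptions; pick values longer than
    your longest-running task.
    """
    return {
        # Assumed Kombu Redis transport option (default 300 seconds).
        "unacked_mutex_expire": mutex_ttl_hours * HOUR,
        # Time before an unacknowledged task is considered lost and restored.
        "visibility_timeout": visibility_timeout_hours * HOUR,
    }


broker_transport_options = build_transport_options()

# With Celery, the dict would typically be applied like this:
#   app.conf.broker_transport_options = broker_transport_options
```

Raising these values only buys time. The stuck tasks stay in the unacked queue until they are removed.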