Dramatiq

TimeLimitExceeded exception during error handling causes worker thread death

critical
availability
Updated Nov 8, 2021 (via Exa)
How to detect:

When an actor with a time_limit exceeds that limit while blocked in a syscall and then raises an exception, a race condition can cause TimeLimitExceeded to fire while the worker is handling that exception in broker.emit_after. Because TimeLimitExceeded inherits from BaseException (not Exception), it escapes the `except Exception` handlers and kills the worker thread. The worker process then hangs with fewer live threads than it started with. Confirmed unfixed in versions 1.11.0, 1.14.2, and 1.17.0.
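The escape path described above can be illustrated without Dramatiq. The following is a minimal sketch: `FakeTimeLimitExceeded`, `handle`, and `run_worker_step` are hypothetical names invented for this example and are not part of Dramatiq's API. It only shows that a handler written as `except Exception` does not catch a class derived directly from BaseException.

```python
class FakeTimeLimitExceeded(BaseException):
    """Hypothetical stand-in for an interrupt that derives from BaseException."""


def handle(callback):
    """Run callback inside a handler that only catches Exception subclasses."""
    try:
        callback()
    except Exception:
        return "handled"
    return "ok"


def raise_value_error():
    raise ValueError("ordinary failure")


def raise_interrupt():
    raise FakeTimeLimitExceeded()


def run_worker_step(callback):
    """Outer frame: report whether anything escaped the inner handler."""
    try:
        return handle(callback)
    except BaseException as exc:
        return f"escaped: {type(exc).__name__}"


print(run_worker_step(raise_value_error))  # handled
print(run_worker_step(raise_interrupt))    # escaped: FakeTimeLimitExceeded
```

In a real worker thread there is typically no outer frame like `run_worker_step` to catch the interrupt, so the thread's run loop terminates.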

Recommended action:

Run workers under gevent instead of threads when using time limits (the maintainer's recommendation); gevent gives more precise timeouts that do not rely on asynchronous exceptions. Alternatively, implement a watchdog that detects dead worker threads and restarts them. Monitor the dramatiq.worker.threads metric: a decrease indicates that a thread has died. Without gevent, avoid setting time_limit on actors that perform long blocking syscalls.
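The watchdog option could look roughly like the sketch below. It uses only the standard `threading` module. The function names, the `on_dead` callback, and the polling interval are assumptions made for illustration. How the worker thread objects are obtained from Dramatiq, and how a dead thread is restarted, are deliberately left out.

```python
import threading


def find_dead_threads(threads):
    """Return threads that were started but are no longer alive."""
    return [t for t in threads if t.ident is not None and not t.is_alive()]


def check_once(threads, already_reported, on_dead):
    """Call on_dead(thread) for each newly dead thread; return how many were found."""
    newly_dead = [t for t in find_dead_threads(threads) if t not in already_reported]
    for t in newly_dead:
        already_reported.add(t)
        on_dead(t)
    return len(newly_dead)


def run_watchdog(threads, on_dead, stop_event, interval=5.0):
    """Poll until stop_event is set. Intended to run in its own daemon thread."""
    already_reported = set()
    while not stop_event.is_set():
        check_once(threads, already_reported, on_dead)
        stop_event.wait(interval)
```

A caller would start `run_watchdog` in a daemon thread and pass an `on_dead` callback that logs the thread name, raises an alert, or triggers a process restart. Restarting the whole worker process is usually simpler than trying to revive a single thread.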