Millisecond-to-second conversion bug causes 1000x longer retry delays

critical

configuration

Updated Sep 26, 2024 (via Exa)

Sources

Some messages not being processed · Issue #652 · Bogdanp/dramatiq (github.com)

Technologies:

Dramatiq (subject)

Apache Kafka: Apache Kafka metrics correlate with this issue and help confirm diagnosis

How to detect:

A bug in the Kafka provider's retry delay logic passes a millisecond value directly to time.sleep(), which expects seconds, so each retry delay is 1000 times longer than intended. Workers stay blocked for the full duration, which leads to pool exhaustion.
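
To make the scale of the error concrete, here is a minimal sketch of the arithmetic. The function names and the 5,000 ms delay are hypothetical and are not taken from the provider's code; the only fact relied on is that Python's time.sleep() interprets its argument as seconds.

```python
MS_PER_SECOND = 1000


def buggy_sleep_arg(delay_ms: float) -> float:
    """Hypothetical faulty path: the raw millisecond value reaches time.sleep()."""
    return delay_ms


def fixed_sleep_arg(delay_ms: float) -> float:
    """Hypothetical corrected path: convert to seconds before sleeping."""
    return delay_ms / MS_PER_SECOND


delay_ms = 5_000  # assumed example: an intended retry delay of 5 seconds

blocked = buggy_sleep_arg(delay_ms)   # seconds the worker would actually block
intended = fixed_sleep_arg(delay_ms)  # seconds the worker should block

print(f"intended: {intended} s, actual: {blocked} s ({blocked / 60:.1f} min)")
```

Under these assumptions, a delay meant to last a few seconds holds a worker for well over an hour, which is consistent with the pool exhaustion described above.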

Recommended action:

Check the Kafka provider code for time.sleep() calls that take a delay parameter, and verify that each one converts milliseconds to seconds first. Where the conversion is missing, divide the millisecond value by 1000 before calling time.sleep(). Review the retry delay configuration to confirm its values are in the expected units. Monitor the dramatiq.messages.retried metric for abnormal patterns.
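
One way to make the unit conversion hard to forget is to route every delay through a single helper that names its unit. The sketch below is an illustration only: sleep_ms, its max_seconds cap, and the 600-second default are hypothetical and are not part of Dramatiq or any Kafka client. The cap is an added safety assumption, not something the original report prescribes.

```python
import time
from typing import Callable


def sleep_ms(
    delay_ms: float,
    sleep: Callable[[float], None] = time.sleep,
    max_seconds: float = 600.0,  # assumed upper bound; tune to the deployment
) -> float:
    """Sleep for delay_ms milliseconds and return the seconds actually requested.

    The conversion to seconds happens in exactly one place, and the cap keeps a
    mis-scaled value from blocking a worker indefinitely.
    """
    if delay_ms < 0:
        raise ValueError("delay_ms must be non-negative")
    seconds = min(delay_ms / 1000.0, max_seconds)
    sleep(seconds)
    return seconds


# Usage with a recording stand-in for time.sleep, so nothing actually blocks.
recorded = []
sleep_ms(5_000, sleep=recorded.append)       # requests a 5-second sleep
sleep_ms(10_000_000, sleep=recorded.append)  # mis-scaled value is capped
print(recorded)
```

Passing the sleep function as a parameter keeps the helper testable without real delays; in production code the default time.sleep would be used.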