Apache Pulsar

Replication Rate Mismatch Signals Cross-Region Lag or Failure

critical
Replication
Updated Apr 24, 2024

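A sustained divergence between pulsar_replication_rate_out at the source cluster and pulsar_replication_rate_in at the target cluster, or a sustained nonzero pulsar_replication_rate_expired, indicates that geo-replication is failing or falling behind. Messages that have not been replicated are lost if the source region fails, so this risks data loss in disaster recovery scenarios.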

How to detect:

Compare pulsar_replication_rate_in at the target cluster with pulsar_replication_rate_out at the source cluster over the same time window; a gap that persists indicates replication lag. Monitor pulsar_replication_rate_expired for messages that expire before replication completes. Check the pulsar_replication_connected status for broken replication links. Correlate with pulsar_replication_throughput_out to assess bandwidth utilization.
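
The checks above can be sketched as a small evaluation function. This is a minimal illustration, not a Pulsar API: the ReplicationSample type, the evaluate function, the 10% tolerance, and the three-sample window are all assumptions chosen for the example, and the samples are assumed to have already been scraped from both clusters and aligned by time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicationSample:
    """One aligned observation from both clusters (hypothetical container)."""
    rate_out_source: float  # pulsar_replication_rate_out at the source, msg/s
    rate_in_target: float   # pulsar_replication_rate_in at the target, msg/s
    rate_expired: float     # pulsar_replication_rate_expired, msg/s
    connected: bool         # derived from the replication connection status


def longest_run(flags: List[bool]) -> int:
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def evaluate(samples: List[ReplicationSample],
             tolerance: float = 0.10,
             min_consecutive: int = 3) -> List[str]:
    """Return the list of findings for a window of samples.

    tolerance and min_consecutive are illustrative defaults, not
    values recommended by Pulsar.
    """
    findings = []
    # A sample mismatches when the rates differ by more than `tolerance`
    # of the larger rate (floored at 1 msg/s to ignore idle noise).
    mismatch = [
        abs(s.rate_out_source - s.rate_in_target)
        > tolerance * max(s.rate_out_source, s.rate_in_target, 1.0)
        for s in samples
    ]
    expiring = [s.rate_expired > 0 for s in samples]

    if longest_run(mismatch) >= min_consecutive:
        findings.append("rate_mismatch")
    if longest_run(expiring) >= min_consecutive:
        findings.append("messages_expiring")
    # Only the most recent sample decides whether the link is down now.
    if samples and not samples[-1].connected:
        findings.append("link_down")
    return findings
```

Requiring several consecutive bad samples keeps short bursts from firing the alert; the window length should follow the scrape interval in use.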

Recommended action:

Investigate network connectivity between regions and check for bandwidth saturation. Verify the replication policies and ensure the target cluster has sufficient resources (bandwidth, broker capacity). Check for authentication or authorization failures that block replication. Consider increasing the message TTL if some replication lag is legitimate and expected. Scale broker and BookKeeper capacity at the target cluster if inbound replication is the bottleneck.
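
When deciding whether a larger TTL would help, a rough estimate of how long the newest message waits behind the current backlog can be useful. The sketch below is a back-of-the-envelope calculation under stated assumptions: min_ttl_seconds is a hypothetical helper, the backlog figure is assumed to come from a replication backlog gauge, rates are assumed to be steady, and the safety factor of 2 is arbitrary.

```python
import math


def min_ttl_seconds(backlog_msgs: float,
                    replication_rate: float,
                    produce_rate: float,
                    safety_factor: float = 2.0) -> float:
    """Estimate a TTL that lets the current backlog replicate before expiry.

    backlog_msgs:     messages waiting to be replicated (assumed input)
    replication_rate: messages/s leaving the source via replication
    produce_rate:     messages/s being published to the source
    """
    # If replication is not faster than production, the backlog keeps
    # growing and no finite TTL is enough; fix capacity or the link instead.
    if replication_rate <= produce_rate:
        return math.inf
    # The newest message waits for everything ahead of it to replicate.
    return safety_factor * backlog_msgs / replication_rate


# Example: 600,000 backlogged messages, replicating at 1,500 msg/s
# while producers publish 500 msg/s.
ttl = min_ttl_seconds(600_000, 1_500, 500)
```

An infinite result means a longer TTL only delays expiry; in that case the network and capacity actions above apply rather than a TTL change.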