Operator CES Sync Error Cascade
Critical: cilium_operator_ces_sync_errors indicates failures in synchronizing CiliumEndpointSlice (CES) resources. These failures break endpoint aggregation, so the operator fails to update global service state, which can leave service load balancing incomplete across the cluster.
Monitor cilium_operator_ces_sync_errors for increments. Check cilium_operator_ces_queueing_delay_seconds_datadog to see whether sync processing is lagging. A high cilium_operator_count_ceps_per_ces_datadog value may indicate scalability issues with large endpoint sets.
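As a rough illustration of "monitor for increments", the sketch below computes how much a counter such as cilium_operator_ces_sync_errors has grown across a series of scraped values. It assumes the metric behaves like a monotonically increasing counter that restarts from zero when the operator restarts; the function names and the sample values are hypothetical and not part of Cilium or any monitoring product.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Return the total increase of a counter across consecutive samples.

    Assumption: a drop between two samples means the counter was reset
    (for example by an operator restart), so the new value itself is
    counted as the increase since the reset.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def has_new_sync_errors(samples: Sequence[float]) -> bool:
    """True if the counter grew at all over the sampled window."""
    return counter_increase(samples) > 0


if __name__ == "__main__":
    # Hypothetical scraped values of cilium_operator_ces_sync_errors.
    window = [3, 3, 5]
    print(counter_increase(window), has_new_sync_errors(window))
```

Handling resets explicitly avoids missing errors that occur right after an operator restart, when the raw value is lower than the previous sample.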
Review the Cilium operator logs for specific sync failure reasons. Verify that the installed CRD versions are compatible with the operator version. Check the operator's CPU and memory resources. Increase operator replicas if queueing delay is high. Consider breaking up large services into smaller endpoint groups if CES size is the bottleneck.
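The remediation steps above can be sketched as a small triage helper that maps the three signals to the suggested actions. This is only an illustration: the CesSignals type, the suggest_actions function, and both thresholds (5 seconds of queueing delay, 100 CEPs per CES) are hypothetical values chosen for the example, not Cilium defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CesSignals:
    """Hypothetical snapshot of the three metrics discussed above."""
    sync_error_increase: float      # growth of cilium_operator_ces_sync_errors
    queueing_delay_seconds: float   # from cilium_operator_ces_queueing_delay_seconds_datadog
    ceps_per_ces: float             # from cilium_operator_count_ceps_per_ces_datadog


def suggest_actions(
    signals: CesSignals,
    max_delay_seconds: float = 5.0,   # assumed threshold, tune per cluster
    max_ceps_per_ces: float = 100.0,  # assumed threshold, tune per cluster
) -> List[str]:
    """Return remediation hints in the order given in the text above."""
    if signals.sync_error_increase <= 0:
        return []  # no new sync errors, nothing to do
    actions = [
        "review operator logs for sync failure reasons",
        "verify CRD versions are compatible with the operator version",
    ]
    if signals.queueing_delay_seconds > max_delay_seconds:
        actions.append("check operator CPU and memory; consider more operator replicas")
    if signals.ceps_per_ces > max_ceps_per_ces:
        actions.append("consider splitting large services into smaller endpoint groups")
    return actions


if __name__ == "__main__":
    for hint in suggest_actions(CesSignals(4, 12.0, 250.0)):
        print("-", hint)
```

The log and CRD checks are always suggested once errors appear, while the resource and CES-size hints are added only when the corresponding signal exceeds its assumed threshold.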