cilium_operator_ces_sync_errors indicates failures synchronizing CiliumEndpointSlice resources with the Kubernetes API. This breaks endpoint aggregation: the operator cannot publish up-to-date global endpoint state, potentially leaving service load balancing incomplete across the cluster.
Elevated cilium_k8s_client_rate_limiter_time_seconds indicates that Cilium agents' requests to the Kubernetes API server are spending time waiting in the client-side rate limiter. This delays reaction to cluster state changes, causing stale service endpoints, delayed policy enforcement, and slow pod networking setup.
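The throttling behavior can be modeled as a token bucket, similar in spirit to the client-side rate limiter used by Kubernetes client libraries. This is an illustrative sketch, not Cilium's actual code; `qps` and `burst` mirror the usual client-side knobs, and the returned wait is what a metric like cilium_k8s_client_rate_limiter_time_seconds would observe:

```python
class TokenBucket:
    """Deterministic token-bucket sketch of a client-side rate limiter.
    Callers pass the current timestamp so behavior is reproducible."""

    def __init__(self, qps: float, burst: int):
        self.qps = qps                 # tokens refilled per second
        self.burst = burst             # maximum stored tokens
        self.tokens = float(burst)
        self.last = 0.0

    def wait_seconds(self, now: float) -> float:
        """Reserve one request; return how long the caller must wait."""
        # Refill tokens for the time elapsed since the last reservation.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        self.tokens -= 1.0
        if self.tokens >= 0.0:
            return 0.0                 # within qps/burst: no delay
        # Negative balance: wait until the bucket refills enough.
        return -self.tokens / self.qps
```

With `qps=5` and `burst=2`, the first two back-to-back requests go through immediately and the third waits 0.2 s; under sustained API churn these accumulated waits are what push the metric up.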
cilium_proxy_datapath_update_timeout increments when Envoy proxy configuration updates fail to apply within the timeout window. This causes L7 policy enforcement failures and can result in traffic being dropped or misrouted at the application layer.
High cilium_fqdn_active_names or cilium_fqdn_active_ips counts indicate that DNS-based (toFQDNs) policy state is consuming significant memory. Combined with a low cilium_fqdn_gc_deletions_datadog rate, this means stale entries are accumulating, causing endpoint regeneration delays and potential OOM conditions.
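TTL-based garbage collection of the name-to-IP cache can be sketched as follows (illustrative only; Cilium's real GC also accounts for IPs still referenced by live connections and policy selectors):

```python
def gc_fqdn_cache(entries: dict, now: float) -> tuple[dict, int]:
    """Drop expired name -> (ip, expiry) entries and count deletions.

    A healthy GC keeps the deletion count roughly tracking the rate at
    which DNS answers expire; a deletion count near zero while active
    names/IPs grow is the stale-accumulation pattern described above.
    """
    live = {name: (ip, expiry)
            for name, (ip, expiry) in entries.items()
            if expiry > now}
    deletions = len(entries) - len(live)
    return live, deletions
```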
When cilium_operator_ipam_needed_ips exceeds cilium_operator_ipam_ips, pods cannot be scheduled due to IP exhaustion. This is exacerbated by high cilium_operator_ipam_deficit_resolver_time_seconds, indicating the operator is struggling to provision new IPs from the cloud provider.
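As a rough model of the deficit the operator has to resolve, the shortfall is the immediate gap plus any pre-allocation buffer it tries to keep spare (a sketch with hypothetical parameter names, not Cilium's exact IPAM config keys):

```python
def ips_to_provision(needed_ips: int, available_ips: int, pre_allocate: int) -> int:
    """Estimate how many IPs the operator must request from the cloud
    provider. `pre_allocate` is an illustrative stand-in for a
    keep-this-many-spare watermark."""
    # Pods blocked right now because needed exceeds available.
    deficit = max(0, needed_ips - available_ips)
    # Additional IPs to restore the spare buffer after covering the deficit.
    spare = max(0, available_ips - needed_ips)
    buffer_shortfall = max(0, pre_allocate - spare)
    return deficit + buffer_shortfall
```

When this number stays positive while cilium_operator_ipam_deficit_resolver_time_seconds is high, pods remain unschedulable for the duration of each slow cloud-provider allocation round.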
When policy regeneration events accumulate faster than they can be processed, Cilium folds multiple updates into single operations. High fold counts indicate policy churn overwhelming the agent, causing delayed enforcement and potential security gaps.
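The fold behavior can be sketched as a trigger that coalesces calls arriving while a run is already pending (a hypothetical helper for illustration, not Cilium's trigger implementation):

```python
class FoldingTrigger:
    """Coalesce bursts of trigger calls into a single run.

    Every trigger call beyond the first pending one is a "fold": the
    work it requested will be satisfied by the same regeneration run.
    """

    def __init__(self, fn):
        self.fn = fn          # the expensive operation, e.g. policy regen
        self.pending = 0      # trigger calls not yet serviced
        self.folds = 0        # calls absorbed into an already-pending run
        self.runs = 0         # actual executions of fn

    def trigger(self) -> None:
        self.pending += 1
        if self.pending > 1:
            self.folds += 1

    def run_pending(self) -> None:
        if self.pending:
            self.runs += 1
            self.fn()
            self.pending = 0  # one run satisfies all folded requests
```

Five rapid-fire triggers serviced by one run yield four folds: the fold counter growing much faster than the run counter is the churn signature described above.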
When Hubble's flow buffer reaches 100% capacity, new flow events are dropped, creating observability blind spots. This occurs during traffic spikes, or whenever the sustained flow rate exceeds the rate at which consumers drain the buffer, and it prevents accurate troubleshooting and policy validation.
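A minimal model of the drop behavior, assuming a fixed-capacity buffer that rejects new events when full (names and counters are illustrative, not Hubble's internals):

```python
class FlowBuffer:
    """Fixed-capacity flow event buffer that drops new events when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.flows = []
        self.dropped = 0      # each drop is an observability blind spot

    def push(self, flow) -> bool:
        """Store a flow event; return False if it had to be dropped."""
        if len(self.flows) >= self.capacity:
            self.dropped += 1
            return False
        self.flows.append(flow)
        return True

    def utilization(self) -> float:
        return len(self.flows) / self.capacity
```

Once utilization hits 1.0, every additional event during the spike is lost until consumers drain the buffer, which is why sizing it for peak (not average) flow rates matters.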
When cilium_kvstore_quorum_errors_datadog increments, the cluster has lost consensus with the backing KVStore (etcd/consul). This prevents policy propagation, service discovery updates, and can cause cluster-wide connectivity failures as agents cannot sync state.
Elevated BPF map operation times indicate kernel datapath contention or CPU pressure affecting packet processing. This manifests as increased connection establishment latency and reduced throughput, particularly impacting high-connection-rate workloads.
When endpoints on remote nodes fail to respond to ICMP/HTTP health probes while host-level connectivity succeeds, the failure is a datapath or policy issue preventing traffic from reaching the endpoint's network namespace. This pattern isolates the problem to pod networking rather than node-to-node connectivity.
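The isolation logic described above can be sketched as a simple classifier over the two probe results (a hypothetical helper; the labels are illustrative, not cilium-health output):

```python
def classify_health(node_probe_ok: bool, endpoint_probe_ok: bool) -> str:
    """Map the (host probe, endpoint probe) pattern to a failure domain."""
    if not node_probe_ok:
        # Host itself unreachable: the network between nodes is suspect.
        return "node-to-node connectivity failure"
    if not endpoint_probe_ok:
        # Host reachable but pod is not: datapath or policy inside the node.
        return "pod datapath or policy issue"
    return "healthy"
```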