Istio Controller Queue Bottleneck
Severity: critical. The istiod controller queue can build up to 20K+ events during high pod churn, delaying endpoint updates by minutes. As a result, traffic continues to be routed to terminated pods long after they have been deleted.
Monitor istiod logs for controller queue depth and processing times. Watch for pilot_k8s_reg_events and pilot_k8s_cfg_events growing rapidly, and check pilot_eds_no_instances for services left with no healthy endpoints. If processing time per EndpointSlice update exceeds 800ms, or queue length exceeds 1000, the controller is overwhelmed.
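The thresholds above can be turned into a simple automated check. The sketch below parses istiod metrics in Prometheus text format (istiod exposes these on port 15014 at /metrics) and applies the two rules of thumb. Note the metric names pilot_worker_queue_depth and pilot_endpoint_update_seconds are assumptions for illustration only; substitute whatever queue-depth and processing-time metrics your Istio version and monitoring stack actually expose.

```python
# Sketch: evaluate this note's overload thresholds against scraped istiod
# metrics. Metric names here are illustrative assumptions, not canonical.

THRESHOLD_QUEUE_DEPTH = 1000      # events waiting in the controller queue
THRESHOLD_PROCESS_SECONDS = 0.8   # per-EndpointSlice update processing time

def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip any {label="..."} suffix
        try:
            values[name] = float(value)
        except ValueError:
            pass  # skip non-numeric samples
    return values

def controller_overwhelmed(metrics):
    """Apply the note's rule: queue > 1000 or per-update time > 800ms."""
    depth = metrics.get("pilot_worker_queue_depth", 0.0)       # assumed name
    proc = metrics.get("pilot_endpoint_update_seconds", 0.0)   # assumed name
    return depth > THRESHOLD_QUEUE_DEPTH or proc > THRESHOLD_PROCESS_SECONDS

sample = """
# HELP pilot_worker_queue_depth current depth of the controller queue
pilot_worker_queue_depth 20417
pilot_endpoint_update_seconds 1.2
"""
print(controller_overwhelmed(parse_metrics(sample)))  # True: both thresholds exceeded
```

In practice you would feed this from a scrape of the istiod pod rather than a hard-coded sample, and alert on the boolean.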
Reduce the rate of pod churn by preferring canary or blue-green rollouts over mass rolling restarts. Increase istiod CPU and memory requests and limits so the queue drains faster. Consider splitting very large clusters into multiple meshes. Upgrade to an Istio release with optimized controller queue handling. Monitor pilot_push_triggers to identify what is causing excessive pushes.
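Raising istiod's resources, one of the mitigations above, might look like the following patch against the istiod Deployment (container name "discovery") in istio-system. The values are illustrative placeholders, not recommendations; size them from observed usage.

```yaml
# Illustrative resource bump for istiod; values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
        - name: discovery
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```

If istiod is managed by an IstioOperator or Helm install, set the equivalent pilot resource values there instead, so the operator does not revert a manual patch.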