Istio Controller Queue Bottleneck
Severity: critical. The istiod controller queue can build up to 20K+ events during high pod churn, delaying endpoint updates by minutes. As a result, traffic continues to be routed to terminated pods long after they have been deleted.
Monitor istiod logs for controller queue depth and processing times. Watch for pilot_k8s_reg_events and pilot_k8s_cfg_events growing rapidly, and check pilot_eds_no_instances for services left with no healthy endpoints. If processing time per EndpointSlice update exceeds 800ms, or queue length exceeds 1000, the controller is overwhelmed.
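The thresholds above can be turned into a simple automated check. The sketch below parses istiod metrics in Prometheus text format (istiod exposes these on port 15014 at /metrics) and applies the two rules of thumb. Note the metric names pilot_worker_queue_depth and pilot_endpoint_update_seconds are assumptions for illustration only; substitute whatever queue-depth and processing-time metrics your Istio version and monitoring stack actually expose.

```python
# Sketch: evaluate this note's overload thresholds against scraped istiod
# metrics. Metric names here are illustrative assumptions, not canonical.

THRESHOLD_QUEUE_DEPTH = 1000      # events waiting in the controller queue
THRESHOLD_PROCESS_SECONDS = 0.8   # per-EndpointSlice update processing time

def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}, ignoring labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # strip any {label="..."} suffix
        try:
            values[name] = float(value)
        except ValueError:
            pass  # skip non-numeric samples
    return values

def controller_overwhelmed(metrics):
    """Apply the note's rule: queue > 1000 or per-update time > 800ms."""
    depth = metrics.get("pilot_worker_queue_depth", 0.0)       # assumed name
    proc = metrics.get("pilot_endpoint_update_seconds", 0.0)   # assumed name
    return depth > THRESHOLD_QUEUE_DEPTH or proc > THRESHOLD_PROCESS_SECONDS

sample = """
# HELP pilot_worker_queue_depth current depth of the controller queue
pilot_worker_queue_depth 20417
pilot_endpoint_update_seconds 1.2
"""
print(controller_overwhelmed(parse_metrics(sample)))  # True: both thresholds exceeded
```

In practice you would feed this from a scrape of the istiod pod rather than a hard-coded sample, and alert on the boolean.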
Reduce the rate of pod churn by preferring canary or blue-green rollouts over mass rolling restarts. Increase istiod CPU and memory requests and limits so the queue drains faster. Consider splitting very large clusters into multiple meshes. Upgrade to an Istio release with optimized controller queue handling. Monitor pilot_push_triggers to identify what is causing excessive pushes.
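Raising istiod's resources, one of the mitigations above, might look like the following patch against the istiod Deployment (container name "discovery") in istio-system. The values are illustrative placeholders, not recommendations; size them from observed usage.

```yaml
# Illustrative resource bump for istiod; values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istiod
  namespace: istio-system
spec:
  template:
    spec:
      containers:
        - name: discovery
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
```

If istiod is managed by an IstioOperator or Helm install, set the equivalent pilot resource values there instead, so the operator does not revert a manual patch.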