When the istiod control plane experiences high load or lock contention, configuration updates to Envoy proxies are delayed by minutes, causing traffic to continue routing to terminated pods and resulting in 503 errors.
Envoy sidecar proxies consume 2GB+ of memory per pod, causing OOMKills and degraded service performance. This occurs when Istio pushes very large configurations to sidecars in big clusters or when routing rules are poorly scoped.
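A common mitigation is to scope each sidecar's configuration with an Istio `Sidecar` resource, so istiod only pushes routes for the namespaces a workload actually calls instead of the whole mesh. A minimal sketch (the `frontend` namespace name is illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: frontend   # applies to all workloads in this namespace
spec:
  egress:
  - hosts:
    - "./*"             # services in the same namespace only
    - "istio-system/*"  # plus control-plane services
```

Narrowing the egress hosts shrinks the cluster/route tables Envoy must hold in memory, which directly reduces per-sidecar footprint.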
Expired certificates, or certificates that fail to rotate, cause widespread service-to-service authentication failures, resulting in complete traffic loss between services when mTLS is in STRICT mode.
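While certificate issuance is being repaired, mTLS can be temporarily relaxed from STRICT to PERMISSIVE so plaintext traffic is accepted alongside mTLS and services keep communicating. A hedged sketch; placing the resource in `istio-system` makes it mesh-wide, so in practice scope it to the affected namespace if you can:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # istio-system scope = mesh-wide
spec:
  mtls:
    mode: PERMISSIVE        # accept both mTLS and plaintext during recovery
```

Certificate state on a given sidecar can be inspected with `istioctl proxy-config secret <pod>` to confirm whether the workload cert has expired or was never rotated.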
Istio proxy CPU overhead reaches 40%+ of total cluster CPU during high-traffic periods, causing throttling and increased latency. Each request traversing Envoy's routing logic adds 50-60ms of p95 latency.
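When throttling is the immediate problem, per-workload proxy resources can be adjusted via pod-template annotations rather than mesh-wide defaults. A sketch using the standard `sidecar.istio.io` annotations (the values are illustrative, not recommendations):

```yaml
# Fragment of a Deployment's pod template
template:
  metadata:
    annotations:
      sidecar.istio.io/proxyCPU: "200m"        # proxy CPU request
      sidecar.istio.io/proxyCPULimit: "1000m"  # proxy CPU limit
      sidecar.istio.io/proxyMemory: "256Mi"    # proxy memory request
```

Raising the limit on hot-path workloads avoids CFS throttling of the sidecar, which is often a larger latency contributor than the routing logic itself.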
Incorrectly configured VirtualService routing rules cause traffic to be silently dropped or routed to the wrong destination. Common causes include case-sensitive host or header mismatches, subsets referenced without a matching DestinationRule, and conflicting route priorities.
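The most frequent variant is a VirtualService routing to a subset that no DestinationRule defines: Envoy then has no cluster for the route and the traffic is dropped. A minimal consistent pair (service and subset names are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews            # must match the service host exactly (case-sensitive)
  http:
  - route:
    - destination:
        host: reviews
        subset: v2     # must be defined in a DestinationRule
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v2
    labels:
      version: v2      # pods must actually carry this label
```

If the DestinationRule or the `version: v2` pod label is missing, the route exists but resolves to an empty cluster, which surfaces as silent drops or 503s rather than a configuration error.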
The Galley component fails to discover pod endpoints for services, preventing Istio from routing traffic correctly. This manifests as istio_galley_endpoint_no_pod errors and results in 503 upstream connection failures.
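A typical root cause of endpoint-discovery gaps is a Service selector that matches no pod labels, leaving the Service with zero endpoints. A sketch of the correspondence to verify (names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: reviews
spec:
  selector:
    app: reviews       # must exactly match the pod template's labels
  ports:
  - name: http         # older Istio releases also require protocol-prefixed port names
    port: 9080
```

`kubectl get endpoints reviews` showing an empty endpoint list confirms the selector/label mismatch before digging into the mesh layer.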
Mixer adapter configuration errors prevent telemetry collection and policy enforcement from working correctly, causing gaps in metrics and allowing policy violations to go undetected.
The istiod controller queue builds up to 20K+ events during high pod churn, causing minutes of delay before endpoint updates are processed. This results in traffic being sent to terminated pods long after they've been deleted.
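Push batching on istiod can be tuned so that bursts of endpoint events during churn are coalesced into fewer pushes instead of queuing individually. The relevant knobs are the `PILOT_DEBOUNCE_*` and `PILOT_PUSH_THROTTLE` environment variables on the istiod deployment (the values below are Istio's defaults, shown for illustration):

```yaml
# Fragment of the istiod Deployment's container spec
env:
- name: PILOT_DEBOUNCE_AFTER
  value: "100ms"   # wait this long after the last event before pushing
- name: PILOT_DEBOUNCE_MAX
  value: "10s"     # never delay a push longer than this
- name: PILOT_PUSH_THROTTLE
  value: "100"     # maximum concurrent pushes
```

Raising `PILOT_DEBOUNCE_AFTER` during heavy churn trades per-event push latency for far fewer total pushes, which drains the event queue faster; `pilot_proxy_convergence_time` is the metric to watch while tuning.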