Dashboard agent on head node becomes unavailable (connection refused on port 52365), preventing Serve application status checks and creating cascading failures in RayService management.
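A quick way to confirm this condition is a plain TCP probe of the agent port; "connection refused" at this layer is exactly what the Serve status checks surface. A minimal sketch (the function name is illustrative; 52365 is the default dashboard agent port from the alert above):

```python
import socket

def dashboard_agent_reachable(host: str, port: int = 52365,
                              timeout: float = 2.0) -> bool:
    """TCP probe of the Ray dashboard agent port. A refused or
    timed-out connection here is what cascades into RayService
    status-check failures."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from another node in the cluster distinguishes an agent crash (refused everywhere) from a network policy problem (refused only from some sources).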
Major page faults (process_major_page_faults_total) increase, indicating the OS is swapping vector data to disk and severely degrading query performance.
Snapshot recovery operations (snapshot_recovery_running) block shard activation during pod restarts, extending downtime and reducing cluster availability during deployments.
Cluster consensus operations (cluster_pending_operations_total) accumulate without completion, indicating network partitions or peer failures degrading cluster coordination.
Persistent volume usage (kubelet_volume_stats_used_bytes) approaches capacity while collections continue growing, risking write failures and cluster instability.
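The "approaches capacity" condition is usually expressed as a linear extrapolation of the usage metric (in PromQL this is what predict_linear() does). A minimal sketch of the same arithmetic:

```python
def days_until_full(used_bytes: float, capacity_bytes: float,
                    growth_bytes_per_day: float) -> float:
    """Linear extrapolation of kubelet_volume_stats_used_bytes:
    remaining space divided by the observed daily growth rate."""
    if growth_bytes_per_day <= 0:
        return float("inf")  # volume is shrinking or flat
    return (capacity_bytes - used_bytes) / growth_bytes_per_day

GiB = 2**30
# 80 GiB used of 100 GiB, growing 5 GiB/day -> 4 days of headroom.
print(days_until_full(80 * GiB, 100 * GiB, 5 * GiB))  # 4.0
```

Alerting on projected days-to-full rather than a raw percentage catches fast-growing collections before they hit a fixed threshold.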
Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.
Non-active replicas (collection_dead_replicas) approach or exceed the configured replication_factor minus write_consistency_factor, risking write failures and loss of data availability.
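The failure budget behind this threshold is simple arithmetic: writes need write_consistency_factor live replicas, so only replication_factor − write_consistency_factor replicas may be dead before writes are at risk. A minimal sketch (both factors come from the collection's replication config):

```python
def writes_at_risk(replication_factor: int,
                   write_consistency_factor: int,
                   dead_replicas: int) -> bool:
    """Writes require write_consistency_factor live replicas, so at
    most (replication_factor - write_consistency_factor) replicas may
    be dead before the alert condition is met."""
    tolerable_dead = replication_factor - write_consistency_factor
    return dead_replicas >= tolerable_dead

# 3 replicas, writes require 2 acknowledgements: a single dead
# replica already exhausts the failure budget.
print(writes_at_risk(3, 2, 1))  # True
```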
Container CPU throttling increases when vector search query volume spikes, indicating the cluster is CPU-constrained and unable to meet query demand efficiently.
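Throttling is typically quantified as the fraction of CFS scheduling periods in which the container hit its CPU quota, derived from the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. A minimal sketch of the ratio:

```python
def throttle_ratio(throttled_periods: int, total_periods: int) -> float:
    """Fraction of CFS periods in which the container was throttled,
    computed from deltas of the two cAdvisor counters taken over the
    same time window."""
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods

print(throttle_ratio(250, 1000))  # 0.25 -> throttled in 25% of periods
```

A sustained ratio during query spikes suggests raising the CPU limit or scaling out, rather than a transient scheduling artifact.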
Time differences exceeding 5 minutes between control plane and cluster nodes cause TLS validation failures, as nodes may incorrectly determine certificates are expired or not yet valid.
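The failure mode is that each node validates certificates against its own clock, so a skewed clock can push a freshly issued certificate outside its validity window. A minimal sketch (dates are illustrative):

```python
from datetime import datetime, timedelta, timezone

def cert_looks_invalid(not_before: datetime, not_after: datetime,
                       node_clock: datetime) -> bool:
    """TLS validation compares the validity window against the
    validating node's local clock, not the issuer's."""
    return node_clock < not_before or node_clock > not_after

# Certificate issued "now" by the control plane, valid for 24h.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)

# A node running 6 minutes behind sees the cert as not yet valid.
skewed_node = issued - timedelta(minutes=6)
print(cert_looks_invalid(issued, expires, skewed_node))  # True
```

This is why the alert threshold sits at 5 minutes: it leaves margin before skew exceeds typical certificate backdating and validation tolerances.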
The Linkerd tap service may fail to respond to live traffic inspection requests, eliminating critical debugging visibility; common causes include tap pod crashes, network policy blocks, and resource constraints.
Self-hosted LangSmith instances experience infrastructure issues (disk space, resource constraints, pod failures) that are not detected through application-level monitoring alone.
Connection failures and network errors spike during rolling deployments or node failures when pod IPs change faster than service mesh or load balancer updates can propagate. Downstream services continue sending traffic to terminating pods or stale endpoints.
Elevated apiserver_request_duration_seconds latencies and error rates in apiserver_request_total indicate API server overload or scheduler bottlenecks, causing slow pod scheduling, kubectl timeouts, and degraded cluster responsiveness.
CrewAI web pods experience OOMKilled restarts when memory limits are insufficient for concurrent agent workloads, especially with high WEB_CONCURRENCY and RAILS_MAX_THREADS settings, causing service disruptions.
CrewAI crew builds fail when BuildKit cannot authenticate to container registries, causing silent build failures and preventing crew deployment updates. Network connectivity issues or missing registry secrets compound the problem.
CoreDNS running on shared CPU instances experiences intermittent slowness due to CPU steal, particularly during neighbor workload spikes, causing unpredictable DNS latency that's difficult to attribute.
cilium_operator_ces_sync_errors indicates failures in synchronizing CiliumEndpointSlice resources. This breaks endpoint aggregation, causing the operator to fail to update global service state and potentially leading to incomplete service load balancing across the cluster.
Elevated cilium_k8s_client_rate_limiter_time_seconds indicates Cilium agents are being throttled by Kubernetes API server rate limits. This delays reaction to cluster state changes, causing stale service endpoints, delayed policy enforcement, and slow pod networking setup.
When policy regeneration events accumulate faster than they can be processed, Cilium folds multiple updates into single operations. High fold counts indicate policy churn overwhelming the agent, causing delayed enforcement and potential security gaps.
The AWS VPC CNI caps pod counts by ENI and IP-address limits rather than CPU (e.g., 58 pods on a c6g.2xlarge despite its 8 vCPUs), preventing full utilization of node compute capacity unless prefix delegation is enabled.
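The cap follows from AWS's max-pods formula: ENIs × (IPv4 addresses per ENI − 1) + 2, where one address per ENI is reserved as its primary IP and the +2 accounts for host-network pods. A minimal sketch reproducing the c6g.2xlarge figure (4 ENIs, 15 addresses each):

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI default max-pods: each ENI reserves one address
    as its primary IP; +2 covers host-network pods that consume no
    secondary IPs (e.g., aws-node, kube-proxy)."""
    return enis * (ips_per_eni - 1) + 2

# c6g.2xlarge: 4 ENIs x 15 IPv4 addresses per ENI
print(max_pods(4, 15))  # 58
```

With prefix delegation, each secondary slot instead carries a /28 prefix (16 addresses), which is why enabling it lifts the cap far above the per-address limit.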
When all EKS-managed node groups using a shared instance profile are deleted, AWS removes the instance profile role from aws-auth ConfigMap, breaking Cast AI-managed nodes that rely on the same role for kubelet authentication to the API server.