Dashboard agent on head node becomes unavailable (connection refused on port 52365), preventing Serve application status checks and creating cascading failures in RayService management.
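A quick way to confirm this condition is a plain TCP probe of the agent port; "connection refused" at this layer is exactly what the Serve status checks surface. A minimal sketch (the function name is illustrative; 52365 is the default dashboard agent port from the alert above):

```python
import socket

def dashboard_agent_reachable(host: str, port: int = 52365,
                              timeout: float = 2.0) -> bool:
    """TCP probe of the Ray dashboard agent port. A refused or
    timed-out connection here is what cascades into RayService
    status-check failures."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running this from another node in the cluster distinguishes an agent crash (refused everywhere) from a network policy problem (refused only from some sources).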
Major page faults (process_major_page_faults_total) increase, indicating the OS is swapping vector data to disk and severely degrading query performance.
Snapshot recovery operations (snapshot_recovery_running) block shard activation during pod restarts, extending downtime and reducing cluster availability during deployments.
Cluster consensus operations (cluster_pending_operations_total) accumulate without completion, indicating network partitions or peer failures degrading cluster coordination.
Persistent volume usage (kubelet_volume_stats_used_bytes) approaches capacity while collections continue growing, risking write failures and cluster instability.
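The "approaches capacity" condition is usually expressed as a linear extrapolation of the usage metric (in PromQL this is what predict_linear() does). A minimal sketch of the same arithmetic:

```python
def days_until_full(used_bytes: float, capacity_bytes: float,
                    growth_bytes_per_day: float) -> float:
    """Linear extrapolation of kubelet_volume_stats_used_bytes:
    remaining space divided by the observed daily growth rate."""
    if growth_bytes_per_day <= 0:
        return float("inf")  # volume is shrinking or flat
    return (capacity_bytes - used_bytes) / growth_bytes_per_day

GiB = 2**30
# 80 GiB used of 100 GiB, growing 5 GiB/day -> 4 days of headroom.
print(days_until_full(80 * GiB, 100 * GiB, 5 * GiB))  # 4.0
```

Alerting on projected days-to-full rather than a raw percentage catches fast-growing collections before they hit a fixed threshold.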
Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.
Non-active replicas (collection_dead_replicas) approach or exceed the configured replication_factor minus write_consistency_factor, risking write failures and loss of data availability.
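The failure budget behind this threshold is simple arithmetic: writes need write_consistency_factor live replicas, so only replication_factor − write_consistency_factor replicas may be dead before writes are at risk. A minimal sketch (both factors come from the collection's replication config):

```python
def writes_at_risk(replication_factor: int,
                   write_consistency_factor: int,
                   dead_replicas: int) -> bool:
    """Writes require write_consistency_factor live replicas, so at
    most (replication_factor - write_consistency_factor) replicas may
    be dead before the alert condition is met."""
    tolerable_dead = replication_factor - write_consistency_factor
    return dead_replicas >= tolerable_dead

# 3 replicas, writes require 2 acknowledgements: a single dead
# replica already exhausts the failure budget.
print(writes_at_risk(3, 2, 1))  # True
```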
Container CPU throttling increases when vector search query volume spikes, indicating the cluster is CPU-constrained and unable to meet query demand efficiently.
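Throttling is typically quantified as the fraction of CFS scheduling periods in which the container hit its CPU quota, derived from the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. A minimal sketch of the ratio:

```python
def throttle_ratio(throttled_periods: int, total_periods: int) -> float:
    """Fraction of CFS periods in which the container was throttled,
    computed from deltas of the two cAdvisor counters taken over the
    same time window."""
    if total_periods == 0:
        return 0.0
    return throttled_periods / total_periods

print(throttle_ratio(250, 1000))  # 0.25 -> throttled in 25% of periods
```

A sustained ratio during query spikes suggests raising the CPU limit or scaling out, rather than a transient scheduling artifact.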
Time differences exceeding 5 minutes between control plane and cluster nodes cause TLS validation failures, as nodes may incorrectly determine certificates are expired or not yet valid.
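The failure mode is that each node validates certificates against its own clock, so a skewed clock can push a freshly issued certificate outside its validity window. A minimal sketch (dates are illustrative):

```python
from datetime import datetime, timedelta, timezone

def cert_looks_invalid(not_before: datetime, not_after: datetime,
                       node_clock: datetime) -> bool:
    """TLS validation compares the validity window against the
    validating node's local clock, not the issuer's."""
    return node_clock < not_before or node_clock > not_after

# Certificate issued "now" by the control plane, valid for 24h.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)

# A node running 6 minutes behind sees the cert as not yet valid.
skewed_node = issued - timedelta(minutes=6)
print(cert_looks_invalid(issued, expires, skewed_node))  # True
```

This is why the alert threshold sits at 5 minutes: it leaves margin before skew exceeds typical certificate backdating and validation tolerances.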
The Linkerd tap service may fail to respond to live traffic inspection requests, eliminating critical debugging visibility; common causes include tap pod crashes, network policy blocks, and resource constraints.
Self-hosted LangSmith instances experience infrastructure issues (disk space, resource constraints, pod failures) that are not detected through application-level monitoring alone.
Connection failures and network errors spike during rolling deployments or node failures when pod IPs change faster than service mesh or load balancer updates can propagate. Downstream services continue sending traffic to terminating pods or stale endpoints.
Elevated apiserver_request_duration_seconds latencies and error rates in apiserver_request_total indicate API server overload or scheduler bottlenecks, causing slow pod scheduling, kubectl timeouts, and degraded cluster responsiveness.
CrewAI web pods experience OOMKilled restarts when memory limits are insufficient for concurrent agent workloads, especially with high WEB_CONCURRENCY and RAILS_MAX_THREADS settings, causing service disruptions.
CrewAI crew builds fail when BuildKit cannot authenticate to container registries, causing silent build failures and preventing crew deployment updates. Network connectivity issues or missing registry secrets compound the problem.
CoreDNS running on shared CPU instances experiences intermittent slowness due to CPU steal, particularly during neighbor workload spikes, causing unpredictable DNS latency that's difficult to attribute.
cilium_operator_ces_sync_errors indicates failures in synchronizing CiliumEndpointSlice resources. This breaks endpoint aggregation, causing the operator to fail to update global service state and potentially leading to incomplete service load balancing across the cluster.
Elevated cilium_k8s_client_rate_limiter_time_seconds indicates Cilium agents are being throttled by Kubernetes API server rate limits. This delays reaction to cluster state changes, causing stale service endpoints, delayed policy enforcement, and slow pod networking setup.
When policy regeneration events accumulate faster than they can be processed, Cilium folds multiple updates into single operations. High fold counts indicate policy churn overwhelming the agent, causing delayed enforcement and potential security gaps.
The AWS VPC CNI caps pod counts by ENI and IP-address limits rather than CPU (e.g., 58 pods on a c6g.2xlarge despite its 8 vCPUs), preventing full utilization of node compute capacity unless prefix delegation is enabled.
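The cap follows from AWS's max-pods formula: ENIs × (IPv4 addresses per ENI − 1) + 2, where one address per ENI is reserved as its primary IP and the +2 accounts for host-network pods. A minimal sketch reproducing the c6g.2xlarge figure (4 ENIs, 15 addresses each):

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI default max-pods: each ENI reserves one address
    as its primary IP; +2 covers host-network pods that consume no
    secondary IPs (e.g., aws-node, kube-proxy)."""
    return enis * (ips_per_eni - 1) + 2

# c6g.2xlarge: 4 ENIs x 15 IPv4 addresses per ENI
print(max_pods(4, 15))  # 58
```

With prefix delegation, each secondary slot instead carries a /28 prefix (16 addresses), which is why enabling it lifts the cap far above the per-address limit.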
When all EKS-managed node groups using a shared instance profile are deleted, AWS removes the instance profile role from aws-auth ConfigMap, breaking Cast AI-managed nodes that rely on the same role for kubelet authentication to the API server.