Major page faults (process_major_page_faults_total) increase, indicating the OS is repeatedly reading vector data back from disk (swap or memory-mapped files) instead of serving it from RAM, which severely degrades query performance.
Snapshot recovery operations (snapshot_recovery_running) block shard activation during pod restarts, extending downtime and reducing cluster availability during deployments.
Running optimization tasks (collection_running_optimizations) remain high or increase over time, indicating segment merging cannot keep pace with write throughput.
Cluster consensus operations (cluster_pending_operations_total) accumulate without completion, indicating network partitions or peer failures degrading cluster coordination.
Persistent volume usage (kubelet_volume_stats_used_bytes) approaches capacity while collections continue growing, risking write failures and cluster instability.
Container working set memory approaches or exceeds configured limits, causing OOM kills or evictions that disrupt vector search availability.
Non-active replicas (collection_dead_replicas) approach or exceed the configured replication_factor minus write_consistency_factor, risking write failures and reduced data availability.
Container CPU throttling increases when vector search query volume spikes, indicating the cluster is CPU-constrained and unable to meet query demand efficiently.
REST and gRPC query durations (rest_responses_duration_seconds, grpc_responses_duration_seconds) increase when filters are applied, indicating missing or poorly suited payload indexes on the filtered fields.
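
The dead-replica threshold above (replication_factor minus write_consistency_factor) is simple arithmetic, and a small worked example may make the "approach or exceed" wording concrete. The sketch below is illustrative only: the function names dead_replica_headroom and classify_headroom are hypothetical, and it assumes a write succeeds as long as at least write_consistency_factor replicas are still active. Feeding it live values from collection_dead_replicas is left out.

```python
def dead_replica_headroom(replication_factor: int,
                          write_consistency_factor: int,
                          dead_replicas: int) -> int:
    """Return how many more replicas can become non-active before the
    threshold (replication_factor - write_consistency_factor) is exceeded.

    Hypothetical helper; assumes writes need write_consistency_factor
    active replicas to succeed.
    """
    if not 1 <= write_consistency_factor <= replication_factor:
        raise ValueError("write_consistency_factor must be in [1, replication_factor]")
    if dead_replicas < 0:
        raise ValueError("dead_replicas must be non-negative")
    # Number of non-active replicas the configuration can absorb.
    tolerable = replication_factor - write_consistency_factor
    return tolerable - dead_replicas


def classify_headroom(headroom: int) -> str:
    """Map the remaining headroom to a coarse alert state (labels are assumptions)."""
    if headroom < 0:
        return "exceeded"      # more dead replicas than the configuration tolerates
    if headroom == 0:
        return "at_threshold"  # one more failure would exceed the threshold
    return "ok"


if __name__ == "__main__":
    # replication_factor=3, write_consistency_factor=2, one dead replica:
    # tolerable = 1, headroom = 0
    print(classify_headroom(dead_replica_headroom(3, 2, 1)))
```

With replication_factor=3 and write_consistency_factor=2, a single non-active replica already leaves no headroom, so alerting at zero headroom rather than waiting for a negative value gives time to react before writes start failing.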