Elasticsearch insights
Open Source | Versions: 8.11.0 | 111 metrics

DataHub metadata search results and lineage views show stale information because Elasticsearch indices are not updated in a timely manner, impacting data discovery and incident response.
Slow storage write operations block collector workers, causing span reception to slow and queues to back up, ultimately leading to dropped traces.
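The failure mode above can be sketched with a toy bounded queue: once the storage writer stalls, the collector's buffer fills and newly received spans are dropped. All names here (`span_queue`, the queue size) are illustrative, not Jaeger internals.

```python
# Toy sketch: a bounded span queue in front of a stalled storage writer
# starts dropping spans once it fills. Sizes are illustrative.
from queue import Full, Queue

span_queue: Queue = Queue(maxsize=3)  # collector's in-memory buffer
dropped = 0

# Receive 5 spans while the storage writer is blocked on slow writes.
for span_id in range(5):
    try:
        span_queue.put_nowait(span_id)
    except Full:
        dropped += 1  # surfaced as dropped spans/traces in collector metrics

# Once writes recover the queue drains, but the 2 dropped spans are gone.
```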
High P95/P99 query latencies (>5 seconds) make Jaeger UI unusable during incident troubleshooting, typically caused by slow storage reads, overloaded shards, or inefficient trace queries.
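A minimal sketch of the P95/P99 check itself, using nearest-rank percentiles over a made-up sample of query latencies in seconds:

```python
# Nearest-rank percentile over a (made-up) sample of query latencies.
from math import ceil

latencies = [0.2, 0.3, 0.4, 0.5, 0.8, 1.1, 1.6, 2.4, 5.2, 9.7]

def percentile(values, pct):
    """Smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = ceil(len(ordered) * pct / 100)
    return ordered[rank - 1]

p95 = percentile(latencies, 95)
ui_unusable = p95 > 5.0  # the >5 s threshold from the text
```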
When JVM heap usage stays above 85% for extended periods, garbage collection pauses increase dramatically, leading to node unresponsiveness, cluster state propagation failures, and potential split-brain scenarios.
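A sketch of how to flag such nodes: the field names mirror the JVM section of the `_nodes/stats` response, but the sample payload itself is made up.

```python
# Flag nodes whose heap usage exceeds the 85% threshold from the text.
# Field names follow _nodes/stats; the payload is a made-up sample.
stats = {
    "nodes": {
        "node-1": {"jvm": {"mem": {"heap_used_percent": 91}}},
        "node-2": {"jvm": {"mem": {"heap_used_percent": 62}}},
    }
}

HEAP_THRESHOLD = 85  # sustained usage above this risks long GC pauses

hot_nodes = sorted(
    node_id
    for node_id, node in stats["nodes"].items()
    if node["jvm"]["mem"]["heap_used_percent"] > HEAP_THRESHOLD
)
```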
When a node crosses the low disk watermark (85% full by default), Elasticsearch stops allocating new shards to it; past the high watermark (90%), it starts relocating shards away. Multiple nodes hitting watermarks simultaneously can trigger cascading relocations that overload cluster I/O and delay recovery.
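The default thresholds can be summarized as a small classifier, keyed to the `cluster.routing.allocation.disk.watermark.low/high/flood_stage` settings:

```python
# Classify a node's disk usage against the default disk watermarks.
def watermark_state(used_fraction, low=0.85, high=0.90, flood=0.95):
    if used_fraction >= flood:
        return "flood_stage"  # affected indices forced read-only
    if used_fraction >= high:
        return "high"         # shards actively relocated off the node
    if used_fraction >= low:
        return "low"          # no new shards allocated to the node
    return "ok"
```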
Requests for high-offset results (e.g., page 500 at 10 results per page) force every shard to fetch and sort all preceding documents before discarding them, causing coordinating-node memory pressure and latency that grows steeply with offset; beyond 10,000 results (the default index.max_result_window), Elasticsearch rejects from/size requests outright.
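A sketch contrasting deep from/size pagination with the search_after alternative; the page size, sort fields, and page number below are illustrative:

```python
# Contrast deep from/size pagination with search_after (field names illustrative).
PAGE_SIZE = 10

def from_size_body(page):
    # Each shard must collect from+size hits before the coordinating node
    # merges and discards the offset, so per-request cost grows with depth.
    return {"from": (page - 1) * PAGE_SIZE, "size": PAGE_SIZE}

def search_after_body(last_sort_values):
    # Resumes from the previous page's last sort values instead of an offset,
    # so every page costs roughly the same regardless of depth.
    return {
        "size": PAGE_SIZE,
        "sort": [{"startTime": "desc"}, {"traceID": "asc"}],
        "search_after": last_sort_values,
    }

deep_page = from_size_body(page=500)
```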
Expensive queries or indexing operations monopolize thread pool workers, causing benign requests to queue indefinitely. This manifests as stuck tasks in _cat/tasks, with operations that normally take milliseconds taking minutes while thread pools show 100% utilization.
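Saturation is visible in `_cat/thread_pool` output requested with `h=node_name,name,active,queue,rejected`; a sketch of spotting it (the sample rows are made up):

```python
# Spot saturated pools from _cat/thread_pool-style rows (sample data made up).
raw = """\
node-1 search 13 870 142
node-1 write   8   0   0
node-2 search 13 655  97
"""

saturated = []
for row in raw.strip().splitlines():
    node, pool, active, queue, rejected = row.split()
    # Non-empty queues or rejections mean workers can't keep up.
    if int(queue) > 0 or int(rejected) > 0:
        saturated.append((node, pool, int(queue), int(rejected)))
```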
When primary shards cannot be assigned (elasticsearch.cluster.health == 2), data becomes unavailable and the cluster enters a red state. This occurs when there are too few nodes, shard allocation rules are misconfigured, or nodes fail without sufficient replica coverage.
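The standard diagnostic is the `_cluster/allocation/explain` API, which reports why a shard is unassigned. A sketch of its request body; the index name below is hypothetical, while the `index`/`shard`/`primary` fields follow the API:

```python
# Request body for POST _cluster/allocation/explain (index name hypothetical).
import json

explain_request = {
    "index": "jaeger-span-2024-05-01",
    "shard": 0,
    "primary": True,  # explain the unassigned primary, not a replica
}
payload = json.dumps(explain_request)
```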
Refresh operations make indexed documents searchable by writing the in-memory buffer to disk segments. The default 1-second interval creates overhead that scales with indexing rate. When refresh time exceeds the interval, indexing throughput collapses and latency spikes.
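A sketch of the usual mitigation: relax `index.refresh_interval` via a `PUT <index>/_settings` payload so refreshes stop competing with indexing. The chosen values are illustrative, not recommendations.

```python
# Settings payloads trading search freshness for indexing throughput.
import json

def refresh_settings(interval):
    return {"index": {"refresh_interval": interval}}

bulk_load = json.dumps(refresh_settings("-1"))      # disable refresh during bulk loads
steady_state = json.dumps(refresh_settings("30s"))  # refresh every 30 s instead of 1 s
```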
Background segment merges consolidate small Lucene segments into larger ones, reducing file count but consuming I/O. Default throttling (20MB/s) keeps merges from overwhelming cluster I/O, but excessive throttling lets small segments accumulate (segment explosion) and degrades query performance.
Field data (an un-inverted, in-memory structure used for sorting and aggregations on text fields) loads into the JVM heap on first access and persists for the lifetime of the segment. When the circuit breaker limit or cache size is too small, frequent evictions cause repeated, expensive field data loading, spiking CPU and heap pressure.
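The two knobs involved can be sketched as follows; the percentages are illustrative, and the key point is that the cache size should sit below the breaker limit so entries are evicted before the breaker starts rejecting requests:

```python
# Fielddata cache cap vs. circuit breaker limit (values illustrative).
node_settings = {
    # static node setting (elasticsearch.yml): evict before the breaker trips
    "indices.fielddata.cache.size": "30%",
}
cluster_settings = {
    "persistent": {
        # dynamic cluster setting: reject loads that would exceed this heap share
        "indices.breaker.fielddata.limit": "40%",
    }
}
```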