Failed Shard Causing Distributed Query Timeouts

critical

reliabilityUpdated Jan 5, 2021

A single failed or unresponsive shard in a distributed collection causes cascading timeouts across the cluster as queries attempt to reach all shards, eventually making other nodes unresponsive even though they are healthy.

Sources

A Solr performance debugging toolkit - OpenSource Connectionsopensourceconnections.com

Technologies:

Apache SolrThe root cause of this issue originates in Apache Solr

How to detect:

Monitor for increasing timeout rates across multiple nodes. Check for HTTP errors or connection failures to specific shards. Look for asymmetric error rates - one shard showing errors while others are healthy indicates shard-specific failure.

Recommended action:

Identify the failed shard via admin API and logs. Isolate or restart the problematic shard. In SolrCloud, trigger replica recovery or failover. For secondary collections (like spellcheck), consider temporarily disabling the feature until recovery completes.