Snapshot Lifecycle Policy Failures

critical

reliability

Updated Mar 2, 2026

Failed snapshot operations risk data loss and violate backup SLAs. SLM policy failures can result from repository unavailability, insufficient permissions, or storage quota issues.

Technologies:

Elasticsearch

elasticsearch.slm

elasticsearch.snapshot

elasticsearch.cluster.shards

How to detect:

The elasticsearch.slm or elasticsearch.snapshot metrics report failed operations, or snapshot operations do not complete within the expected time window.
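The detection rule above can also be approximated directly from the output of GET _slm/policy. The following is a minimal sketch, not a definitive implementation: it assumes each policy entry carries optional last_success and last_failure objects with a "time" field in epoch milliseconds, which should be checked against the Elasticsearch version in use. The function name unhealthy_policies and the max_success_age_ms threshold are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def unhealthy_policies(
    policies: Dict[str, Dict[str, Any]],
    now_ms: int,
    max_success_age_ms: int,
) -> Dict[str, str]:
    """Return {policy_id: reason} for policies that look unhealthy.

    `policies` is the parsed JSON body of GET _slm/policy (assumed shape:
    policy id -> object with optional last_success / last_failure entries,
    each holding a "time" value in epoch milliseconds).
    """
    problems: Dict[str, str] = {}
    for policy_id, info in policies.items():
        success_ms = (info.get("last_success") or {}).get("time")
        failure_ms = (info.get("last_failure") or {}).get("time")

        if failure_ms is not None and (success_ms is None or failure_ms > success_ms):
            # The most recent recorded run ended in failure.
            problems[policy_id] = "last run failed"
        elif success_ms is None:
            # No run has been recorded at all yet.
            problems[policy_id] = "never succeeded"
        elif now_ms - success_ms > max_success_age_ms:
            # No failure recorded, but snapshots are not completing on schedule.
            problems[policy_id] = "last success too old"
    return problems
```

With a daily policy, a max_success_age_ms of roughly 26 hours leaves some slack for long-running snapshots; that threshold is a judgment call and not something the source prescribes.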

Recommended action:

Investigate via the _slm/policy API and the _snapshot repository and status APIs. Check:

(1) Repository accessibility: verify network connectivity and credentials for the S3, Azure, or GCS backend.

(2) Storage quota: ensure the snapshot repository has sufficient free space.

(3) Cluster state: a red status (unassigned primary shards) causes failed or partial snapshots, so check elasticsearch.cluster.health and resolve it first. A yellow status (unassigned replicas) does not by itself block snapshots, but is still worth resolving.

(4) Snapshot conflicts: overlapping snapshot operations on the same repository can be rejected or delayed. Older Elasticsearch versions allow only one snapshot at a time.

Review the SLM policy configuration for retention and scheduling conflicts. Monitor snapshot duration trends: steadily increasing durations indicate growing data volume or degrading storage performance. Periodically verify that snapshots can actually be restored by running test restores to a different cluster.
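As a rough illustration of the checks above, the sketch below lists the REST requests that correspond to each step and computes a simple duration trend from a snapshot listing. It rests on several assumptions: the cluster is reachable at a base URL without authentication (real deployments usually need credentials and TLS settings), and snapshot entries returned by GET _snapshot/<repository>/_all carry state, start_time_in_millis, and duration_in_millis fields. The helper names diagnostic_requests, call, and duration_growth, and the window size, are hypothetical.

```python
import json
import urllib.request
from statistics import mean
from typing import Any, Dict, List, Optional, Tuple


def diagnostic_requests(repository: str) -> List[Tuple[str, str]]:
    """Ordered (method, path) pairs covering the recommended checks."""
    return [
        ("GET", "/_slm/policy"),                       # policy state, last success/failure
        ("GET", "/_slm/stats"),                        # aggregate SLM failure counters
        ("POST", f"/_snapshot/{repository}/_verify"),  # repository accessibility
        ("GET", "/_cluster/health"),                   # red/yellow/green status
        ("GET", f"/_snapshot/{repository}/_current"),  # snapshots currently running
    ]


def call(base_url: str, method: str, path: str, timeout: float = 10.0) -> Any:
    """Send one request and parse the JSON body (authentication omitted)."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method=method)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def duration_growth(snapshots: List[Dict[str, Any]], window: int = 3) -> Optional[float]:
    """Ratio of mean duration of the newest `window` successful snapshots
    to the oldest `window`. Returns None when there is too little data."""
    done = sorted(
        (s for s in snapshots if s.get("state") == "SUCCESS"),
        key=lambda s: s["start_time_in_millis"],
    )
    if len(done) < 2 * window:
        return None
    oldest = mean(s["duration_in_millis"] for s in done[:window])
    newest = mean(s["duration_in_millis"] for s in done[-window:])
    return newest / oldest if oldest else None
```

A typical use would be to loop over diagnostic_requests("my_repo") and pass each pair to call("http://localhost:9200", method, path), where the repository name and URL are placeholders. A duration_growth ratio well above 1.0 suggests snapshots are taking longer over time, which matches the data-growth or storage-performance concern described above.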