Self-Hosted Deployment Health Blind Spots

critical

reliability

Updated Feb 23, 2026

Self-hosted LangSmith instances can hit infrastructure problems (disk space exhaustion, resource constraints, pod failures) that application-level monitoring alone does not detect.

Sources

Troubleshooting - Docs by LangChain (docs.langchain.com)

Troubleshooting for self-hosted deployments - Docs by LangChain (docs.langchain.com)

Technologies:

LangSmith: Symptoms of this issue are visible in LangSmith metrics and logs.

Kubernetes: The root cause of this issue originates in Kubernetes.

ClickHouse: The root cause of this issue originates in ClickHouse.

How to detect:

For self-hosted deployments, monitor Kubernetes events, pod status, and ClickHouse disk usage. Track the pending_runs metric for queue buildup. Look for ERROR logs in the langsmith-backend, langsmith-platform-backend, and langsmith-queue services.
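
The pod-status part of this check can be sketched in Python. This is a minimal illustration, not an official LangSmith tool: it assumes the pod list has already been exported as JSON (for example with kubectl get pods -o json), and the function name find_unhealthy_pods and the shape of its findings are hypothetical.

```python
import json

# Container waiting reasons that indicate an image could not be pulled.
IMAGE_PULL_REASONS = {"ErrImagePull", "ImagePullBackOff"}


def find_unhealthy_pods(pod_list_json: str) -> list[dict]:
    """Return one finding per pod that is not running cleanly.

    pod_list_json is the JSON text of a Kubernetes pod list.
    """
    findings = []
    for pod in json.loads(pod_list_json).get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        problems = []

        # Pending, Failed, and Unknown phases all deserve a closer look.
        if phase not in ("Running", "Succeeded"):
            problems.append(f"phase={phase}")

        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                reason = waiting.get("reason", "Unknown")
                kind = "image pull error" if reason in IMAGE_PULL_REASONS else "waiting"
                problems.append(f"{cs['name']}: {kind} ({reason})")
            elif phase == "Running" and not cs.get("ready", False):
                # Running but not ready usually means a failing readiness probe.
                problems.append(f"{cs['name']}: not ready")

        if problems:
            findings.append({"pod": name, "problems": problems})
    return findings
```

Any pod this reports is a candidate for kubectl describe, which shows the events behind the status. ClickHouse disk usage and the pending_runs metric need separate checks, since neither appears in pod status.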

Recommended action:

Implement infrastructure-level monitoring using kubectl describe and log analysis. Set up alerts for ClickHouse disk space exhaustion (NOT_ENOUGH_SPACE errors), failed liveness/readiness probes, and image pull errors. Monitor queue depth via the pending_runs metric to detect ingestion bottlenecks.
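
As a rough sketch of how these alert conditions might be evaluated, the Python below classifies log or event lines by substring and checks a series of pending_runs samples against a threshold. Several things here are assumptions rather than documented LangSmith behavior: the category names, the function names classify_lines and queue_backing_up, the threshold, and the idea that pending_runs has already been collected into a list of numbers. NOT_ENOUGH_SPACE comes from the recommendation above; the probe and image pull strings are the wording Kubernetes typically uses in events.

```python
from collections import Counter

# Substring to look for -> hypothetical alert category.
PATTERNS = {
    "NOT_ENOUGH_SPACE": "clickhouse_disk_full",
    "Liveness probe failed": "probe_failure",
    "Readiness probe failed": "probe_failure",
    "ErrImagePull": "image_pull_error",
    "ImagePullBackOff": "image_pull_error",
}


def classify_lines(lines):
    """Count log or event lines per alert category (first match wins)."""
    counts = Counter()
    for line in lines:
        for needle, category in PATTERNS.items():
            if needle in line:
                counts[category] += 1
                break
    return counts


def queue_backing_up(samples, threshold, min_consecutive=3):
    """True if the most recent min_consecutive samples all exceed threshold.

    samples is a chronological list of pending_runs values; requiring
    several consecutive readings avoids alerting on a single spike.
    """
    recent = samples[-min_consecutive:]
    return len(recent) == min_consecutive and all(s > threshold for s in recent)
```

A reasonable threshold depends on normal ingestion volume for the deployment, so it is best taken from observed baseline values rather than fixed in advance.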