RestLI Server Error Rate Spike

API Reliability

critical

reliability

Updated Feb 23, 2026

The DataHub backend API is returning errors at an elevated rate, which affects metadata ingestion, UI operations, and external integrations. This may indicate service degradation or infrastructure issues.

Sources

Monitoring DataHub (docs.datahub.com)

Technologies:

DataHub

Symptoms of this issue are visible in DataHub metrics and logs.

How to detect:

Monitor the restli_server_error rate and the share of http.server.request.duration samples that carry 5xx status codes. Alert when the error rate exceeds its baseline (e.g., >1% of requests) or when specific critical endpoints show elevated errors. Correlate with http.server.active_requests to determine whether errors occur only under high load or are consistent across load levels.
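The detection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sample` type, the `classify` function, and the `high_load` cutoff of 100 active requests are hypothetical names and values, not part of DataHub. The 1% threshold mirrors the example baseline in the text. Each sample is assumed to hold request and 5xx counts for one scrape window, plus the http.server.active_requests reading for that window.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    """One scrape window (hypothetical shape, not a DataHub type)."""
    total_requests: int   # all requests seen in the window
    error_requests: int   # requests that returned a 5xx status
    active_requests: int  # in-flight requests observed in the window


def error_rate(sample: Sample) -> float:
    """Fraction of requests in the window that failed with 5xx."""
    if sample.total_requests == 0:
        return 0.0
    return sample.error_requests / sample.total_requests


def classify(samples: Iterable[Sample],
             threshold: float = 0.01,
             high_load: int = 100) -> str:
    """Return 'ok', 'load_correlated', or 'load_independent'.

    'load_correlated' means every window that breached the threshold
    also had at least `high_load` active requests.
    """
    breaching = [s for s in samples if error_rate(s) > threshold]
    if not breaching:
        return "ok"
    if all(s.active_requests >= high_load for s in breaching):
        return "load_correlated"
    return "load_independent"
```

A "load_correlated" result points toward capacity limits, while "load_independent" suggests a failing dependency or a defect that is present regardless of traffic.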

Recommended action:

Review the DataHub GMS logs for stack traces and error details. Determine whether errors are concentrated on specific API endpoints (metadata ingestion, search, lineage queries) or spread across all of them. Check the downstream dependencies: Elasticsearch cluster health (connection failures), Kafka broker availability (metadata event processing), and MySQL/PostgreSQL connection pool exhaustion. Monitor process.cpu.utilization and system_cpu_usage to rule out CPU saturation. Scale GMS pods if the request rate (http_server_requested) exceeds capacity. Implement circuit breakers for downstream dependencies to prevent cascading failures.
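As an illustration of the circuit-breaker recommendation, here is a minimal generic sketch in Python. It is not DataHub code, and GMS itself runs on the JVM, where an existing resilience library would normally be used instead. The names `CircuitBreaker` and `CircuitOpenError`, and the default threshold and timeout, are assumptions chosen for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Minimal circuit breaker (hypothetical helper, for illustration)."""

    def __init__(self, failure_threshold: int = 3,
                 reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Still open: fail fast without touching the dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open the circuit, or re-open it after a failed trial.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each downstream client call (for example, a search query or a database read) in its own breaker lets callers fail fast while a dependency is down instead of holding request threads until timeouts expire.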