RestLI Server Error Rate Spike API Reliability
criticalDataHub backend API experiencing elevated error rates impacting metadata ingestion, UI operations, and external integrations, potentially indicating service degradation or infrastructure issues.
Monitor restli_server_error rate and http.server.request.duration error responses (5xx status codes). Alert when error rate exceeds baseline (e.g., >1% of requests) or when specific critical endpoints show elevated errors. Correlate with http.server.active_requests to identify if errors occur during high load or are consistent across load levels.
Review DataHub GMS logs for stack traces and error details. Identify if errors are concentrated on specific API endpoints (metadata ingestion, search, lineage queries) or distributed. Check dependencies: Elasticsearch cluster health (connection failures), Kafka broker availability (metadata event processing), MySQL/PostgreSQL database connection pool exhaustion. Monitor process.cpu.utilization and system_cpu_usage to rule out CPU saturation. Scale GMS pods if request rate (http_server_requested) exceeds capacity. Implement circuit breakers for downstream dependencies to prevent cascade failures.