flink_jobmanager_job_uptime
The time that the job has been running without interruption, in milliseconds. Returns -1 for completed jobs.
Available on:
Datadog (1)
Interface Metrics (1)
Dimensions: None
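Because the metric mixes a real duration with a -1 sentinel for completed jobs, consumers usually need to separate the two cases before doing arithmetic on the value. The following is a minimal sketch, assuming the value arrives as an integer number of milliseconds (the unit given in Flink's own metric documentation); the function name `interpret_uptime` is hypothetical and not part of any Flink or Datadog API.

```python
from datetime import timedelta
from typing import Optional

# Sentinel the metric reports for completed jobs (from the description above).
COMPLETED_SENTINEL = -1


def interpret_uptime(raw_ms: int) -> Optional[timedelta]:
    """Convert a raw uptime sample to a timedelta.

    Returns None when the sample is the completed-job sentinel, so callers
    cannot accidentally treat -1 as a duration.
    Assumption: raw_ms is expressed in milliseconds.
    """
    if raw_ms == COMPLETED_SENTINEL:
        return None
    if raw_ms < 0:
        # Any other negative value is unexpected; fail loudly rather than guess.
        raise ValueError(f"unexpected uptime value: {raw_ms}")
    return timedelta(milliseconds=raw_ms)
```

Returning `None` rather than a zero duration keeps a completed job distinguishable from a job that has only just started.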
Knowledge Base (3 documents, 0 chunks)
Operating Flink Is Hard: What does this really mean? And how to go about it? (best practices, 1627 words, score: 0.85)
This blog post provides operational best practices for running Apache Flink in production, emphasizing that Flink jobs should be treated like microservices. It covers capacity planning, performance testing, monitoring strategies, and how different teams (platform engineers vs application developers) should approach observability with different metrics and alert thresholds.
Optimizing Flink job restart times for task recovery and scaling operations with Amazon EMR on EKS - Amazon EMR (documentation, 1722 words, score: 0.75)
This AWS documentation page describes optimization techniques for Flink job restart times during task recovery and scaling operations on Amazon EMR on EKS. It covers task-local recovery, EBS volume mounting, incremental checkpointing, and fine-grained recovery mechanisms to reduce recovery time from minutes to seconds.
Monitoring Apache Flink Applications 101 (best practices, 2666 words, score: 0.95)
This blog post provides comprehensive guidance on monitoring Apache Flink applications, covering the built-in metrics system and the key metrics to track for health, progress, throughput, and latency. It includes specific metric names, dashboard examples, and recommended alerts for production deployments.
Related Insights (1)
Job Restart Storm (critical)
Frequent job restarts indicate systemic instability from resource exhaustion, external dependency failures, or configuration issues, compounding into extended downtime.
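One way to spot this pattern from the uptime metric alone is to look for samples where uptime drops, since an uninterrupted job can only report a growing value. The sketch below is illustrative only: the function names, the 10-minute window, and the threshold of 3 restarts are hypothetical choices, not values taken from Flink or Datadog, and the samples are assumed to be sorted by timestamp.

```python
from typing import List, Sequence, Tuple

# One sample: (unix timestamp in seconds, uptime in milliseconds).
Sample = Tuple[float, int]


def restart_times(samples: Sequence[Sample]) -> List[float]:
    """Return timestamps at which uptime dropped, suggesting a restart."""
    restarts: List[float] = []
    previous = None
    for ts, uptime_ms in samples:
        if uptime_ms < 0:
            # -1 marks a completed job; it is not a restart signal.
            previous = None
            continue
        if previous is not None and uptime_ms < previous:
            restarts.append(ts)
        previous = uptime_ms
    return restarts


def is_restart_storm(
    samples: Sequence[Sample],
    window_s: float = 600.0,  # hypothetical window length
    max_restarts: int = 3,    # hypothetical tolerated restarts per window
) -> bool:
    """True if more than max_restarts inferred restarts fall within window_s."""
    times = restart_times(samples)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most window_s.
        while times[end] - times[start] > window_s:
            start += 1
        if end - start + 1 > max_restarts:
            return True
    return False
```

This approach undercounts when several restarts occur between two samples, or when a restarted job has already run longer than the previous reading by the time it is sampled, so a short scrape interval matters.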