Apache Flink · Metric

flink_jobmanager_job_downtime

For jobs currently in a failing/recovering situation, the time elapsed during this outage (in milliseconds). Returns 0 for running jobs and -1 for completed jobs.
Dimensions: None
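Because of the 0 and -1 sentinel values, the raw metric needs interpretation before it can drive dashboards or alerts. A minimal sketch in Python (the JobManager URL and job ID below are placeholders; the metric is exposed as `downtime` via Flink's standard `/jobs/<job-id>/metrics` REST endpoint):

```python
import json
import urllib.request


def classify_downtime(downtime_ms: int) -> str:
    """Interpret a flink_jobmanager_job_downtime sample.

    -1 -> the job has completed
     0 -> the job is running normally
    >0 -> the job is failing/recovering; value is the outage so far (ms)
    """
    if downtime_ms == -1:
        return "completed"
    if downtime_ms == 0:
        return "running"
    return f"recovering for {downtime_ms} ms"


def fetch_downtime(jobmanager_url: str, job_id: str) -> int:
    """Fetch the downtime metric for one job from the Flink REST API.

    Example URL: http://localhost:8081/jobs/<job-id>/metrics?get=downtime
    """
    url = f"{jobmanager_url}/jobs/{job_id}/metrics?get=downtime"
    with urllib.request.urlopen(url) as resp:
        # Response shape: [{"id": "downtime", "value": "0"}]
        metrics = json.load(resp)
    return int(metrics[0]["value"])
```

An alert should therefore fire only on positive values; treating -1 as "one millisecond of downtime" is a common mistake when the metric is aggregated naively.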
Available on: Datadog (1)
Interface Metrics (1)
Datadog
Knowledge Base (5 documents, 0 chunks)
[reference] Metrics and dimensions in Managed Service for Apache Flink (2465 words, score 0.95): Official AWS documentation for Managed Service for Apache Flink (Amazon MSF) metrics and dimensions. Provides a comprehensive reference of application metrics, including CPU, memory, checkpoint, watermark, and backpressure metrics reported to Amazon CloudWatch for monitoring Flink applications.
[troubleshooting] Flink JobManager dies due to checkpoint failures - Stack Overflow (762 words, score 0.72): This Stack Overflow discussion addresses JobManager failures caused by checkpoint exceptions after upgrading from Flink 1.9.0 to 1.11.1. The issue involves a FileAlreadyExistsException when reusing fixed job IDs without HA, and the discussion covers checkpoint recovery behavior and the necessity of HA for tolerating JobManager failures.
[troubleshooting] Application is restarting - Managed Service for Apache Flink (1016 words, score 0.85): This AWS documentation page provides troubleshooting guidance for Apache Flink applications that are repeatedly restarting or failing. It covers symptoms, root causes, and solutions for unstable applications, including how to use CloudWatch metrics like FullRestarts and Downtime to diagnose issues.
[best practices] Operating Flink Is Hard: What does this really mean? And how to go about it? (1627 words, score 0.85): This blog post provides operational best practices for running Apache Flink in production, emphasizing that Flink jobs should be treated like microservices. It covers capacity planning, performance testing, monitoring strategies, and how different teams (platform engineers vs. application developers) should approach observability with different metrics and alert thresholds.
[documentation] Optimizing Flink job restart times for task recovery and scaling operations with Amazon EMR on EKS (1722 words, score 0.75): This AWS documentation page describes optimization techniques for Flink job restart times during task recovery and scaling operations on Amazon EMR on EKS. It covers task-local recovery, EBS volume mounting, incremental checkpointing, and fine-grained recovery mechanisms to reduce recovery time from minutes to seconds.
Related Insights (1)
Job Restart Storm (critical)

Frequent job restarts indicate systemic instability from resource exhaustion, external dependency failures, or configuration issues, compounding into extended downtime.
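A restart storm is visible in this metric as repeated transitions from 0 back to a positive value, each marking a fresh recovery episode. One way to flag it is to count those transitions over a sampling window; a sketch under that assumption (the episode threshold is illustrative, not a recommended value):

```python
from typing import Iterable, Tuple


def count_recovery_episodes(samples: Iterable[Tuple[float, int]]) -> int:
    """Count distinct outage episodes in (timestamp, downtime_ms) samples.

    A new episode begins whenever downtime moves from <= 0
    (running or completed) to > 0 (failing/recovering).
    """
    episodes = 0
    in_outage = False
    for _, downtime_ms in sorted(samples):
        if downtime_ms > 0 and not in_outage:
            episodes += 1
            in_outage = True
        elif downtime_ms <= 0:
            in_outage = False
    return episodes


def is_restart_storm(samples: Iterable[Tuple[float, int]],
                     max_episodes: int = 3) -> bool:
    """Flag a restart storm when episodes in the window exceed a threshold."""
    return count_recovery_episodes(samples) > max_episodes
```

Counting episodes rather than summing downtime distinguishes a single long outage (one episode, possibly benign recovery) from many short ones (the storm pattern described above).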