The Temporal Cloud-Datadog integration requires configuring the Temporal Cloud Prometheus endpoint URL and uploading certificates for secure authentication. Proper certificate management is essential for maintaining metrics flow.
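To illustrate the certificate side of this setup, the sketch below builds an HTTP client that presents a client certificate/key pair when calling the Temporal Cloud metrics endpoint. This is a minimal, hedged example: the function name and file paths are illustrative, and the actual endpoint URL and certificate provisioning come from your Temporal Cloud account settings.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newMTLSClient builds an HTTP client that authenticates with the client
// certificate/key pair registered for the Temporal Cloud metrics endpoint.
// File paths are placeholders; rotate the pair before it expires, or
// metrics ingestion stops.
func newMTLSClient(certFile, keyFile string) (*http.Client, error) {
	cert, err := tls.LoadX509KeyPair(certFile, keyFile)
	if err != nil {
		return nil, fmt.Errorf("loading client certificate: %w", err)
	}
	return &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
			},
		},
	}, nil
}
```

The returned client can then issue GET requests against the Prometheus endpoint URL configured in the integration.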
The official Temporal Cloud-Datadog integration eliminates the need to build and maintain custom PromQL-to-Datadog scrapers: Datadog's crawler fetches metrics directly from the Temporal Cloud Prometheus endpoint over its REST API, reducing operational complexity.
The Temporal Cloud-Datadog integration classifies metrics as standard metrics rather than custom metrics in Datadog, which can significantly reduce monitoring costs compared to custom metrics ingestion.
Temporal Cloud metrics are tagged with operation type and task type attributes, enabling detailed performance analysis at the operation and task level rather than only at the system level.
Temporal Cloud metrics in Datadog are tagged with namespace attributes, enabling granular analysis and troubleshooting of multi-tenant environments. This allows isolation of issues to specific namespaces rather than investigating the entire deployment.
Replication lag indicates potential data consistency issues between Temporal Cloud regions or replicas. The integration provides built-in monitors for this critical metric to detect synchronization delays.
Workflow failures indicate potential issues with Workflow execution health and should be monitored proactively. The Datadog integration provides built-in monitors specifically for detecting Workflow failure patterns.
Temporal Cloud service latency at P99 should be monitored to detect performance degradation. The integration provides built-in monitors for this critical metric to enable proactive alerting before customer impact.
The Batcher's internal goroutine exits after IdleTime to avoid wasting resources on idle streams. This is expected behavior, but the goroutine must be restarted when new items arrive, which adds latency to the first Add after an idle period.
When Add operations experience context cancellations or timeouts, the items may still be processed in the future despite the context error being returned. This creates uncertainty about item processing state and potential duplicate processing.
The Batcher system collects items into batches and processes them through a single-threaded processing function. Errors during batch processing indicate failures in the batch function execution, which can lead to unprocessed items and system degradation.
Batch operations depend on either a Query parameter or explicit Executions list to identify target workflows. Invalid queries or empty result sets will cause batch operations to complete without processing any workflows, potentially masking operational issues.
Batch operations running as long-duration activities must heartbeat regularly to signal progress. If ActivityHeartBeatTimeout is set too low relative to processing time per workflow, batch activities will timeout prematurely even when making progress.
Batch operations are rate-limited via the RPS parameter, with defaults and maximums defined by the worker.BatcherRPS dynamic configuration. When batch jobs process large numbers of workflows, RPS limits can cause significant delays or timeouts.
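The interaction between the RPS cap and the heartbeat timeout can be checked with back-of-the-envelope arithmetic. The helper below is illustrative only — the function names and the example numbers are not Temporal APIs or defaults; substitute your actual configured values.

```go
package main

// estimateBatchSeconds gives a rough duration, in seconds, for a batch
// job throttled to rps workflows per second. For example, 50,000
// workflows at an effective limit of 50 RPS take on the order of
// 1,000 seconds (~17 minutes).
func estimateBatchSeconds(workflows int, rps float64) float64 {
	return float64(workflows) / rps
}

// minHeartbeatTimeoutSeconds suggests a floor for ActivityHeartBeatTimeout:
// the worst-case gap between per-workflow heartbeats (1/rps) multiplied by
// a safety factor to absorb pauses and retries.
func minHeartbeatTimeoutSeconds(rps float64, safetyFactor float64) float64 {
	return safetyFactor / rps
}
```

If the estimated duration is large, expect the job to run long rather than fail; but if ActivityHeartBeatTimeout is below the suggested floor, the activity can time out even while making steady progress.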
Different batch operation types (terminate, cancel, signal, delete, reset, update_options, unpause_activities) have distinct failure modes. Tracking operation-level errors helps identify whether failures are systemic or specific to certain operation types.
The 'Failed reaching server: last connection error' message indicates connectivity failures that commonly occur due to expired TLS certificates or during Server startup when roles are not fully initialized. This prevents Clients and Workers from establishing connections.
The 'Context: deadline exceeded' error signals that requests from Temporal Clients or Workers to the Temporal Service cannot complete within expected timeframes. This can indicate network issues, service overload, or configuration problems that prevent normal operations.
The BlobSizeLimitError occurs when payloads (including Workflow context, Activity arguments, or return values) exceed Temporal's size limits. This can cause workflow execution failures and requires immediate attention to reduce payload sizes.
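One mitigation is to validate payload sizes client-side before starting a Workflow or Activity. The sketch below assumes a JSON payload converter and a 2 MB limit; the limit constant and function name are illustrative, so check the limit actually enforced by your deployment.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blobSizeLimit is an assumed per-payload cap (2 MB is Temporal's
// commonly cited default); confirm the value for your deployment.
const blobSizeLimit = 2 * 1024 * 1024

// checkPayloadSize serializes v the way a JSON payload converter would
// and rejects it before it can trigger a BlobSizeLimitError server-side.
func checkPayloadSize(v any) error {
	b, err := json.Marshal(v)
	if err != nil {
		return err
	}
	if len(b) > blobSizeLimit {
		return fmt.Errorf("payload is %d bytes, exceeds %d-byte limit; pass a reference (e.g. an object-store key) instead",
			len(b), blobSizeLimit)
	}
	return nil
}
```

The usual fix for oversized payloads is to store the blob externally and pass only a reference through the Workflow.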
The Visibility store must be properly configured as part of the Persistence store setup to enable workflow execution listing and filtering. Misconfiguration or missing Visibility store setup will prevent operators from viewing or searching workflow executions.
Temporal Server v1.21 introduces Dual Visibility capability to enable migration from one Visibility database to another without service interruption. This is critical for upgrading Visibility infrastructure or migrating from standard to advanced Visibility.
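Assuming a self-hosted deployment, a Dual Visibility migration might be configured along these lines — the store names and connection details are illustrative; the key point is pairing `visibilityStore` (current) with `secondaryVisibilityStore` (migration target) under `persistence`:

```yaml
persistence:
  visibilityStore: mysql-visibility        # primary (current) store
  secondaryVisibilityStore: es-visibility  # secondary (migration target)
  datastores:
    mysql-visibility:
      sql:
        pluginName: "mysql8"
        databaseName: "temporal_visibility"
        # ...connection settings elided
    es-visibility:
      elasticsearch:
        version: "v7"
        indices:
          visibility: temporal_visibility_v1
        # ...connection settings elided
```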
Advanced Visibility features including custom SQL-like List Filters for listing, filtering, and sorting Workflow Executions require specific database configurations. Without proper setup, operators cannot leverage enhanced query capabilities.
Standard Visibility does not support Custom Search Attributes, limiting the ability to filter and search workflow executions beyond predefined filters. This restricts operational visibility and troubleshooting capabilities.
Standard Visibility support is deprecated beginning with Temporal Server v1.21. Systems using standard Visibility need to migrate to advanced Visibility to maintain support and access to custom Search Attributes.
Without Elasticsearch integration, Temporal Service may experience performance issues when spawning more than a few Workflow Executions. Elasticsearch offloads Visibility request load from the primary database.
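A standalone Elasticsearch-backed Visibility store looks roughly like the following sketch (host, index name, and version are placeholders to adapt to your cluster):

```yaml
persistence:
  visibilityStore: es-visibility
  datastores:
    es-visibility:
      elasticsearch:
        version: "v7"
        url:
          scheme: "http"
          host: "elasticsearch:9200"
        indices:
          visibility: temporal_visibility_v1
```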
Eager Workflow Start (EWS) is specifically designed for short-lived Workflows that use Local Activities to interact with external services in the first Workflow Task. These Workflows benefit most from the reduced startup latency while retaining reliability through server-driven retries and compensation.
EWS must be enabled at the server level using the dynamic configuration flag system.enableEagerWorkflowStart. For Temporal Cloud, this requires opening a support ticket.
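For self-hosted clusters, the dynamic configuration entry might look like this (file-based dynamic config assumed; constraints left empty to apply globally):

```yaml
system.enableEagerWorkflowStart:
  - value: true
    constraints: {}
```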
EWS optimization combines workflow registration and first task assignment in a single database update, eliminating the separate matching operation that associates polling workers with task queue messages. This reduces both latency and latency variation.
The current EWS implementation does not support worker versioning with build IDs, another Pre-release feature. This incompatibility will be resolved before General Availability.
EWS requires the worker and starter to share a client connection to discover each other, meaning they must run in the same process and share a common lifecycle. This is a significant architectural constraint that differs from traditional Temporal deployment patterns.
When Eager Workflow Start is enabled but the local worker refuses to honor the reserved execution slot, the system automatically falls back to the traditional non-eager workflow start path after WorkflowTaskTimeout expires.