Technology | Scenario | Category | Severity | Grade | w/ Schema | Tokens | Latency | Turns | Tool Calls | Schema Calls | Response
Snowflake | Cache Hit Ratio Optimization: I'm seeing the same Snowflake queries run multiple times per day but they're always scanning from remote storage instead of using cached results. Help me understand why the result cache isn't being used and how to optimize warehouse cache retention. | Proactive Health | info
PostgreSQL | Connection Pool Exhaustion: We're getting 'FATAL: sorry, too many clients already' errors in our app logs. Our Cloud SQL PostgreSQL instance is at 95% of max_connections. What do I do right now and how do I prevent this? | Incident Response | critical | B+ | A- | 601 vs 852 | 15.9s vs 19.1s | 2 vs 2 | 0 vs 0 | 0 vs 0 | 1,490 chars vs 2,053 chars
Redis | Cost Optimization Through Right-Sized Instance Selection: We're running Redis on AWS ElastiCache using r6g.2xlarge instances but I suspect we might be over-provisioned. Our CPU is averaging 30% and memory is at 45%. Is there a cheaper instance type that would work for our workload, or should we consider data tiering with r6gd instances? How do I balance cost savings against performance risk? | Cost Optimization | info
Redis | Cross-Cloud Migration Planning and Validation: We're migrating our Redis deployment from AWS ElastiCache to Azure Cache for Redis. I need to understand the differences between the platforms - are there metrics that are named differently or work differently? What's the best migration approach to minimize downtime and how do I validate that all data migrated correctly? | Migration | info
PostgreSQL | Cross-platform metric mapping for PostgreSQL: I'm migrating my PostgreSQL monitoring from Datadog to Prometheus. How do Datadog's postgresql.connections metric and AWS CloudWatch's DatabaseConnections metric map to the postgres_exporter metrics in Prometheus? | Cross Platform | info | A- | B+ | 868 vs 1,109 | 18.1s vs 44.1s | 2 vs 8 | 0 vs 3 | 0 vs 2 | 2,078 chars vs 1,818 chars
Redis | Eviction Policy Misconfiguration Causing OOM Errors: Our Redis is throwing OOM errors saying 'command not allowed when used memory > maxmemory' even though we have maxmemory-policy set to volatile-lru. We're not at 100% memory but writes are being rejected. Why isn't Redis evicting keys and how do I fix this without losing data? | Incident Response | critical
Redis | High Latency Spikes from Slow Commands: We're seeing Redis latency spike to 500ms+ randomly throughout the day, causing timeouts in our application. Our P99 latency is normally under 10ms. I need to figure out what's causing these spikes - is it slow commands, memory issues, or something with persistence? What should I look at in SLOWLOG and what other metrics correlate with these latency events? | Incident Response | warning
Redis | Memory Capacity Planning for Growing Workload: Our Redis memory usage has grown from 60% to 82% of maxmemory over the past month and the trend is continuing. I want to understand if this growth rate means we'll hit our limit soon and start evicting keys. Should I scale up the instance now or can I wait? What's the safe memory threshold to stay under? | Capacity Planning | warning
Redis | Memory Eviction Causing Database Thundering Herd: My Redis cache is evicting keys rapidly and our database is getting hammered with queries. I'm seeing 10x normal database load and response times are terrible. Help me diagnose if this is a memory pressure issue in Redis and whether I need to scale up or tune eviction policies. | Incident Response | critical
Redis | Memory Fragmentation Remediation: My Redis instance shows 6GB used memory but the actual RSS is 9GB, giving a fragmentation ratio of 1.5. We're getting closer to our memory limit but a lot of it seems to be wasted. Should I enable active defragmentation or restart the instance? What's causing this and how do I prevent it from happening again? | Proactive Health | warning
Redis | Redis Backup and Disaster Recovery Verification: Our Redis backups show rdb_last_save_time from 6 hours ago but we configured hourly snapshots - help me verify if backups are actually running correctly, whether AOF is healthy, and how much data we'd lose if Redis crashed right now. | Proactive Health | warning
Redis | Redis Blocked Clients from Blocking Operations: I'm seeing blocked_clients metric climbing to 500+ and we use BLPOP heavily for job queues - help me understand if this is normal for our workload or if blocked operations are consuming too many resources and impacting Redis performance. | Proactive Health | warning
Redis | Redis Cluster Failover and Split-Brain Diagnosis: Our Redis Sentinel is showing conflicting master nodes after a network partition. Some sentinels think node A is master while others think node B is master. We're getting write conflicts and I'm worried about data loss. How do I diagnose which node should actually be master and safely resolve this split-brain situation? | Incident Response | critical
Redis | Redis Cluster Resharding for Horizontal Scaling: We need to add 3 more nodes to our Redis cluster to handle growth, but I'm worried about the resharding process causing latency spikes or downtime - help me understand how long slot migration will take, whether it can run during business hours, and if clients need any changes to handle the resharding. | Capacity Planning | warning
Redis | Redis Connection Pool Exhaustion - Max Clients Reached: We're getting 'ERR max number of clients reached' errors and our app can't connect to Redis - should I increase maxclients, fix connection leaks in the application, or scale out the Redis cluster? | Incident Response | critical
Redis | Redis CPU Saturation and Slowlog Spike: Redis EngineCPU is pegged at 100% and the slowlog is filling up with queries taking over 500ms - help me identify which commands are blocking the single-threaded engine and whether I need to optimize queries or scale vertically. | Incident Response | critical
Redis | Redis Failover Decision - Manual vs Automatic: Our Redis master is degraded with high latency but Sentinel hasn't triggered failover yet - help me understand if I should manually promote a replica with FAILOVER command or wait for automatic failover, and what's the data loss risk either way? | Incident Response | critical
Redis | Redis Instance Right-Sizing for Cost and Performance: I'm running Redis on ElastiCache r6g.large but CPU is consistently under 20% and memory usage is only 40% - help me determine if I should downsize to save costs, or if there are traffic patterns or growth projections I should consider before changing instance size. | Cost Optimization | info
Redis | Redis Memory Eviction Crisis During Peak Traffic: My Redis cache is evicting keys like crazy and the cache hit rate just dropped from 95% to 60% - help me figure out if I need to scale up, tune eviction policies, or if there's a memory leak in the application. | Incident Response | critical
Redis | Redis Memory Fragmentation Degrading Performance: My Redis instance shows mem_fragmentation_ratio at 1.8 even though we're only using 60% of allocated memory - is this fragmentation normal or do I need to enable active defragmentation, and will it impact production performance? | Proactive Health | warning
Redis | Redis Persistence Strategy Selection - RDB vs AOF: We're currently using RDB snapshots every 5 minutes but just lost 3 minutes of writes during a crash - should we switch to AOF for better durability, enable hybrid persistence, or is there a way to tune RDB to reduce data loss without killing performance? | Proactive Health | warning
Redis | Redis Replication Lag Spike Causing Stale Reads: Our Redis replicas are showing 30+ seconds of replication lag and we're getting complaints about stale data - help me diagnose if this is a network issue, resource constraint, or if the master is writing too fast for replicas to keep up. | Incident Response | warning
Redis | Replication Lag Causing Stale Reads and Failover Risk: Our Redis replicas are showing replication lag of 30+ seconds and we're getting complaints about stale data being served to users. I'm worried about data loss if we have a failover right now. What metrics should I check to understand why replication is lagging and how can I determine if we need to scale or if this is a temporary spike? | Incident Response | critical
Redis | SLOWLOG Analysis for Performance Optimization: Our Redis P95 latency has crept up from 5ms to 25ms over the past few weeks. I want to check SLOWLOG to see if there are specific commands causing problems. What should I look for in SLOWLOG output and what are common culprits? How do I determine if these slow commands are from our application or if Redis itself is struggling? | Proactive Health | warning