Memory Leak in MLflow Server Process
Severity: critical

The MLflow tracking server exhibits unbounded memory growth over time, reaching 8-20 GB of RAM even at moderate experiment counts (10,000+). The growth correlates with cached experiment metadata, model registry data, and internal state that is never evicted, eventually causing OOM kills and service disruption.
Symptoms:
- MLflow server process memory growing by more than 100 MB per day, continuously
- Memory usage exceeding 8 GB with fewer than 20,000 experiments
- OOM kills visible in system logs or container restart events
- Memory growth rate correlating with the rate of new experiment creation
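The first two symptoms can be checked mechanically by sampling the server's RSS over time. A minimal sketch, Linux-only since it reads /proc directly (use psutil for portability); the 100 MB/day budget mirrors the symptom above and should be tuned to your baseline:

```python
import os

def read_rss_bytes(pid):
    """Read resident set size for a PID from /proc (Linux only)."""
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")

def rss_growth_mb_per_hour(samples):
    """Estimate RSS growth rate from a list of (timestamp_s, rss_bytes) samples."""
    if len(samples) < 2:
        return 0.0
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600.0
    return (r1 - r0) / (1024 * 1024) / hours if hours > 0 else 0.0

def exceeds_daily_budget(samples, budget_mb_per_day=100):
    """Flag the >100 MB/day growth symptom described above."""
    return rss_growth_mb_per_hour(samples) * 24 > budget_mb_per_day
```

In practice, run the sampler as a sidecar or cron job against the MLflow server PID and feed the samples into whichever alerting system is in use.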
1. INVESTIGATE: Monitor the MLflow server process RSS over 24-48 hours. Check for correlation between memory growth and experiment creation. Review server logs for cache-related warnings.

2. DIAGNOSE: Use a memory profiler (py-spy, memory_profiler) to identify the largest objects on the heap. Check whether metadata caching is growing unbounded. Verify that no circular references are preventing garbage collection.

3. REMEDIATE: Run the MLflow server in a containerized environment with memory limits and automatic restarts: 'docker run --memory=4g --memory-swap=4g --restart=unless-stopped mlflow/mlflow:latest server ...'. Schedule periodic server restarts every 24-48 hours via cron or container orchestration. For Kubernetes, set memory limits and liveness probes with a restart policy. Consider deploying multiple server replicas behind a load balancer to enable zero-downtime restarts.

4. PREVENT: Monitor server memory continuously, with alerts at 3 GB, 5 GB, and 7 GB thresholds. Track upstream MLflow GitHub issues for memory-leak fixes. Test new versions in staging before upgrading production. Document memory growth patterns and restart frequency.
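For the DIAGNOSE step, py-spy and memory_profiler attach to a live process; when the leak can be reproduced in-process, the stdlib tracemalloc module gives a similar view of the largest allocations. A minimal sketch, where the dict is only a stand-in for an unbounded metadata cache:

```python
import tracemalloc

def top_allocations(limit=5):
    """Return the largest current allocations, grouped by source line."""
    snapshot = tracemalloc.take_snapshot()
    return snapshot.statistics("lineno")[:limit]

# Start tracing BEFORE the suspect code path runs, then allocate.
# The dict below simulates cached metadata that is never evicted.
tracemalloc.start()
cache = {i: "x" * 256 for i in range(10000)}

for stat in top_allocations():
    print(stat)  # e.g. "<file>:<line>: size=..., count=..., average=..."
```

Taking two snapshots hours apart and diffing them with `snapshot.compare_to(old, "lineno")` shows which call sites are actually growing, which is the signal that matters here.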
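The PREVENT-step alerts can be sketched as a small watchdog that fires each threshold exactly once as it is crossed. The 3/5/7 GB thresholds come from step 4 and the 4 GB restart limit mirrors the docker '--memory=4g' flag in step 3; the alerting hook itself is left to whatever pager or metrics system is in use:

```python
GB = 1024 ** 3
ALERT_THRESHOLDS_GB = (3, 5, 7)  # from the PREVENT step above

def crossed_thresholds(prev_rss_bytes, curr_rss_bytes,
                       thresholds_gb=ALERT_THRESHOLDS_GB):
    """Return thresholds newly crossed between two RSS samples,
    so each alert fires once per crossing instead of on every poll."""
    return [t for t in thresholds_gb
            if prev_rss_bytes < t * GB <= curr_rss_bytes]

def should_restart(curr_rss_bytes, limit_gb=4):
    """Mirror the 4 GB container memory limit from the REMEDIATE step."""
    return curr_rss_bytes >= limit_gb * GB
```

A poll loop would sample RSS every few minutes, page on `crossed_thresholds(prev, curr)`, and trigger a rolling restart (or rely on the container limit) when `should_restart` returns true.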