Artifact Storage Cost Explosion
criticalcost_managementUpdated Sep 3, 2025
S3/cloud storage costs explode due to MLflow storing all artifacts forever by default, including large training datasets and intermediate model checkpoints logged with every experiment.
Technologies:
How to detect:
Monitor artifact storage size growth, S3 bucket costs, and average artifact size per experiment. Watch for experiments logging full training datasets or large model files repeatedly.
Recommended action:
Implement S3 lifecycle policies to transition artifacts to cheaper storage tiers (STANDARD_IA after 30 days, GLACIER after 365 days, expire after 7 years). Log artifact URIs instead of uploading large files directly through MLflow. Set MLFLOW_ARTIFACT_UPLOAD_TIMEOUT=300 for large artifacts.