MLflow

Artifact Storage Cost Explosion

critical
cost_managementUpdated Sep 3, 2025

S3/cloud storage costs explode due to MLflow storing all artifacts forever by default, including large training datasets and intermediate model checkpoints logged with every experiment.

How to detect:

Monitor artifact storage size growth, S3 bucket costs, and average artifact size per experiment. Watch for experiments logging full training datasets or large model files repeatedly.

Recommended action:

Implement S3 lifecycle policies to transition artifacts to cheaper storage tiers (STANDARD_IA after 30 days, GLACIER after 365 days, expire after 7 years). Log artifact URIs instead of uploading large files directly through MLflow. Set MLFLOW_ARTIFACT_UPLOAD_TIMEOUT=300 for large artifacts.