Artifact Upload Timeout Cascade Failures
warninglatencyUpdated Sep 3, 2025
Large model artifacts (>2GB) consistently timeout during upload to S3 with default timeout settings, causing training pipelines to fail at the final step after hours of computation.
Technologies:
How to detect:
Monitor artifact upload durations, timeout error rates, artifact sizes, and network throughput. Watch for 'requests.exceptions.ConnectTimeout: HTTPSConnectionPool' errors and failed uploads of large files.
Recommended action:
Increase artifact upload timeout: 'export MLFLOW_ARTIFACT_UPLOAD_TIMEOUT=300'. For very large models, bypass MLflow artifact logging and upload directly to S3, then log only the URI: 'mlflow.log_param("model_uri", "s3://bucket/path")'. Consider multipart uploads for files >100MB.