Cudagraph capture size configuration affects memory and performance

info

performance

Updated Mar 24, 2026

Sources

torch.compile integration - vLLM (docs.vllm.ai)

Technologies:

BentoML

vLLM

Symptoms of this issue are visible in vLLM metrics and logs.

How to detect:

vLLM automatically determines cudagraph capture sizes by default. Each captured cudagraph consumes GPU memory, so memory use grows with the number of sizes captured. If the capture sizes do not match the workload, one of two things can happen: batches miss the captured graphs and run in slower eager mode, or vLLM captures more sizes than the workload uses and wastes memory.
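To make the mismatch concrete, the sketch below scores a list of observed batch sizes against a candidate set of capture sizes. It is not vLLM code. It assumes a simplified model in which a batch is padded up to the nearest captured size and runs eagerly only when it exceeds the largest one. The function name `classify_batches` and the sample data are hypothetical.

```python
from bisect import bisect_left


def classify_batches(batch_sizes, capture_sizes):
    """Return (eager_fraction, mean_padding) for observed batch sizes.

    Simplified model (an assumption, not vLLM's implementation):
    a batch replays the smallest captured graph that fits it, and
    runs eagerly if it is larger than every captured size.
    """
    if not batch_sizes:
        raise ValueError("batch_sizes must not be empty")
    sizes = sorted(set(capture_sizes))
    eager = 0
    padding = 0
    for b in batch_sizes:
        i = bisect_left(sizes, b)  # index of first captured size >= b
        if i == len(sizes):
            eager += 1  # larger than every captured size
        else:
            padding += sizes[i] - b  # slots wasted by padding up
    replayed = len(batch_sizes) - eager
    mean_padding = padding / replayed if replayed else 0.0
    return eager / len(batch_sizes), mean_padding


# Hypothetical sample of batch sizes taken from serving logs.
observed = [1, 1, 3, 5, 8, 12]
eager_fraction, mean_padding = classify_batches(observed, [1, 2, 4, 8])
print(f"eager: {eager_fraction:.0%}, mean padding: {mean_padding:.1f}")
```

A high eager fraction suggests the largest capture size is too small for the workload. Large mean padding suggests the captured sizes are too sparse around the common batch sizes.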

Recommended action:

Override the default cudagraph capture sizes with --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}', choosing sizes that match your workload's typical batch sizes. This gives fine-grained control over the tradeoff between memory and performance. For attention backends that are cudagraph-compatible, consider full cudagraph capture to improve decode speed on smaller models or MoE models.
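A minimal sketch of generating that flag from a list of sizes, so the JSON and shell quoting are not written by hand. The helper `compilation_config_flag` is hypothetical. Only the `--compilation-config` option and the `cudagraph_capture_sizes` key come from the text above.

```python
import json
import shlex


def compilation_config_flag(capture_sizes):
    """Build the --compilation-config argument for a set of capture sizes."""
    # Deduplicate, drop non-positive values, and sort ascending.
    sizes = sorted({int(s) for s in capture_sizes if int(s) > 0})
    if not sizes:
        raise ValueError("at least one positive capture size is required")
    payload = json.dumps({"cudagraph_capture_sizes": sizes})
    # shlex.quote wraps the JSON so a POSIX shell passes it as one argument.
    return "--compilation-config " + shlex.quote(payload)


flag = compilation_config_flag([8, 1, 4, 2, 4])
print(flag)
```

Append the printed string to your vLLM launch command. Fewer sizes reduce capture memory, while sizes closer to the real batch-size distribution reduce eager fallbacks.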