BentoML, vLLM

Cudagraph capture size configuration affects memory and performance

info
performance
Updated Mar 24, 2026
How to detect:

vLLM automatically determines cudagraph capture sizes by default. Captured cudagraphs consume memory proportional to the number of sizes captured. If the captured sizes do not match the workload, batches may miss the captured graphs and fall back to slower eager mode, or vLLM may capture more sizes than the workload ever uses and waste memory.
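
To make the hit-or-miss behavior concrete, here is a simplified model of the lookup, not vLLM's actual implementation. It assumes a batch is padded up to the smallest captured size that can hold it, and that a batch larger than every captured size runs in eager mode. The function name `resolve_capture_size` is hypothetical.

```python
from bisect import bisect_left


def resolve_capture_size(batch_size, capture_sizes):
    """Return the captured size a batch would be padded to, or None.

    Simplified model (an assumption, not vLLM's code): the batch uses the
    smallest captured size >= batch_size; None means no captured graph
    fits and the batch would fall back to eager mode.
    """
    sizes = sorted(set(capture_sizes))
    idx = bisect_left(sizes, batch_size)
    return sizes[idx] if idx < len(sizes) else None


capture_sizes = [1, 2, 4, 8]
for batch_size in (1, 3, 8, 12):
    target = resolve_capture_size(batch_size, capture_sizes)
    if target is None:
        print(f"batch {batch_size}: no captured graph, eager mode")
    else:
        print(f"batch {batch_size}: captured graph of size {target}")
```

Under this model, a batch of 3 is served by the size-4 graph, while a batch of 12 misses every captured graph. A workload that mostly produces batches of 12 would therefore gain nothing from the four graphs captured here, which is the mismatch described above.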

Recommended action:

Override the default cudagraph capture sizes with --compilation-config '{"cudagraph_capture_sizes": [1, 2, 4, 8]}', choosing sizes that match your workload's typical batch sizes. This gives fine-grained control over the memory versus performance tradeoff. For attention backends that are cudagraph-compatible, consider full cudagraph capture to improve decode speed on smaller models or MoEs.
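
One way to derive the size list from observed traffic is sketched below. The helper names, the power-of-two rounding, and the `max_capture_size` cutoff are illustrative assumptions, not vLLM's own policy. Only the `cudagraph_capture_sizes` key and the `--compilation-config` flag come from the recommendation above.

```python
import json
import shlex


def next_power_of_two(n):
    """Round a positive integer up to the nearest power of two."""
    if n < 1:
        raise ValueError("batch size must be >= 1")
    return 1 << (n - 1).bit_length()


def build_compilation_config(observed_batch_sizes, max_capture_size=8):
    """Build the JSON value for --compilation-config (hypothetical helper).

    Each observed batch size is rounded up to a power of two. Sizes above
    max_capture_size are dropped to bound the memory spent on captured
    graphs; under this assumption those batches would run in eager mode.
    """
    sizes = set()
    for batch_size in observed_batch_sizes:
        rounded = next_power_of_two(batch_size)
        if rounded <= max_capture_size:
            sizes.add(rounded)
    return json.dumps({"cudagraph_capture_sizes": sorted(sizes)})


# Example: batch sizes sampled from request logs (made-up values).
config = build_compilation_config([1, 2, 3, 5, 7, 8])
print("--compilation-config " + shlex.quote(config))
```

Rounding to powers of two keeps the list short, so fewer graphs are captured, at the cost of some padding per batch. Raising `max_capture_size` trades more memory for fewer eager-mode fallbacks.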