Blog · AI & ML · 06/15/2026
How important are VRAM and HBM in AI infrastructure: from inference to fine-tuning

When choosing AI infrastructure, FLOPS usually gets mentioned first, but in practice VRAM capacity and HBM bandwidth are often what determine which models can run, and at what batch size and latency. This article covers how to size VRAM for production, why HBM differs from GDDR, and when you need multiple GPUs.

Tags: AI & ML, GPU, VRAM, HBM · 7 min read
[Figure B-08: GPU memory (VRAM / HBM), bandwidth-bound. Datacenter GPU: HBM3, ~3.2 TB/s. Consumer GPU: GDDR6, ~1.0 TB/s. Example 24GB VRAM allocation: weights ~5GB, KV cache (grows with load), free space. Capacity ≠ throughput; bandwidth = latency.]

When discussing AI infrastructure, the number that comes up first is usually FLOPS: "how much more powerful is this GPU than that one". But in practice, especially for inference, which accounts for most of the infrastructure cost once a model exists, two other factors usually decide things first: VRAM capacity (what the GPU memory can hold) and memory bandwidth, where HBM (High Bandwidth Memory) is the dominant technology on datacenter GPUs. A GPU with high FLOPS but insufficient VRAM to load the model, or insufficient bandwidth to "feed" its compute units, won't reach its theoretical performance, and sometimes ends up slower than a GPU that is "weaker" on paper but has a memory configuration better suited to the workload. This article breaks down why VRAM and HBM matter, and how we size GPUs when designing AI infrastructure for a Pilot Build.

VRAM — the first bottleneck of inference

Before a GPU can compute anything, all (or most) of a model's weights must be loaded into VRAM. This is a hard requirement, not an optimization concern: if the model doesn't fit in VRAM, it either doesn't run on that GPU at all or has to spill into system RAM, with the latency penalty described in the last point below.

  • A 7B-parameter model at FP16 precision needs roughly 14GB just to hold the weights (2 bytes per parameter), before accounting for anything else. Quantizing (as covered in the article on Qwen3-VL) brings this down to roughly 7GB at INT8 or 4-5GB at INT4, enough to run on a commodity 16-24GB VRAM GPU
  • The KV cache, temporary memory storing the attention keys and values for every token in the context (prompt and generated output alike), grows with both context length and the number of concurrent requests (batch size). With long contexts (tens of thousands of tokens, common with RAG using many documents), the KV cache can take up more VRAM than the model weights themselves
  • If the total VRAM needed (weights + KV cache + framework overhead) exceeds available VRAM, the system has to offload part of it to system RAM (CPU) — and since CPU-GPU bandwidth is far lower than internal VRAM bandwidth, latency can increase 10-50×, turning a "few hundred ms" response into several seconds
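
The arithmetic in the points above can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the function names are hypothetical, and the example shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) is an assumed configuration for a 7-8B-class model with grouped-query attention, not a figure taken from any specific model card.

```python
# Rough VRAM estimate: weights plus KV cache. All names here are illustrative.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GB = 1e9  # decimal gigabytes, matching the round numbers used in the text


def weight_bytes(n_params: float, precision: str) -> float:
    """Memory for the weights alone (no quantization metadata, no overhead)."""
    return n_params * BYTES_PER_PARAM[precision]


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_elem: float = 2.0) -> float:
    """KV cache size: one key and one value vector per layer, KV head and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * batch_size


if __name__ == "__main__":
    print(f"7B fp16 weights: {weight_bytes(7e9, 'fp16') / GB:.1f} GB")
    print(f"7B int4 weights: {weight_bytes(7e9, 'int4') / GB:.1f} GB")
    # Assumed model shape; replace with the values from the real model config.
    kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                        context_len=32_768, batch_size=8)
    print(f"KV cache at 32K context x 8 requests: {kv / GB:.1f} GB")
```

Under these assumed dimensions, the cache at a 32K context and a batch of 8 comes to about 34GB, more than twice the FP16 weights, which is why context length and concurrency belong in the sizing exercise from the start.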

What HBM is, and how it differs from GDDR

VRAM isn't a single type of memory — the manufacturing technology determines bandwidth, and bandwidth determines whether a GPU's compute units get "fed" enough data to run at full capacity.

  • HBM (High Bandwidth Memory) stacks multiple memory dies vertically, connected to the GPU die through a silicon interposer with thousands of parallel signal paths. It delivers very high bandwidth (HBM3 reaches roughly 3+ TB/s per GPU) but is more expensive and complex to manufacture
  • GDDR (used in consumer/gaming GPUs) achieves lower bandwidth (GDDR6/GDDR6X tops out at around 1 TB/s on high-end cards) but is cheaper and easier to integrate. That is enough for most graphics workloads, but often a bottleneck for large AI models
  • Many AI workloads — especially inference at small batch sizes — are memory-bound, not compute-bound: the GPU spends more time "waiting" for data from memory than waiting for computations to finish. This is why datacenter GPUs (H100, A100, MI300 — using HBM) cost dramatically more than consumer GPUs (RTX — using GDDR), even though the theoretical FLOPS gap isn't proportionally as large

Practical consequence: two GPUs with roughly similar FLOPS but different memory configurations (HBM vs. GDDR, or different VRAM capacities) can show several-fold differences in inference throughput for the same model — FLOPS alone doesn't reflect real-world AI performance.
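
A back-of-the-envelope way to see this: during single-stream decoding, generating each token requires reading roughly all of the weights from VRAM once, so bandwidth divided by weight size gives an upper bound on tokens per second. The sketch below applies that bound with round bandwidth figures in the range quoted above (3.0 and 1.0 TB/s are assumptions for illustration). The function name is hypothetical, and real throughput will be lower because the bound ignores KV cache reads, kernel overhead, and compute time.

```python
def max_decode_tokens_per_s(weight_gb: float, bandwidth_tb_s: float) -> float:
    """Upper bound on batch-1 decode speed if each token reads all weights once."""
    bandwidth_gb_s = bandwidth_tb_s * 1000.0
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    weights_gb = 14.0  # 7B parameters at FP16, 2 bytes per parameter
    for name, bw in [("HBM3, ~3.0 TB/s", 3.0), ("GDDR6, ~1.0 TB/s", 1.0)]:
        bound = max_decode_tokens_per_s(weights_gb, bw)
        print(f"{name}: at most ~{bound:.0f} tokens/s per stream")
```

The two bounds come out at roughly 214 and 71 tokens/s, and their ratio is simply the bandwidth ratio, with FLOPS nowhere in the calculation. Batching changes the picture: one read of the weights is shared by every request in the batch, so larger batches push the workload toward being compute-bound.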

Sizing GPUs for production — questions to answer first

  • Which model, at what precision/quantization? — determines the fixed VRAM needed for weights (e.g., 7B at INT4 ≈ 5GB, 13B at INT4 ≈ 8GB, 70B at INT4 ≈ 40GB)
  • What's the maximum context length and how many concurrent requests? This determines VRAM for the KV cache, which scales roughly linearly with both; at a 32K context and a batch of 8, the KV cache alone can exceed 10GB and, depending on the model's layer count and number of KV heads, reach several tens of GB
  • What throughput (requests/second) is needed, or what's the maximum acceptable latency per request? — determines whether you need to batch multiple requests together (higher throughput, more VRAM) or serve each request individually (lower latency, less VRAM but typically lower GPU utilization)
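  • Will fine-tuning run on the same infrastructure? Fine-tuning (especially full fine-tuning, not LoRA) needs VRAM for gradients and optimizer state, typically at least 3-4× the VRAM needed for inference on the same model at the same precision, and more when compared against FP16 inference: mixed-precision training with Adam keeps around 16 bytes per parameter, before activations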
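
Those questions can be turned into a first-pass sizing check. The sketch below is an assumption-laden estimate, not a sizing tool: the helper names, the 10% overhead reserve, and the example model shape (32 layers, 8 KV heads, head dimension 128) are illustrative choices rather than values from any framework. The 16 bytes per parameter for full fine-tuning assumes mixed-precision training with Adam (FP16 weights and gradients, FP32 master weights and two optimizer moments) and excludes activations.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical model description; fill in from the real model config."""
    n_params: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_bytes_per_request(m: ModelShape, context_len: int, kv_bytes: float = 2.0) -> float:
    # Keys and values (factor 2) for every layer, KV head and token.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * kv_bytes * context_len


def max_concurrent_requests(m: ModelShape, bytes_per_param: float, context_len: int,
                            vram_gb: float, overhead_frac: float = 0.10) -> int:
    """Full-length requests that fit after weights and an assumed overhead reserve."""
    budget = vram_gb * 1e9 * (1.0 - overhead_frac)
    free = budget - m.n_params * bytes_per_param
    if free <= 0:
        return 0  # the weights alone do not fit
    return int(free // kv_bytes_per_request(m, context_len))


def full_finetune_gb(n_params: float, bytes_per_param_state: float = 16.0) -> float:
    """Weights + gradients + Adam state in mixed precision, excluding activations."""
    return n_params * bytes_per_param_state / 1e9


if __name__ == "__main__":
    model = ModelShape(n_params=7e9, n_layers=32, n_kv_heads=8, head_dim=128)
    for vram in (24, 80):
        n = max_concurrent_requests(model, bytes_per_param=0.5,
                                    context_len=32_768, vram_gb=vram)
        print(f"{vram} GB GPU, 7B INT4, 32K context: up to {n} concurrent requests")
    print(f"Full fine-tuning 7B: ~{full_finetune_gb(7e9):.0f} GB before activations")
```

With these assumptions a 24GB card holds the INT4 weights comfortably but only a handful of full-length 32K requests, which shows how quickly the KV cache, not the weights, becomes the limiting term.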

When you need multiple GPUs

When a model doesn't fit in a single GPU's VRAM — common with 30B+ parameter models or MoE (mixture-of-experts) architectures, as mentioned in the article on Qwen3-VL — the model needs to be split across multiple GPUs:

  • Tensor parallelism splits each layer of the model across multiple GPUs, each computing a portion, with the results combined afterward. It requires very high inter-GPU bandwidth (NVLink rather than PCIe) since GPUs must synchronize within every layer
  • Pipeline parallelism splits the model along its layers, with each GPU holding a contiguous group of layers. It requires lower inter-GPU bandwidth but can create "bubbles" (GPUs waiting on each other) if load isn't well balanced
  • In practice, many serving frameworks (vLLM, TensorRT-LLM) support combining both — but adding more GPUs doesn't scale performance linearly if inter-GPU (or inter-node) bandwidth becomes the new bottleneck
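
To make the pipeline case concrete, the sketch below shows how layers can be assigned to GPUs in contiguous groups, and how large the idle "bubble" is for a simple GPipe-style schedule: (p - 1) / (m + p - 1) for p stages and m micro-batches. The function names are hypothetical, the formula assumes every stage costs the same, and none of this describes how vLLM or TensorRT-LLM schedule work internally.

```python
from typing import List


def assign_layers(n_layers: int, n_gpus: int) -> List[range]:
    """Split layers into contiguous, near-equal groups, one group per GPU."""
    base, extra = divmod(n_layers, n_gpus)
    groups, start = [], 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional layer each.
        size = base + (1 if gpu < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle share of a GPipe-style schedule with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


if __name__ == "__main__":
    for gpu, layers in enumerate(assign_layers(n_layers=80, n_gpus=4)):
        print(f"GPU {gpu}: layers {layers.start}..{layers.stop - 1}")
    for m in (1, 4, 16):
        print(f"4 stages, {m} micro-batches: {bubble_fraction(4, m):.0%} idle")
```

With a single micro-batch, three of four GPUs are idle at any moment (75%); splitting the work into 16 micro-batches brings the bubble down to about 16%. This is one reason adding GPUs does not scale throughput linearly on its own.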

Conclusion

Sizing GPUs for a production AI system shouldn't start with "which GPU is the strongest within budget", but with: which model and quantization, what context length and concurrency in practice, and whether the workload is memory-bound or compute-bound. Right-sizing VRAM and bandwidth avoids both extremes — too little (out-of-memory, falling back to CPU and slowing the whole system) and too much (wasted GPU spend on capacity/bandwidth that's never used). This is part of the infrastructure design we do for every Pilot Build in the AI & ML layer — choosing GPUs based on the specific problem, not generic benchmarks.
