Ampere only launched six months ago, but Nvidia is already upgrading the top-end version of its GPU with twice the VRAM and considerably more memory bandwidth. The A100 (80GB) keeps most of the A100 (40GB)’s specifications: the 1.41GHz boost clock, 5120-bit memory bus, 19.5 TFLOPS of single-precision performance, NVLink 3 support, and 400W TDP are all unchanged from the previous iteration of the GPU. Both chips also feature 6,912 CUDA cores.
What’s different is the maximum amount of VRAM (80GB, up from 40GB) and the memory itself: 3.2Gbps HBM2e instead of 2.4Gbps HBM2. Bandwidth across the entire memory array is roughly 2TB/s, up from about 1.55TB/s. This is a strong upgrade. It wouldn’t have been unusual for Nvidia to reduce the memory bandwidth of the array in order to double the capacity; instead, the company boosted total bandwidth by roughly 1.3x.
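The bandwidth figures follow from simple arithmetic: peak bandwidth is the bus width in bits multiplied by the per-pin data rate, divided by eight to convert bits to bytes. The sketch below assumes the 5120-bit bus and the rounded 2.4Gbps and 3.2Gbps per-pin rates quoted here; because those rates are rounded, the results land near, not exactly on, Nvidia’s published totals. The function name `peak_bandwidth_gbs` is purely illustrative.

```python
# Theoretical peak bandwidth = bus width (bits) x per-pin rate (Gbps) / 8 bits per byte.
# Assumption: the rounded per-pin rates from the article, not Nvidia's exact clocks.
BUS_WIDTH_BITS = 5120  # five active 1024-bit HBM stacks


def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Return theoretical peak memory bandwidth in GB/s (hypothetical helper)."""
    return bus_width_bits * gbps_per_pin / 8


a100_40gb = peak_bandwidth_gbs(BUS_WIDTH_BITS, 2.4)  # HBM2
a100_80gb = peak_bandwidth_gbs(BUS_WIDTH_BITS, 3.2)  # HBM2e

print(f"A100 40GB: {a100_40gb:.0f} GB/s")
print(f"A100 80GB: {a100_80gb:.0f} GB/s")
print(f"Ratio: {a100_80gb / a100_40gb:.2f}x")
```

Because the bus width is the same on both models, the bandwidth ratio depends only on the per-pin data rates.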
The A100 package carries six HBM stacks, as you can see in the image above, but Nvidia disables one of them to improve yield. The remaining five stacks each have a 1024-bit interface, which is where the 5120-bit bus figure comes from. For the 80GB model, Nvidia swapped the 40GB card’s HBM2 for HBM2e, which allowed it to double capacity and substantially raise bandwidth.
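The bus width, and the capacity each active stack must supply, can be checked directly. This minimal sketch assumes, as the paragraph states, five active stacks with 1024-bit interfaces, and derives the implied per-stack capacity by dividing total VRAM by the number of active stacks; the disabled sixth stack contributes nothing.

```python
# Five of the six physical HBM stacks are active; each has a 1024-bit interface.
BITS_PER_STACK = 1024
ACTIVE_STACKS = 5  # one stack is disabled to improve yield

bus_width_bits = ACTIVE_STACKS * BITS_PER_STACK

# Implied capacity per active stack: total VRAM divided by active stacks.
per_stack_gb = {
    "A100 40GB (HBM2)": 40 / ACTIVE_STACKS,
    "A100 80GB (HBM2e)": 80 / ACTIVE_STACKS,
}

print(f"Bus width: {bus_width_bits} bits")
for model, gb in per_stack_gb.items():
    print(f"{model}: {gb:.0f}GB per active stack")
```

Doubling capacity without widening the bus means each active stack has to hold twice as much memory.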
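The 80GB version should benefit workloads that are both capacity-limited and memory-bandwidth-bound. Like the 40GB variant, the A100 80GB can be partitioned into up to seven hardware instances using Multi-Instance GPU (MIG), but each instance can now have up to 10GB of dedicated VRAM, double the 5GB per instance on the 40GB card.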
Nvidia is selling these GPUs in mezzanine cards expected to be deployed in either an HGX or a DGX configuration. Customers who want an individual A100 GPU in a PCIe card are still limited to the 40GB variant, though this could change in the future.
The price tag on a server full of 80GB A100 cards is going to be firmly in “if you have to ask, you can’t afford it” territory. But there’s a reason companies on the cutting edge of AI development might pay so much: the complexity of the models a GPU can train is limited by its onboard memory. If a workload has to spill into main system memory, overall performance craters. CPUs may have the kind of DRAM capacities that AI researchers would love for their models, but they can’t provide the necessary bandwidth (and CPUs aren’t well suited to training neural networks in any case). A larger pool of onboard VRAM may allow developers to increase the complexity of the models they’re training, or to tackle problems that couldn’t previously fit into 40GB.