GTC 2020 -- NVIDIA has officially unveiled its first GPU based on the Ampere GPU architecture, the new NVIDIA A100, which is in full production and already shipping to customers worldwide.
NVIDIA's new A100 GPU packs an absolutely insane 54 billion transistors (that's 54,000,000,000), along with 3rd Gen Tensor Cores, 3rd Gen NVLink and NVSwitch, and much more. The GPU itself measures a huge 826mm² on TSMC's 7nm node, and packs 40GB of HBM2 memory from Samsung, with up to 600GB/sec of GPU-to-GPU bandwidth through NVLink.
The new A100 is being compared against the V100, which is based on the previous-gen Volta GPU architecture. NVIDIA's current-gen Tesla V100 comes in 16GB and 32GB HBM2 options, and since it was built on TSMC's 12nm node, it packs just 21.1 billion transistors in comparison.
There are some incredible things going on under the Ampere hood, with NVIDIA claiming the largest leap in performance yet: 20x. The new A100 GPU is up to 20x faster than the V100 on AI workloads, with peak FP32 training performance of 312 TFLOPs (via the new TF32 format with sparsity), peak INT8 inference of 1248 TOPs, and peak FP64 HPC performance of 19.5 TFLOPs.
NVIDIA founder and CEO Jensen Huang explains: "The powerful trends of cloud computing and AI are driving a tectonic shift in data center designs so that what was once a sea of CPU-only servers is now GPU-accelerated computing. NVIDIA A100 GPU is a 20x AI performance leap and an end-to-end machine learning accelerator -- from data analytics to training to inference. For the first time, scale-up and scale-out workloads can be accelerated on one platform. NVIDIA A100 will simultaneously boost throughput and drive down the cost of data centers".
NVIDIA describes its new Ampere A100 GPU as a "technical design breakthrough fueled by five key innovations". These innovations include:
- Ampere architecture: At the heart of A100 is the NVIDIA Ampere GPU architecture, which contains more than 54 billion transistors, making it the world's largest 7-nanometer processor.
- Third-generation Tensor Cores with TF32: NVIDIA's widely adopted Tensor Cores are now more flexible, faster and easier to use. Their expanded capabilities include new TF32 for AI, which allows for up to 20x the AI performance of FP32 precision, without any code changes. In addition, Tensor Cores now support FP64, delivering up to 2.5x more compute than the previous generation for HPC applications.
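The trick behind TF32 working "without any code changes" is its layout: it keeps FP32's 8-bit exponent (so the numeric range is unchanged) but cuts the mantissa from 23 bits down to 10 (FP16's precision). As a rough sketch of what that precision loss looks like, here's a minimal simulation that truncates a float32 mantissa to 10 bits -- note the real Tensor Cores round rather than truncate, so this is only an approximation:

```python
import struct

def tf32_round(x: float) -> float:
    """Simulate TF32 by truncating a float32 mantissa from 23 to 10 bits.

    TF32 keeps FP32's 8-bit exponent (same dynamic range) but only
    10 mantissa bits; zeroing the low 13 mantissa bits approximates
    that reduced precision.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # zero the low 13 mantissa bits
    return struct.unpack("<I", struct.pack("<I", bits))[0] and struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_round(1.5))          # exactly representable -> 1.5
print(tf32_round(1.0 + 2**-20)) # below 10-bit precision -> 1.0
```

Values whose significant bits fit in 10 mantissa bits pass through unchanged, which is why most training runs converge the same way while the matrix math itself runs far faster on the Tensor Cores.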
- Multi-instance GPU: MIG, a new technical feature, enables a single A100 GPU to be partitioned into as many as seven separate GPUs so it can deliver varying degrees of compute for jobs of different sizes, providing optimal utilization and maximizing return on investment.
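For the curious, partitioning is driven through `nvidia-smi`. A hypothetical sketch of carving an A100 into seven instances -- this assumes an A100 with a MIG-capable driver, and the profile ID (19, the 1g.5gb profile on the 40GB card) may differ on other configurations:

```shell
# Hypothetical sketch -- requires an A100 and a MIG-capable driver.
# Enable MIG mode on GPU 0 (takes effect after a GPU reset):
sudo nvidia-smi -i 0 -mig 1

# Create seven 1g.5gb GPU instances (profile ID 19 on the A100 40GB),
# each with its own compute instance (-C):
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# List the resulting MIG devices:
nvidia-smi -L
```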
- Third-generation NVIDIA NVLink: Doubles the high-speed connectivity between GPUs to provide efficient performance scaling in a server.
- Structural sparsity: This new efficiency technique harnesses the inherently sparse nature of AI math to double performance.
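The "structural" part refers to a fine-grained 2:4 pattern: in every group of four weights, two are pruned to zero, which lets the hardware skip the zeroed operands and double math throughput. Here's a minimal, illustrative sketch (plain Python, not NVIDIA's pruning tooling) of forcing a weight vector into that pattern by keeping the two largest-magnitude values per group of four:

```python
def prune_2_of_4(weights):
    """Prune a flat weight list to the 2:4 structured-sparsity pattern:
    in every contiguous group of four values, keep the two with the
    largest magnitude and zero the rest (illustrative only)."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude values in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

w = [0.9, -0.1, 0.05, -0.8, 0.2, 0.3, -0.01, 0.02]
print(prune_2_of_4(w))  # -> [0.9, 0.0, 0.0, -0.8, 0.2, 0.3, 0.0, 0.0]
```

Because the zero positions follow a fixed 2-in-4 layout, the A100's Tensor Cores can index the surviving weights compactly instead of scanning for arbitrary zeros, which is what makes the 2x claim achievable in hardware.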
NVIDIA Ampere A100 specs:
- Transistors: 54 billion
- CUDA cores: 6912
- Double-precision performance: 9.7 TFLOPs (19.5 TFLOPs via Tensor Cores)
- Single-precision performance: 19.5 TFLOPs
- Tensor performance: 312 TFLOPs (TF32 with sparsity)
- Node: 7nm TSMC
- Memory: 40GB HBM2
- Memory bus: 5120-bit
- Memory bandwidth: 1.6TB/sec
- Tensor Cores: 432 (3rd Gen)
- Interface: PCIe 4.0 x16
- TDP: 400W
NVIDIA Volta V100 (SXM2) specs:
- Transistors: 21.1 billion
- CUDA cores: 5120
- Double-precision performance: 7.8 TFLOPs
- Single-precision performance: 15.7 TFLOPs
- Tensor Performance: 125 TFLOPs
- Node: 12nm TSMC
- Memory: 16/32GB HBM2
- Memory bus: 4096-bit
- Memory bandwidth: 900GB/sec
- Tensor Cores: 640 (1st Gen)
- Interface: PCIe 3.0 x16
- TDP: 250-300W
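For a quick back-of-the-envelope view of the gen-over-gen gains, the raw hardware figures from the two spec lists above can be compared directly (A100 vs. V100 SXM2, using the 300W V100 TDP):

```python
# Gen-over-gen ratios from the spec lists above (A100 vs. V100 SXM2).
a100 = {"transistors (billions)": 54.0, "CUDA cores": 6912,
        "memory bandwidth (GB/s)": 1600, "TDP (W)": 400}
v100 = {"transistors (billions)": 21.1, "CUDA cores": 5120,
        "memory bandwidth (GB/s)": 900, "TDP (W)": 300}

for key in a100:
    print(f"{key}: {a100[key] / v100[key]:.2f}x")
```

That works out to roughly 2.56x the transistors, 1.35x the CUDA cores, and 1.78x the memory bandwidth for 1.33x the power -- the headline 20x figure comes from the new TF32 and sparsity features on AI workloads, not from the raw hardware scaling alone.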