NVIDIA AI GPUs trained Meta's new Llama 3 model for the cloud, edge, and RTX PCs

NVIDIA announces optimizations across all of its platforms to accelerate Meta's new Llama 3 large language model, from the cloud and data center to the edge and RTX AI PCs.

NVIDIA has just announced optimizations across all of its platforms to accelerate Meta Llama 3, Meta's latest-generation large language model (LLM).

Combined with NVIDIA accelerated computing, the new Llama 3 model equips developers, researchers, and businesses to innovate across a wide variety of applications. Meta engineers trained Llama 3 on a computing cluster featuring 24,576 NVIDIA H100 AI GPUs linked through an NVIDIA Quantum-2 InfiniBand network, and with support from NVIDIA, Meta tuned its network, software, and model architectures for its flagship LLM.

To further advance the state of the art in generative AI, Meta recently described plans to scale its AI GPU infrastructure to an astonishing 350,000 NVIDIA H100 AI GPUs. That's a lot of AI computing power, a ton of silicon, probably a city's worth of power, and an incredible sum of money spent on AI GPUs ordered from NVIDIA.

NVIDIA has said that versions of Meta's new Llama 3, accelerated on NVIDIA AI GPUs, are now available for use in the cloud, data center, edge, and PC. You can test Llama 3 from your own browser, where it is packaged as an NVIDIA NIM microservice with a standard application programming interface that can be deployed anywhere.
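
To get a sense of what that standard API looks like in practice, here is a minimal sketch of a chat-completion request against a NIM-style, OpenAI-compatible endpoint in Python. The base URL, model identifier, and API-key environment variable below are illustrative assumptions, not details confirmed in this article:

    # Minimal sketch: querying a Llama 3 NIM microservice through its
    # OpenAI-compatible chat API. The base_url, model name, and API-key
    # environment variable are assumptions for illustration; substitute
    # the values for your own NIM deployment or hosted endpoint.
    import os

    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
        api_key=os.environ["NVIDIA_API_KEY"],            # assumed env var name
    )

    response = client.chat.completions.create(
        model="meta/llama3-70b-instruct",  # assumed model identifier
        messages=[{"role": "user", "content": "In one sentence, what is an LLM token?"}],
        max_tokens=128,
    )

    print(response.choices[0].message.content)

Because the interface is the familiar chat-completions API, the same client code works whether the microservice runs in the cloud, in a data center, or on a local workstation.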

NVIDIA explains on its website: "Best practices in deploying an LLM for a chatbot involves a balance of low latency, good reading speed and optimal GPU use to reduce costs. Such a service needs to deliver tokens - the rough equivalent of words to an LLM - at about twice a user's reading speed which is about 10 tokens/second. Applying these metrics, a single NVIDIA H200 Tensor Core GPU generated about 3,000 tokens/second - enough to serve about 300 simultaneous users - in an initial test using the version of Llama 3 with 70 billion parameters. That means a single NVIDIA HGX server with eight H200 GPUs could deliver 24,000 tokens/second, further optimizing costs by supporting more than 2,400 users at the same time".
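
NVIDIA's serving math in that quote checks out with simple arithmetic; the back-of-the-envelope sketch below just replays the figures NVIDIA gives (10 tokens/second per user, 3,000 tokens/second per H200, eight GPUs per HGX server):

    # Back-of-the-envelope check of NVIDIA's Llama 3 70B serving figures.
    tokens_per_user = 10        # ~2x reading speed, per NVIDIA's rule of thumb
    h200_throughput = 3_000     # tokens/second from a single H200 in the test
    gpus_per_server = 8         # H200 GPUs in one NVIDIA HGX server

    users_per_gpu = h200_throughput // tokens_per_user       # -> 300 users
    server_throughput = h200_throughput * gpus_per_server    # -> 24,000 tokens/s
    users_per_server = server_throughput // tokens_per_user  # -> 2,400 users

    print(users_per_gpu, server_throughput, users_per_server)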

NEWS SOURCE: wccftech.com

Anthony joined the TweakTown team in 2010 and has since reviewed hundreds of graphics cards. A longtime PC enthusiast with a passionate dislike of games built around consoles, he has been addicted to gaming and hardware since the pre-Quake days of FPS gaming, when you were insulted for using a mouse to aim. Ten years working in IT retail gave him great experience with custom-built PCs. His addiction to GPU tech is unwavering, and he has recently taken a keen interest in artificial intelligence (AI) hardware.
