NVIDIA AI GPUs trained Meta's new Llama 3 model for the cloud, edge, and RTX PCs

NVIDIA announces optimizations across all of its platforms to accelerate Llama 3, Meta's latest large language model, with NVIDIA hardware and software powering the generative AI from the cloud to the edge.

1 minute & 46 seconds read time

NVIDIA has just announced optimizations across all of its platforms to accelerate Meta Llama 3, Meta's latest-generation large language model (LLM).


The new Llama 3 model, combined with NVIDIA accelerated computing, opens new possibilities for developers, researchers, and businesses across a wide range of applications. Meta engineers trained Llama 3 on a computing cluster featuring 24,576 NVIDIA H100 AI GPUs linked through the NVIDIA Quantum-2 InfiniBand network; with support from NVIDIA, Meta tuned its network, software, and model architectures for its flagship Llama 3 LLM.

To further advance the state of the art in generative AI, Meta recently described plans to scale its AI GPU infrastructure to an astonishing 350,000 NVIDIA H100 AI GPUs. That's a lot of AI computing power, a ton of silicon, probably a city's worth of power, and an incredible sum of money on AI GPUs ordered by Meta from NVIDIA.

NVIDIA has said that versions of Meta's new Llama 3, accelerated on NVIDIA AI GPUs, are now available for use in the cloud, data center, edge, and PC. You can test Llama 3 right from your browser, packaged as an NVIDIA NIM microservice with a standard application programming interface (API) that can be deployed anywhere.
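NIM microservices expose a standard, OpenAI-style chat-completions API. As a minimal sketch of what a request body might look like, assuming an OpenAI-compatible `/v1/chat/completions` endpoint; the model identifier and endpoint path here are illustrative assumptions, so check NVIDIA's API catalog for the exact values:

```python
import json

# Hypothetical request payload for a NIM deployment's OpenAI-compatible
# chat-completions endpoint. The model name below is an assumption.
payload = {
    "model": "meta/llama3-70b-instruct",  # assumed model identifier
    "messages": [
        {"role": "user", "content": "Summarize Llama 3 in one sentence."}
    ],
    "max_tokens": 128,
}

# Serialized body you would POST (e.g. to
# http://<nim-host>:8000/v1/chat/completions) with the header
# "Content-Type: application/json", plus an API key if using a hosted endpoint.
body = json.dumps(payload)
print(body)
```

Because the interface follows the familiar chat-completions convention, the same request shape works whether the microservice runs in the cloud, in a data center, or on local hardware.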

NVIDIA explains on its website: "Best practices in deploying an LLM for a chatbot involves a balance of low latency, good reading speed and optimal GPU use to reduce costs. Such a service needs to deliver tokens - the rough equivalent of words to an LLM - at about twice a user's reading speed which is about 10 tokens/second. Applying these metrics, a single NVIDIA H200 Tensor Core GPU generated about 3,000 tokens/second - enough to serve about 300 simultaneous users - in an initial test using the version of Llama 3 with 70 billion parameters. That means a single NVIDIA HGX server with eight H200 GPUs could deliver 24,000 tokens/second, further optimizing costs by supporting more than 2,400 users at the same time".
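The sizing math in NVIDIA's explanation is simple enough to sketch: divide a deployment's aggregate token throughput by the per-user target rate (about 10 tokens/second, roughly twice reading speed) to estimate how many users it can serve at once.

```python
# Back-of-envelope capacity estimate from NVIDIA's cited figures:
# a chatbot should deliver ~10 tokens/second per user (about twice
# a typical reading speed).
TOKENS_PER_USER_PER_SECOND = 10

def concurrent_users(total_tokens_per_second: int) -> int:
    """Users served at the target per-user token delivery rate."""
    return total_tokens_per_second // TOKENS_PER_USER_PER_SECOND

# One H200 running Llama 3 70B: ~3,000 tokens/second in NVIDIA's test.
print(concurrent_users(3_000))      # 300 simultaneous users

# An HGX server with eight H200s: ~24,000 tokens/second.
print(concurrent_users(8 * 3_000))  # 2,400 simultaneous users
```

These numbers match the 300-users-per-GPU and 2,400-users-per-server figures NVIDIA quotes for the 70-billion-parameter Llama 3.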

NEWS SOURCE: wccftech.com

Anthony joined the TweakTown team in 2010 and has since reviewed hundreds of graphics cards. Anthony is a longtime PC enthusiast with a passionate dislike for games built around consoles. An FPS gamer since the pre-Quake days, when you were insulted if you used a mouse to aim, he has been addicted to gaming and hardware ever since. Working in IT retail for 10 years gave him great experience with custom-built PCs. His addiction to GPU tech is unwavering, and he has recently taken a keen interest in artificial intelligence (AI) hardware.
