Elon Musk has announced a major milestone for his AI startup, xAI, which just brought its new AI training system, "Colossus," online over the weekend.
Musk tweeted: "This weekend, the xAI team brought our Colossus 100K H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200K (50K H200s) in a few months. Excellent work by the team, NVIDIA and our many partners/suppliers."
Colossus is home to 100,000 of NVIDIA's current-gen Hopper H100 AI GPUs, and Musk says the most powerful AI training system in the world will soon grow to 200,000 GPUs, 50,000 of which will be NVIDIA's beefed-up H200 AI GPUs, which pack faster HBM3E memory, and more of it, than the H100.
xAI's flagship LLM, Grok 2, was trained on 15,000 AI GPUs... so with Colossus offering access to 100,000+ AI GPUs, roughly a 6.7x jump in GPU count, we could see next-generation large language models with far better capabilities unleashed. Elon Musk himself said back in April 2024 that training Grok 3 would require 100,000 NVIDIA H100 AI GPUs... and just five months later we're here, with 100,000 NVIDIA H100 AI GPUs fired up and training away.
- Read more: Elon Musk's new Memphis Supercluster uses gigantic portable power generators, grid isn't enough
- Read more: Elon Musk turns on xAI's new AI supercomputer: 100K liquid-cooled NVIDIA H100 AI GPUs at 4:20am
- Read more: Elon Musk says training next-gen Grok 3 will require 100,000 NVIDIA H100 AI GPUs
NVIDIA's new Hopper H200 AI GPUs pack up to 141GB of faster HBM3E memory, while the H100 tops out at 80GB of HBM3. That difference adds up quickly at cluster scale, as the rough sketch below shows.
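For a sense of just how much memory is involved, here's a back-of-envelope calculation in Python. The per-GPU figures come from the specs above; the 150,000 H100 + 50,000 H200 split at the 200,000-GPU mark is an assumption inferred from Musk's "double in size to 200K (50K H200s)" wording, not a confirmed configuration.

```python
# Back-of-envelope aggregate HBM capacity for Colossus.
# Per-GPU memory is from NVIDIA's public specs (H100: 80GB HBM3, H200: 141GB HBM3E).
# The 150K/50K split after expansion is an ASSUMPTION based on Musk's tweet.

GB_PER_H100 = 80    # HBM3
GB_PER_H200 = 141   # HBM3E

# Today: 100,000 H100s
today_gb = 100_000 * GB_PER_H100

# After the planned expansion to 200,000 GPUs (assumed 150K H100 + 50K H200)
expanded_gb = 150_000 * GB_PER_H100 + 50_000 * GB_PER_H200

# 1 PB = 1,000,000 GB (decimal units)
print(f"Current HBM capacity:  {today_gb / 1e6:.2f} PB")    # ~8 PB
print(f"Expanded HBM capacity: {expanded_gb / 1e6:.2f} PB") # ~19 PB
```

Either way, Elon Musk and the xAI team are surely having a field day with this immense amount of AI training power.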