The underlying technology behind the AI-powered tools continually popping up across the technology industry is the Large Language Model (LLM), which is powered by number-crunching graphics cards, or GPUs.
The explosion of AI has been fuelled by graphics cards becoming powerful enough to crunch large swaths of data into a model that users can then query. However, powerful services such as ChatGPT require thousands of GPUs to continuously fulfil the requests of the millions of users enjoying them.
But what if you just wanted to support a few thousand users at a time? Perhaps you are a business that needs an AI chatbot to assist with customer service on your website. Do you really need thousands of GPUs to achieve this? Apparently not, according to benchmarks from Backprop, an Estonian GPU cloud start-up that managed to run a modest LLM, Llama 3.1 8B, on a single NVIDIA RTX 3090, a GPU released in late 2020.
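For a sense of what this looks like in practice, the sketch below serves Llama 3.1 8B on a single 24 GB card using the open-source vLLM inference engine. The model identifier, memory settings, and prompt are illustrative assumptions, not Backprop's published configuration:

```python
# A minimal sketch of serving Llama 3.1 8B on one 24 GB GPU (e.g. an RTX 3090)
# with vLLM. Settings here are assumptions, not Backprop's exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # ~16 GB of FP16 weights
    gpu_memory_utilization=0.90,  # leave headroom on the 24 GB card
    max_model_len=4096,           # cap context length so the KV cache fits
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["How do I reset my account password?"], params)
print(outputs[0].outputs[0].text)
```

vLLM batches incoming requests against the same set of weights, which is how one card can serve many concurrent users rather than one at a time.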
Throughout its testing, the start-up found the RTX 3090 delivered performance suitable for a customer service AI chatbot, with the GPU from 2020 able to fulfil requests from 100 concurrent users. In a test the RTX 3090 served each user 12.88 tokens per second, which is faster than the average person can read at around five words per second, and faster than the industry standard for an AI chatbot, which is 10 tokens per second.
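To make that comparison concrete, the quick calculation below converts the per-user token rate into an approximate reading rate, assuming the rough rule of thumb of 0.75 English words per token (an assumption, not a figure from Backprop):

```python
# Rough conversion of Backprop's per-user throughput into words per second,
# using the common ~0.75 words-per-token heuristic for English text.
WORDS_PER_TOKEN = 0.75           # heuristic; varies by tokenizer and text

tokens_per_second = 12.88        # per-user throughput from Backprop's test
reading_speed_wps = 5.0          # average reading speed (~300 words/min)
chatbot_baseline_tps = 10.0      # industry-standard chatbot throughput

words_per_second = tokens_per_second * WORDS_PER_TOKEN
print(f"~{words_per_second:.1f} words/s generated vs ~{reading_speed_wps} words/s read")
print(f"{tokens_per_second} tokens/s vs {chatbot_baseline_tps} tokens/s baseline")
```

At roughly 9.7 words per second, the card generates text about twice as fast as a user can read it.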
It appears it's possible to run a customer-service-grade AI chatbot on a single RTX 3090 GPU, and this chatbot could support thousands of users overall while fulfilling the requests of around 100 at any given time.