Google's next-gen Tensor processor: 45 TFLOPs of power

Google packs up to 180 TFLOPs of performance on a single board, 45 TFLOPs per Tensor processor

By Anthony Garreffa from May 18, 2017 @ 22:20 CDT

Google has just unveiled its second-generation Tensor processor, which packs 45 TFLOPs of performance per chip, with four of them placed onto a Tensor Processing Unit (TPU) module for a total of 180 TFLOPs.


The massively powerful systems are built for machine learning and artificial intelligence, and Google is pushing them into the cloud: its TPU-based systems will be made available on Google Compute Engine later this year. Google says its first-gen Tensor processors were already 15-30x faster, and a huge 30-80x more power efficient, than contemporary CPUs and GPUs for these types of workloads.

These new TPUs are "optimized for both workloads, allowing the same chips to be used for both training and making inferences. Each card has its own high-speed interconnects, and 64 of the cards can be linked into what Google calls a pod, with 11.5 petaflops total; one petaflops is 10^15 floating point operations per second", reports Ars Technica.
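The figures above can be checked with some quick arithmetic. The sketch below is only an illustration: the constant and function names are made up for this example, and the inputs are the peak numbers quoted in this article, not measured results.

```python
# Peak figures as quoted in the article (not measured values).
TFLOPS_PER_CHIP = 45     # second-gen Tensor processor
CHIPS_PER_MODULE = 4     # chips on one TPU module
MODULES_PER_POD = 64     # modules linked into one pod
TFLOPS_PER_PFLOPS = 1000 # 1 petaflops = 10^15 FLOPS = 1,000 TFLOPs


def module_tflops():
    """Peak TFLOPs of a single four-chip TPU module."""
    return TFLOPS_PER_CHIP * CHIPS_PER_MODULE


def pod_petaflops():
    """Peak petaflops of a 64-module pod."""
    return module_tflops() * MODULES_PER_POD / TFLOPS_PER_PFLOPS


if __name__ == "__main__":
    print(f"Module: {module_tflops()} TFLOPs")
    print(f"Pod:    {pod_petaflops()} petaflops")
```

Four chips at 45 TFLOPs give the 180 TFLOPs per module, and 64 modules work out to 11.52 petaflops, which lines up with the 11.5 petaflops quoted for a pod.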


Ars Technica points out that making comparisons between machine learning solutions is "difficult", because most GPUs have their performance measured in single precision FLOPs, which are based on 32-bit numbers. GPUs can also work in double precision (64-bit numbers) and half precision (16-bit numbers). Machine learning workloads normally use half precision when they can, but Google's first-gen TPUs didn't use floating point at all; they used 8-bit integer approximations to floating point.
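To give a rough idea of what an 8-bit integer approximation to floating point can look like, here is a minimal sketch of symmetric linear quantization. This is a generic textbook technique chosen for illustration, not Google's actual scheme, which the article does not describe; the function names `quantize` and `dequantize` and the sample weights are assumptions.

```python
def quantize(values, num_bits=8):
    """Map floats to signed integers using one shared scale factor.

    Generic symmetric linear quantization; an illustrative assumption,
    not the method used inside Google's TPUs.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer representation."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]          # hypothetical sample values
    q, scale = quantize(weights)
    print("8-bit integers:", q)
    print("Recovered:     ", dequantize(q, scale))
```

Each value is stored as a small integer plus one shared scale, so the recovered numbers are only approximations, off by at most half a quantization step. That trade of precision for cheaper arithmetic is part of why FLOPs figures and integer throughput figures do not compare cleanly.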

For comparison's sake, AMD's new Radeon Vega Frontier Edition has an estimated 13 TFLOPs of single precision compute performance, and 25 TFLOPs of half precision compute performance. NVIDIA's beefy new Volta-based Tesla V100 graphics solution packs 15 TFLOPs of single precision, and 120 TFLOPs for "deep learning" workloads.

