Meta Aims to Build the World’s Fastest AI Supercomputer

Meta, parent company of Facebook, says it has built a research supercomputer that is among the fastest on the planet. By the middle of this year, when an expansion of the system is complete, it will be the fastest, Meta researchers Kevin Lee and Shubho Sengupta write in a blog post today. The AI Research SuperCluster (RSC) will one day work with neural networks with trillions of parameters, they write. The number of parameters in neural network models has been growing rapidly. The natural language processor GPT-3, for example, has 175 billion parameters, and such sophisticated AIs are only expected to grow.

RSC is meant to address a critical limit to this growth: the time it takes to train a neural network. Generally, training involves testing a neural network against a large data set, measuring how far it is from doing its job accurately, using that error signal to tweak the network's parameters, and repeating the cycle until the network reaches the needed level of accuracy. Training a large network can take weeks of computing, limiting how many new networks can be trialed in a given year. Several well-funded startups, such as Cerebras and SambaNova, were launched in part to address training times.
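The cycle described above can be sketched in a few lines. This is a toy illustration, not Meta's code: plain gradient descent fitting a one-parameter linear model, with the synthetic data, learning rate, and stopping threshold all chosen here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "data set": inputs X and targets y = 3*x + 1 plus a little noise.
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=1000)

w, b = 0.0, 0.0   # the model's parameters
lr = 0.5          # learning rate

for step in range(200):                     # repeat the cycle...
    pred = w * X[:, 0] + b                  # test the model on the data
    err = pred - y                          # measure how far off it is
    loss = np.mean(err ** 2)
    if loss < 1e-3:                         # ...until accuracy is good enough
        break
    w -= lr * 2 * np.mean(err * X[:, 0])    # tweak the parameters using
    b -= lr * 2 * np.mean(err)              # the error signal (the gradient)

print(round(w, 1), round(b, 1))             # recovers roughly 3.0 and 1.0
```

With trillions of parameters, each pass of this same loop becomes enormously expensive, which is why training time is the bottleneck RSC targets.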

Among other things, Meta hopes RSC will help it build new neural networks that can do real-time voice translation for large groups of people, each speaking a different language, the researchers write. “Ultimately, the work done with RSC will pave the way toward building technologies for the next major computing platform—the metaverse, where AI-driven applications and products will play an important role,” they write.

“The experiences we’re building for the metaverse require enormous compute power (quintillions of operations / second!) and RSC will enable new AI models that can learn from trillions of examples, understand hundreds of languages, and more,” Meta CEO and cofounder Mark Zuckerberg said in a statement.

Old System: 22,000 Nvidia V100 GPUs
Today: 6,080 Nvidia A100 GPUs
Mid-2022: 16,000 Nvidia A100 GPUs

Compared to the AI research cluster Meta uses today, which was designed in 2017, RSC represents a leap in the number of GPUs involved, how they communicate, and the storage attached to them.

Lee and Sengupta explain: “In early 2020, we decided the best way to accelerate progress was to design a new computing infrastructure from a clean slate to take advantage of new GPU and network fabric technology. We wanted this infrastructure to be able to train models with more than a trillion parameters on data sets as large as an exabyte—which, to provide a sense of scale, is the equivalent of 36,000 years of high-quality video.”
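A quick back-of-envelope check (my own arithmetic, not from the post) shows the video comparison is plausible: an exabyte spread over 36,000 years of footage implies a bitrate in line with high-definition streaming.

```python
# What video bitrate makes 1 exabyte equal 36,000 years of footage?
exabyte_bits = 1e18 * 8                       # 1 EB in bits
seconds = 36_000 * 365.25 * 24 * 3600         # 36,000 years in seconds
mbps = exabyte_bits / seconds / 1e6
print(round(mbps, 1))                         # ≈ 7.0 Mbit/s
```

Roughly 7 megabits per second is a typical bitrate for 1080p streaming video, so "high-quality video" checks out.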

The old system connected 22,000 Nvidia V100 Tensor Core GPUs. The new one switches over to Nvidia’s latest data-center GPU, the A100, which has dominated in recent benchmark tests of AI systems. At present the new system is a cluster of 760 Nvidia DGX A100 computers, with a total of 6,080 GPUs. The computer cluster is bound together using an Nvidia 200-gigabit-per-second InfiniBand network. The storage includes 46 petabytes (46 million billion bytes) of cache storage and 175 petabytes of bulk flash storage.
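The node and GPU counts are consistent: each Nvidia DGX A100 system carries eight A100 GPUs, so 760 nodes account for the quoted total.

```python
# Each DGX A100 node holds 8 A100 GPUs.
nodes, gpus_per_node = 760, 8
total_gpus = nodes * gpus_per_node
print(total_gpus)   # 6080, matching the cluster figure above
```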

Computer vision: 20x
Large-scale natural-language processing: 3x

Compared to the old V100-based system, RSC delivers a 20-fold speedup on computer-vision tasks and a 3-fold speedup on large-scale natural-language-processing workloads.