Cerebras’ Tech Trains “Brain-Scale” AIs

Cerebras Systems, whose CS-2 AI-training computer contains the world’s largest single chip, revealed that the addition of a new memory system boosts the size of the neural networks the computer can train more than 100-fold, to 120 trillion parameters. The company has also come up with two schemes that speed training: linking as many as 192 systems together and efficiently handling so-called “sparsity” in neural networks. Cerebras’ cofounder and chief hardware architect Sean Lie detailed the technology today at the IEEE Hot Chips 33 conference.

The developments come from a combination of four technologies: weight streaming, MemoryX, SwarmX, and selectable sparsity. The first two expand the size of the neural networks the CS-2 can train by two orders of magnitude, and they represent a shift in the way Cerebras computers have operated.

The CS-2 is designed to train large neural networks quickly. Much of the time saving comes from the fact that the chip is large enough to keep an entire network, consisting primarily of parameters called weights and the intermediate values called activations, on the chip. Other systems lose time and power because they must continually load a fraction of the network onto a chip from DRAM and then store it back to make room for the next portion.
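To make that paging pattern concrete, here is a minimal Python sketch of the conventional load-compute-evict loop; the layer count, chunk size, and matrix sizes are hypothetical, and the code stands in for hardware behavior rather than any vendor’s actual mechanism.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
ON_CHIP_CAPACITY = 4   # layers that fit in on-chip memory at once
NUM_LAYERS = 16        # layers in the full network

# Stand-in for weights held in off-chip DRAM.
dram_weights = [np.random.randn(512, 512) for _ in range(NUM_LAYERS)]

def forward_with_paging(x):
    """Conventional approach: page a few layers at a time from DRAM,
    compute with them, then evict them to make room for the next chunk."""
    for start in range(0, NUM_LAYERS, ON_CHIP_CAPACITY):
        # Each load/evict round-trip costs time and power; this is the
        # traffic a chip big enough to hold the whole model avoids.
        on_chip = dram_weights[start:start + ON_CHIP_CAPACITY]  # "load"
        for w in on_chip:
            x = np.tanh(x @ w)
        # The chunk falls out of scope here: the "evict" step.
    return x

out = forward_with_paging(np.random.randn(1, 512))
```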

With 40 gigabytes of on-chip SRAM, the computer’s processor, the WSE-2, can fit the whole of even the largest of today’s common neural networks. But these networks are growing at a rapid pace, having increased 1,000-fold in the last few years alone, and they are now approaching 1 trillion parameters. So even a wafer-sized chip is starting to fill up.

To understand the solution, you first have to know a bit about what happens during training. Training involves streaming in the data the neural network will learn from and measuring how far from accurate the network’s output is. That difference is used to calculate a “gradient”: a measure of how each weight needs to be tweaked to make the network more accurate. The gradient is propagated backward through the network, layer by layer. Then the whole process is repeated until the network is as accurate as needed. In Cerebras’ original scheme, only the training data is streamed onto the chip; the weights and activations remain in place, and the gradient propagates within the chip.
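As a minimal sketch of that loop, the toy example below trains a two-layer network with plain gradient descent in Python; the network shape, data, and learning rate are arbitrary choices for illustration, not anything specific to the CS-2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))            # training data streamed in
y = rng.normal(size=(64, 1))            # target outputs

w1 = rng.normal(size=(8, 16)) * 0.1     # the weights (parameters)
w2 = rng.normal(size=(16, 1)) * 0.1
lr = 0.01                               # arbitrary learning rate

for step in range(100):
    # Forward pass: activations are the intermediate values.
    a1 = np.tanh(x @ w1)
    pred = a1 @ w2
    # Measure how far from accurate the network is.
    err = pred - y
    loss = np.mean(err ** 2)
    # Gradient: how each weight should be tweaked, propagated
    # backward through the network layer by layer.
    g_pred = 2 * err / len(x)
    g_w2 = a1.T @ g_pred
    g_a1 = g_pred @ w2.T
    g_w1 = x.T @ (g_a1 * (1 - a1 ** 2))  # tanh' = 1 - tanh^2
    # Tweak the weights, then repeat until accurate enough.
    w1 -= lr * g_w1
    w2 -= lr * g_w2
```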

“The new approach is to keep all the activations in place and pour the [weight] parameters in,” explains Andrew Feldman, Cerebras’ cofounder and CEO. The company built a hardware add-on to the CS-2 called MemoryX, which stores weights in a mix of DRAM and flash memory and streams them into the WSE-2, where they interact with the activation values stored on the processor. The gradient signal is then sent back to the MemoryX unit to adjust the weights. With weight streaming and MemoryX, a single CS-2 can now train a neural network with as many as 120 trillion parameters, the company says.
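A loose Python sketch of that division of labor might look like the following; the MemoryXStore class, its methods, and the two-layer model are hypothetical stand-ins for the behavior described above, not Cerebras’ actual software interface.

```python
import numpy as np

rng = np.random.default_rng(1)

class MemoryXStore:
    """Stand-in for the external weight store (DRAM plus flash in the
    real MemoryX). It streams weights out and applies gradients sent
    back to it. Class and method names here are hypothetical."""
    def __init__(self, shapes, lr=0.01):
        self.weights = [rng.normal(size=s) * 0.1 for s in shapes]
        self.lr = lr

    def stream_weights(self):
        yield from self.weights    # pour the parameters in, layer by layer

    def apply_gradients(self, grads):
        for w, g in zip(self.weights, grads):
            w -= self.lr * g       # weight adjustment happens off-chip

# On-chip state: the activations stay in place on the wafer.
x = rng.normal(size=(32, 8))
store = MemoryXStore([(8, 8), (8, 8)])

# Forward pass: weights arrive one layer at a time, are used, and move on.
activations = [x]
for w in store.stream_weights():
    activations.append(np.tanh(activations[-1] @ w))

# Backward pass: compute per-layer gradients against a toy target and
# send them back, mirroring the gradient signal returned to MemoryX.
target = rng.normal(size=(32, 8))
g = 2 * (activations[-1] - target) / len(x)    # mean-squared-error gradient
grads = []
for w, a_in, a_out in zip(reversed(store.weights),
                          reversed(activations[:-1]),
                          reversed(activations[1:])):
    g = g * (1 - a_out ** 2)                   # tanh derivative
    grads.append(a_in.T @ g)                   # this layer's weight gradient
    g = g @ w.T                                # propagate to the previous layer
store.apply_gradients(reversed(grads))
```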

Feldman says he and his cofounders could see the need for weight streaming back when they founded the company in 2015. “We knew at the very beginning we would need two approaches,” he says. However, “we probably underestimated how fast the world would get to very large parameter sizes.” Cerebras began devoting engineering resources to weight streaming at the start of 2019.

The other two technologies Cerebras unveiled at Hot Chips are aimed at speeding up training. SwarmX is hardware that extends the WSE-2’s on-chip high-bandwidth network so that it can link as many as 192 CS-2s. Building clusters of computers to train massive AI networks is fraught with difficulty, because the network has to be carved up among many processors, and the result often does not scale well, says Feldman. That is, doubling the number of computers in the cluster does not typically double the training speed.
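A toy calculation shows why. Suppose, hypothetically, that communication overhead costs 10 percent of the ideal speedup at each doubling of the cluster; the figures below are made up purely to illustrate the sublinear scaling Feldman describes.

```python
import math

# Toy illustration of sublinear cluster scaling (all numbers hypothetical).
def speedup(n_machines, per_doubling_efficiency=0.9):
    doublings = math.log2(n_machines)
    return n_machines * per_doubling_efficiency ** doublings

for n in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"{n:3d} machines -> {speedup(n):6.1f}x speedup (ideal: {n}x)")
# With these made-up numbers, 128 machines deliver roughly 61x, not 128x;
# SwarmX aims to keep scaling much closer to the ideal line.
```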

Cerebras’ MemoryX system delivers and manipulates weights for neural network training in the CS-2. The SwarmX network allows up to 192 CS-2s to work together on the same network.