Supercomputers Flex Their AI Muscles

Scientific supercomputing is not immune to the wave of machine learning that’s swept the tech world. Those using supercomputers to uncover the structure of the universe, discover new molecules, and predict the global climate are increasingly using neural networks to do so. And as is long-standing tradition in the field of high-performance computing, it’s all going to be measured down to the last floating-point operation.

Twice a year, Top500.org publishes a ranking of raw computing power using a value called Rmax, derived from benchmark software called Linpack. By that measure, it’s been a bit of a dull year. The rankings of the top nine systems are unchanged from June, with Japan’s Supercomputer Fugaku on top at 442,010 trillion floating-point operations per second. That leaves the Fujitsu-built system, at about 0.44 exaflops, short of the long-sought goal of exascale computing: one million trillion 64-bit floating-point operations per second, or one exaflops.

But by another measure, one more closely related to AI, Fugaku and its competitor, the Summit supercomputer at Oak Ridge National Laboratory, have already passed the exascale mark. That benchmark, called HPL-AI, measures a system’s performance using the lower-precision numbers (16 bits or fewer) common to neural network computing. Using that yardstick, Fugaku hits 2 exaflops (no change from June 2021) and Summit reaches 1.4 exaflops (a 23 percent increase).
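
To put those figures on a common scale, here is a minimal Python sketch of the unit arithmetic. The constant and function names are illustrative and not part of any benchmark tooling, and the HPL-AI values are the rounded ones quoted above, so the derived numbers are approximate.

```python
# Unit arithmetic for the figures quoted in the text (names are illustrative).
TERA = 10**12
EXA = 10**18


def teraflops_to_exaflops(teraflops):
    """Convert trillions of operations per second to exaflops."""
    return teraflops * TERA / EXA


fugaku_rmax = teraflops_to_exaflops(442_010)  # 64-bit Linpack score (Rmax)
fugaku_hpl_ai = 2.0  # exaflops, rounded value from the text
summit_hpl_ai = 1.4  # exaflops, rounded value from the text

# Ratio of low-precision to 64-bit throughput on Fugaku (approximate).
fugaku_ratio = fugaku_hpl_ai / fugaku_rmax

# Summit's implied June score, assuming the 23 percent gain is relative to it.
summit_june = summit_hpl_ai / 1.23

print(f"Fugaku Rmax: {fugaku_rmax:.3f} exaflops")
print(f"Fugaku HPL-AI vs. Rmax: about {fugaku_ratio:.1f}x")
print(f"Summit HPL-AI in June: about {summit_june:.2f} exaflops")
```

On these rounded inputs, Fugaku’s low-precision score comes out roughly four and a half times its 64-bit score, which is why the same machine can sit below exascale on one list and above it on another.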

But HPL-AI isn’t really how AI is done in supercomputers today. Enter MLCommons, the industry organization that’s been setting realistic tests for AI systems of all sizes. It released results from version 1.0 of its high-performance computing benchmarks, called MLPerf HPC, this week.

The suite of benchmarks measures the time it takes to train real scientific machine learning models to agreed-on quality targets. Compared to MLPerf HPC version 0.7, basically a warmup round from last year, the best results in version 1.0 showed a 4- to 7-fold improvement. Eight supercomputing centers took part, producing 30 benchmark results.

As in MLPerf’s other benchmarking efforts, there were two divisions: “Closed” submissions all used the same neural network model to ensure a more apples-to-apples comparison; “open” submissions were allowed to modify their models.

The three neural networks trialed were:

CosmoFlow uses the distribution of matter in telescope images to predict things about dark energy and other mysteries of the universe.
DeepCAM tests the detection of cyclones and other extreme weather in climate data.
OpenCatalyst, the newest benchmark, predicts the quantum mechanical properties of catalyst systems to discover and evaluate new catalyst materials for energy storage.

In the closed division, there were two ways of testing these networks: Strong scaling allowed participants to use as much of the supercomputer’s resources as they wanted to achieve the fastest neural network training time. Because it’s not really practical to use an entire supercomputer’s worth of CPUs, accelerator chips, and bandwidth resources on a single neural network, strong scaling shows what researchers think the optimal distribution of resources can do. Weak scaling, in contrast, divides the entire supercomputer among hundreds of identical copies of the same neural network, trained at the same time, to figure out what the system’s AI abilities are in total.
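
The difference between the two measurements can be sketched in a few lines of Python. This is an illustration only: the function names and every number in it are hypothetical, and the sketch assumes the machine is split into equal-size slices, one per model instance.

```python
def count_instances(total_nodes, nodes_per_instance):
    """Number of equal-size slices of the machine that can each train one model."""
    if nodes_per_instance <= 0:
        raise ValueError("nodes_per_instance must be positive")
    return total_nodes // nodes_per_instance


def weak_scaling_throughput(num_models, total_minutes):
    """Weak scaling score: models trained per minute across the whole machine."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return num_models / total_minutes


# Strong scaling: the score is simply the time to train ONE model
# on whatever slice of the machine the submitter judges best.
strong_scaling_minutes = 30.0  # hypothetical

# Weak scaling: fill the machine with identical copies and count completions.
total_nodes = 1024         # hypothetical machine size
nodes_per_instance = 64    # hypothetical slice size per model
num_models = count_instances(total_nodes, nodes_per_instance)
weak_total_minutes = 120.0  # hypothetical time until every copy finishes
models_per_minute = weak_scaling_throughput(num_models, weak_total_minutes)

print(f"Strong scaling: one model in {strong_scaling_minutes} minutes")
print(f"Weak scaling: {num_models} models, {models_per_minute:.3f} per minute")
```

Strong scaling rewards the fastest single training run, so a lower number is better. Weak scaling rewards total throughput, so a higher number is better.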

Here’s a selection of results:

Argonne National Laboratory used its Theta supercomputer to measure strong scaling for DeepCAM and OpenCatalyst. Using 32 CPUs and 129 Nvidia GPUs, Argonne researchers trained DeepCAM in 32.19 minutes and OpenCatalyst in 256.7 minutes. Argonne says it plans to use the results to develop better AI algorithms for two upcoming systems, Polaris and Aurora.

The Swiss National Supercomputing Centre used Piz Daint to train OpenCatalyst and DeepCAM. In the strong scaling category, Piz Daint trained OpenCatalyst in 753.11 minutes using 256 CPUs and 256 GPUs. It finished DeepCAM in 21.88 minutes using 1024 of each. The center will use the results to inform algorithms for its upcoming Alps supercomputer.

Fujitsu and RIKEN used 512 of Fugaku’s custom-made processors to perform CosmoFlow in 114 minutes. They then used half of the complete system, 82,944 processors, to perform the weak scaling benchmark on the same neural network. That meant training 637 instances of CosmoFlow, which the system managed to do at an average of 1.29 models per minute for a total of 495.66 minutes (just over 8 hours).
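
As a quick check on those weak scaling figures, the throughput and elapsed time can be recomputed from the numbers in the paragraph above. This is a minimal sketch; the variable names are illustrative.

```python
# Fugaku weak scaling figures as reported in the text.
models_trained = 637
total_minutes = 495.66

# Average throughput over the whole run, and the run length in hours.
models_per_minute = models_trained / total_minutes
total_hours = total_minutes / 60

print(f"{models_per_minute:.2f} models per minute")
print(f"{total_hours:.2f} hours")
```

The rate rounds to the reported 1.29 models per minute, and the run works out to about 8.26 hours.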

Helmholtz AI, a joint effort of Germany’s largest research centers, tested both the JUWELS and HoreKa supercomputers. HoreKa’s best effort was to chug through DeepCAM in 4.36 minutes using 256 CPUs and 512 GPUs. JUWELS did it in as little as 2.56 minutes using 1024 CPUs and 2048 GPUs. For CosmoFlow, its best effort was 16.73 minutes using 512 CPUs and 1024 GPUs. In the weak scaling benchmark JUWELS used 1536 CPUs and 3072 GPUs to plow through DeepCAM at a rate of 0.76 models per minute.