Supercomputer Emulator: AI’s New Role in Science

Artificial intelligence has become an indispensable tool in many scientists’ lives, such that its use by researchers now has its own moniker—AI4Science—used by conferences and laboratories. Last month, Microsoft announced its own AI4Science initiative, employing dozens of people spread across several countries. Chris Bishop, its director, started on the science side before gravitating to AI. He earned a Ph.D. in quantum field theory at the University of Edinburgh, then worked in nuclear fusion before machine learning caught his eye in the 1980s. He began applying neural networks to his own work. “I was kind of 25 years early,” he says, “but it really has taken off.” He joined Microsoft Research’s Cambridge lab in 1997, eventually becoming its director, and now has a new role. We spoke about the evolution of the scientific method, lasers versus beer, and nerdy T-shirts.

IEEE Spectrum: What is Microsoft AI4Science?

Chris Bishop: All it really is is a new team that we’re building. We see a very exciting opportunity over the next decade at the intersection of machine learning and the natural sciences—chemistry, physics, biology, astronomy, and so on. It goes beyond simply the application of machine learning in the natural sciences.

How does it go beyond that?

Bishop: There was a technical fellow at Microsoft, Jim Gray, who talked about four paradigms of scientific discovery. The first paradigm is the purely empirical. It’s observing regularities in the world around us.

The second paradigm is the theoretical. Think of Newton’s laws of motion, or Maxwell’s equations. These are typically differential equations. It’s an inductive step, an assumption that they describe the world more generally. An equation is incredibly precise over many scales of length and time, and you can write it on your T-shirt.

The third transformation in scientific discovery began in the middle of the 20th century, with the development of digital computers and simulations, effectively solving these differential equations for weather forecasting and other applications.

The fourth paradigm, taking off in the 21st century, was not about using computers to solve equations from first principles. It’s rather analyzing empirical data at scale using computers. Machine learning thrives in that space. Think of the Large Hadron Collider, the James Webb Space Telescope, or protein-binding experiments.

These four paradigms all work together.

We see a new paradigm emerging. You can trace its origins back many decades, but it’s a different way of using machine learning in the natural sciences. In the third paradigm, you run a complicated simulation on a supercomputer; then the next day, somebody asks a different question. You take a deep breath, more coin to the electricity meter. We can now use those simulation inputs and outputs as training data for machine-learning deep neural nets, which learn to replicate or emulate the simulator. If you use the emulator many times, you amortize the cost of generating the training data and the cost of training. And now you have this hopefully fairly general-purpose emulator, which you can run orders of magnitude faster than the simulation.
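The emulation workflow Bishop describes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not Microsoft's method: `simulate` is a toy damped-spring integrator standing in for a supercomputer run, and a polynomial regression stands in for the deep neural net.

```python
import numpy as np

def simulate(k, steps=1000, dt=0.005):
    """Toy stand-in for an expensive simulation (hypothetical): step a damped
    spring, x'' = -k*x - 0.3*x', forward in time and return the final position."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        v += (-k * x - 0.3 * v) * dt
        x += v * dt
    return x

# 1. Pay the simulation cost once to build a training set of inputs and outputs.
k_train = np.linspace(1.0, 4.0, 200)
y_train = np.array([simulate(k) for k in k_train])

# 2. Fit the emulator. A degree-10 polynomial is used here as a simple
#    stand-in for a neural network; the idea of learning input -> output is the same.
emulator = np.polynomial.Chebyshev.fit(k_train, y_train, deg=10)

# 3. Answer new questions with the emulator instead of rerunning the simulator.
k_new = np.array([1.37, 2.2, 3.91])
y_emulated = emulator(k_new)

# Check against the simulator (only affordable here because the toy is cheap).
y_simulated = np.array([simulate(k) for k in k_new])
max_error = float(np.max(np.abs(y_emulated - y_simulated)))
print("largest emulator error on new inputs:", max_error)
```

The 200 simulator runs are a one-time cost. Each later query costs one polynomial evaluation rather than a 1,000-step integration, which is the amortization argument in miniature.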

Roughly how much simulation data is needed to train an emulator?

Bishop: A lot of machine learning is an empirical science. It involves trying out different architectures and amounts of data and seeing how things scale. You can’t say ahead of time, I need 56 million data points to do this particular task.

What is interesting, though, are techniques in machine learning that are a little bit more intelligent than just regular training. Techniques like active learning and reinforcement learning, where a system has some understanding of its limitations. It could request more data where it has more uncertainty.
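One common way to let a system "request more data where it has more uncertainty" is an active-learning loop in which several models are fitted to the same data and their disagreement is treated as the uncertainty. The sketch below assumes that approach. The `simulate` function, the polynomial committee, and the candidate grid are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate(x):
    """Hypothetical stand-in for an expensive simulator."""
    return np.sin(3.0 * x) + 0.5 * x

def committee_predict(x_train, y_train, x_query, degrees=(2, 3, 4)):
    """Fit one polynomial per degree and return every model's predictions."""
    return np.array([
        np.polynomial.Polynomial.fit(x_train, y_train, d)(x_query)
        for d in degrees
    ])

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=6)   # initial data cover only part of [0, 3]
y_train = simulate(x_train)
candidates = np.linspace(0.0, 3.0, 61)    # inputs we are allowed to simulate next

queried = []
for _ in range(10):
    preds = committee_predict(x_train, y_train, candidates)
    uncertainty = preds.std(axis=0)              # disagreement between models
    x_new = candidates[np.argmax(uncertainty)]   # most uncertain input
    queried.append(float(x_new))
    x_train = np.append(x_train, x_new)          # run the simulator only there
    y_train = np.append(y_train, simulate(x_new))

print("inputs requested, in order:", queried)
```

Because the starting data lie between 0 and 1, the models disagree most where they must extrapolate, so the first requests land outside that range rather than being spread evenly.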

What are emulation’s weaknesses?

Bishop: They can still be computationally very expensive. Additionally, emulators learn from data, so they’re typically not more accurate than the data used to train them. Moreover, they may give insufficiently accurate results when presented with scenarios that are markedly different from those on which they’re trained.

“I believe in ‘use-inspired basic research’—[like] the work of Pasteur. He was a consultant for the brewing industry. Why did this beer keep going sour? He basically founded the whole field of microbiology.”
—Chris Bishop, Microsoft Research