GPT Language Model Spells Out New Proteins

Human languages have much in common with proteins, at least in terms of computational modeling. This has led research teams to apply novel methods from natural-language processing (NLP) to protein design. One of these teams, Birte Höcker's protein design lab at Bayreuth University in Germany, describes ProtGPT2, a language model based on OpenAI's GPT-2 that generates novel protein sequences following the principles of natural ones.

Just as letters from the alphabet form words and sentences, naturally occurring amino acids combine in different ways to form proteins. And proteins, just like natural languages, encode structure and function in their amino-acid sequences with extreme efficiency.
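
To make the analogy concrete, here is a minimal Python sketch that treats a protein sequence as text and splits it into word-like tokens. It assumes the Hugging Face transformers library and that the released ProtGPT2 tokenizer is published under the hub ID nferruz/ProtGPT2; neither detail comes from the article itself.

    # A minimal sketch: treating a protein sequence like a sentence of "words".
    # Assumes the Hugging Face `transformers` library and that the ProtGPT2
    # tokenizer is available under the hub ID "nferruz/ProtGPT2".
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")

    # A short stretch of amino acids in standard one-letter code.
    sequence = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE"

    # The tokenizer breaks the chain into multi-residue chunks, the protein
    # analogue of words in a sentence.
    print(tokenizer.tokenize(sequence))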

ProtGPT2 is a deep, unsupervised model that takes advantage of advances in transformer architecture that have also driven rapid progress in NLP technologies. The architecture has two modules, explains Noelia Ferruz, a coauthor of the paper and the person who trained ProtGPT2: one module that understands input text, and another that processes or generates new text. It was the second of these, the text-generating decoder module, that went into the development of ProtGPT2.
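
The practical difference between the two modules shows up in how each is allowed to attend to a sequence. The short PyTorch sketch below illustrates this with toy-sized attention masks; the sizes and variable names are purely illustrative and not taken from the paper.

    # Illustrative sketch of the masking difference between the two transformer modules.
    import torch

    seq_len = 5

    # Encoder-style attention: every position may attend to every other position,
    # which suits models built to understand a whole input at once.
    encoder_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

    # Decoder-style (causal) attention: position i may only attend to positions <= i,
    # so the model can be trained to predict the next token from the ones before it.
    # This is the module that ProtGPT2 builds on.
    decoder_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    print(encoder_mask)
    print(decoder_mask)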

“At the time we created this model, there were many others that were using the first module,” she says, such as ESM, ProtTrans, and ProteinBERT. “Ours was the first one publicly released at the time that was a decoder.” It was also the first time someone had directly applied GPT-2, she adds.

Ferruz herself is a big fan of GPT-2. “I find it very impressive that there was a model capable of writing English,” she says. This is a well-known transformer model that was pretrained on 40 gigabytes of Internet text in English in an unsupervised manner—that is, it used raw text with no human labeling—to generate the next word in sentences. The GPT-x series has been shown to efficiently produce long, coherent text, often indistinguishable from something written by a human—to the extent that potential misuse is a concern.
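
That next-word objective is easy to see in action. The sketch below runs the publicly available GPT-2 checkpoint through the Hugging Face transformers pipeline; the prompt and generation settings are arbitrary choices for illustration.

    # A minimal sketch of GPT-2's training objective in action: given a prompt,
    # the model extends it by repeatedly predicting the next word.
    # Assumes the Hugging Face `transformers` library and the public "gpt2" checkpoint.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")

    result = generator("Proteins are built from chains of", max_new_tokens=20, do_sample=True)
    print(result[0]["generated_text"])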

Given the capabilities of GPT-2, the Bayreuth researchers were optimistic about using it to train a model to learn the protein language, generate stable proteins, and also explore “dark” regions of the protein space. Ferruz trained ProtGPT2 on a data set of about 50 million nonannotated sequences across the whole protein space. To evaluate the model, the researchers compared a data set of 10,000 sequences generated by ProtGPT2 with a random set of 10,000 sequences from the training data set.
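
As a rough illustration of that comparison, the sketch below samples a few sequences from the released ProtGPT2 checkpoint and computes their amino-acid composition, a statistic that could equally be computed for a reference set of natural sequences. The hub ID, the sampling settings, and the choice of composition as the property to compare are assumptions made here for illustration; the paper's evaluation is far more extensive.

    # A simplified sketch of the comparison idea: sample sequences from ProtGPT2
    # and measure a basic property (amino-acid composition) that can also be
    # computed for a reference set of natural sequences from the training data.
    # Assumes the model is published under the hub ID "nferruz/ProtGPT2"; the
    # sampling settings below are illustrative, not the paper's.
    from collections import Counter
    from transformers import pipeline

    generator = pipeline("text-generation", model="nferruz/ProtGPT2")

    # Sample a handful of sequences (the study compared 10,000 on each side).
    samples = generator("<|endoftext|>", max_new_tokens=120, do_sample=True,
                        top_k=950, repetition_penalty=1.2, num_return_sequences=5)

    def composition(seqs):
        """Fraction of each amino acid across a set of sequences."""
        counts = Counter(res for s in seqs for res in s if res.isalpha())
        total = sum(counts.values())
        return {aa: round(counts[aa] / total, 3) for aa in sorted(counts)}

    generated = [s["generated_text"].replace("<|endoftext|>", "").replace("\n", "")
                 for s in samples]
    print(composition(generated))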