Designing Novel Proteins with Deep Hallucination

The Baker Lab at the University of Washington recently published an approach for finding novel proteins by running a structure prediction network in reverse, an iterative technique sometimes called hallucination [1]. We think this approach is a viable first step toward the rational design of unique, functional proteins for therapeutic and diagnostic applications.

Protein Folding

RoseTTAFold is a deep neural network that predicts the 3D structure of a protein from its sequence [2]. Like AlphaFold from DeepMind [3], RoseTTAFold predicts protein structures with high accuracy, but it does so at a lower computational cost and is also available as a web server.

Hallucination

It is well known to computer vision researchers that a deep neural network trained to label images, for example to decide whether an image shows a "cat", can be turned around and asked, "OK, so what do you think the ideal image of a cat looks like?" Starting from an arbitrary image or random noise, the model's learned parameters are held fixed while the image itself is iteratively updated in the direction that maximizes the network's score for the "cat" label. The images that come out look like "cats" but are often visually "off", like a hallucination.
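
The idea of updating the input rather than the model can be sketched with a toy example. This is a minimal illustration under stated assumptions, not a real vision model: the fixed logistic `WEIGHTS`, the 16-"pixel" image, and the function names `cat_score` and `hallucinate` are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import random

# Hypothetical stand-in for a trained classifier: a fixed logistic model
# over a 16-"pixel" image. A real network would have many layers.
random.seed(0)
N_PIXELS = 16
WEIGHTS = [random.uniform(-1.0, 1.0) for _ in range(N_PIXELS)]

def cat_score(image):
    """Probability the toy model assigns to the label "cat"."""
    logit = sum(w * x for w, x in zip(WEIGHTS, image))
    return 1.0 / (1.0 + math.exp(-logit))

def hallucinate(image, steps=200, lr=0.5):
    """Gradient ascent on the input; the model weights stay frozen."""
    image = list(image)
    for _ in range(steps):
        p = cat_score(image)
        # For a logistic model, d(score)/d(pixel_i) = p * (1 - p) * w_i
        grad = [p * (1.0 - p) * w for w in WEIGHTS]
        # Step each pixel uphill and clamp it to the valid range [0, 1]
        image = [min(1.0, max(0.0, x + lr * g)) for x, g in zip(image, grad)]
    return image

start = [random.random() for _ in range(N_PIXELS)]  # arbitrary starting image
result = hallucinate(start)
print(f"score before: {cat_score(start):.3f}, after: {cat_score(result):.3f}")
```

The only thing that changes between iterations is the image. With a deep network the gradient would come from backpropagation instead of a closed-form expression, but the loop has the same shape.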

Inverting the Protein Folding Models

For biologists, instead of finding the ideal-looking "cat", we can use a structure prediction network to look for the ideal protein structure. (The paper itself uses trRosetta, an earlier structure prediction network from the same lab [1].) A protein with an arbitrary sequence is predicted to fold into a feature-less structure that lacks tightly packed, regular elements such as alpha helices and beta sheets, similar to so-called "molten globule" states. Baker’s team defined an ideal-looking structure as one that is as dissimilar as possible from this feature-less state. Starting with a random sequence, the network predicts a structure (encoded as distributions of pairwise residue distances), and the KL divergence measures how different that prediction is from a representative feature-less background. The sequence is then updated with random point mutations, and each new sequence is accepted or rejected for the next round depending on how it changes the dissimilarity to the feature-less state. This process is repeated many times, starting from many random sequences. Refer to Figure 1 in the paper for a visual overview of the approach [1].
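
The mutate-and-accept loop described above can be sketched in a few lines of Python. This is a toy illustration, not the authors' code. `toy_predict` is a hypothetical stand-in for the structure prediction network, the uniform `BACKGROUND` is an assumed stand-in for the feature-less reference, and the Metropolis acceptance rule with a fixed `temperature` is only one plausible way to accept sequences "depending on the dissimilarity".

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_BINS = 8  # hypothetical number of distance bins

_table_rng = random.Random(0)
# Hypothetical stand-in for the trained network: every residue pair type gets
# a preferred distance bin and a confidence. A real model is far richer.
PREFERRED_BIN = {(a, b): _table_rng.randrange(N_BINS)
                 for a in AMINO_ACIDS for b in AMINO_ACIDS}
CONFIDENCE = {(a, b): _table_rng.uniform(0.0, 3.0)
              for a in AMINO_ACIDS for b in AMINO_ACIDS}
BACKGROUND = [1.0 / N_BINS] * N_BINS  # assumed feature-less reference

def toy_predict(seq):
    """Return one distance distribution per residue pair (i < j)."""
    dists = []
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            key = (seq[i], seq[j])
            logits = [CONFIDENCE[key] if b == PREFERRED_BIN[key] else 0.0
                      for b in range(N_BINS)]
            total = sum(math.exp(x) for x in logits)
            dists.append([math.exp(x) / total for x in logits])
    return dists

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def contrast(seq):
    """Mean KL divergence between predicted and background distributions."""
    dists = toy_predict(seq)
    return sum(kl_divergence(p, BACKGROUND) for p in dists) / len(dists)

def hallucinate(length=12, steps=500, temperature=0.01, seed=1):
    rng = random.Random(seed)
    seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]  # random start
    score = contrast(seq)
    start_score = score
    for _ in range(steps):
        candidate = list(seq)
        candidate[rng.randrange(length)] = rng.choice(AMINO_ACIDS)  # point mutation
        new_score = contrast(candidate)
        # Metropolis rule: always keep improvements, occasionally keep setbacks
        if new_score >= score or rng.random() < math.exp((new_score - score) / temperature):
            seq, score = candidate, new_score
    return "".join(seq), start_score, score

sequence, before, after = hallucinate()
print(f"{sequence}: contrast {before:.3f} -> {after:.3f}")
```

Calling `hallucinate` with different `seed` values mimics restarting from many random sequences. Swapping `toy_predict` for a real network and `BACKGROUND` for a learned reference distribution would leave the structure of the loop unchanged.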

Novel, Functional Proteins Created

Over many iterations, the resulting sequences converge to novel proteins that fold into “ideal” structures: low free energy states with regular alpha helices and beta sheets. These hallucinated proteins share features with known structures while also having their own idiosyncrasies. Remarkably, hallucinated proteins were even shown to be stable in wet-lab experiments [1].

This is an exciting first step in improving the rational design process for proteins. The key next step is adding more control to the process; the authors want to investigate how to encode desired structural features when generating hallucinated proteins. For example, structural features from the surface of a pathogen, such as new strains of SARS-CoV-2, could be used to generate sequences that fold into proteins that elicit an immune response, aiding the design of new mRNA vaccines. Other applications that protein structure hallucination could unlock include designing biomaterials with properties not possible with conventional synthetic polymers, or designing enzymes that can act on previously indigestible materials such as styrofoam, microplastics, or electronics. Improving our ability to generate new engineered proteins would lead to the next generation of biosensors, biomaterials, and therapeutics.

References

[1] Anishchenko I, Pellock SJ, Chidyausiku TM, Ramelot TA, Ovchinnikov S, Hao J, Bafna K, Norn C, Kang A, Bera AK, DiMaio F, Carter L, Chow CM, Montelione GT, Baker D. De novo protein design by deep network hallucination. Nature. 2021 Dec;600(7889):547-552. doi: 10.1038/s41586-021-04184-w. Epub 2021 Dec 1. PMID: 34853475.

[2] Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021 Aug 20;373(6557):871-876. doi: 10.1126/science.abj8754. Epub 2021 Jul 15. PMID: 34282049; PMCID: PMC7612213.

[3] Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D. Highly accurate protein structure prediction with AlphaFold. Nature. 2021 Aug;596(7873):583-589. doi: 10.1038/s41586-021-03819-2. Epub 2021 Jul 15. PMID: 34265844; PMCID: PMC8371605.

Written by: Jonathan Gallion, VP of AI/ML

Published on: February 2, 2022