Novel proteins across sequence, structure, and function

The next generation of antibodies, enzymes, peptides, and other proteins will be designed and engineered rather than discovered or screened. ML is likely to play a major role in this shift, from identifying sequences to resolving structures and crafting functions.

The design of de novo proteins never seen before in nature is a highly sought-after goal. Natural proteins have not evolved to serve the highly specialized functions that we now want them to carry out. We need to explore nature’s uncharted territory.

One method for such exploration is directed evolution: starting with a natural protein, then repeatedly mutating it and selecting the improved variants until a desired function is achieved. In many research areas, however, no natural protein can serve as this starting point. We need better means of sampling the sequence space.
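The mutate-and-select loop behind directed evolution can be sketched in a few lines. This is a toy illustration only: `toy_fitness` is a hypothetical stand-in for a wet-lab assay (it just counts matches to an arbitrary target string), and the round count, library size, and example sequences are made-up values.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes for the 20 standard amino acids

def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen position substituted."""
    pos = rng.randrange(len(seq))
    new_residue = rng.choice([a for a in AMINO_ACIDS if a != seq[pos]])
    return seq[:pos] + new_residue + seq[pos + 1:]

def toy_fitness(seq, target):
    """Hypothetical assay stand-in: fraction of positions that match target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def directed_evolution(start, fitness, rounds=50, library_size=20, seed=0):
    """Each round: build a library of single mutants and keep the best if it is no worse."""
    rng = random.Random(seed)
    best = start
    for _ in range(rounds):
        library = [mutate(best, rng) for _ in range(library_size)]
        candidate = max(library, key=fitness)
        if fitness(candidate) >= fitness(best):
            best = candidate
    return best

# Made-up example sequences, for illustration only.
target = "MKTAYIAKQR"
start = "MKTAYLLLLL"
evolved = directed_evolution(start, lambda s: toy_fitness(s, target))
print(start, "->", evolved)
```

The sketch also shows the limitation raised above: the search only ever moves one mutation at a time away from the starting sequence, so it needs a reasonable starting point to begin with.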

Much like text, a protein sequence can be represented as a string of letters drawn from an alphabet of 20 standard amino acids. As such, we can capitalize on the major advances in NLP to model proteins, including large language models and their underlying transformer architectures.
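To make the analogy concrete, here is a minimal sketch of treating a protein as text: each of the 20 standard amino acids becomes a token with an integer id. The alphabetical ordering of the vocabulary is an arbitrary choice for this example, and real protein language models typically add special tokens (padding, masking, rare residues) that are left out here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, one-letter codes
TOKEN_TO_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
ID_TO_TOKEN = {i: aa for aa, i in TOKEN_TO_ID.items()}

def encode(seq):
    """Map a protein sequence to a list of integer token ids."""
    return [TOKEN_TO_ID[aa] for aa in seq.upper()]

def decode(ids):
    """Map a list of token ids back to a protein sequence."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(encode("MKT"))          # [10, 8, 16]
print(decode([10, 8, 16]))    # MKT
```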

Labeled biomedical data is often difficult to come by. One major advantage of language models in protein modeling is that they are self-supervised, so they can be trained on unlabeled sequences. These models learn by masking or perturbing random portions of a protein sequence and then predicting the hidden residues from the surrounding context.
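The masking step can be sketched as follows. This shows only how a training example is built from an unlabeled sequence, not the model itself. The `#` mask symbol, the 15% masking rate (borrowed from common NLP practice), and the example sequence are assumptions made for illustration.

```python
import random

MASK = "#"  # hypothetical mask symbol; not one of the 20 amino-acid letters

def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns (masked_sequence, labels), where labels maps each masked
    position to the residue the model would be asked to recover.
    """
    if not seq:
        return seq, {}
    rng = random.Random(seed)
    n_masked = max(1, round(len(seq) * mask_fraction))
    positions = sorted(rng.sample(range(len(seq)), n_masked))
    masked = list(seq)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK
    return "".join(masked), labels

# Made-up sequence, for illustration only.
masked, labels = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
print(masked)
print(labels)
```

The labels come from the sequence itself, which is why no human annotation is needed.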

A protein’s sequence is inherently linked to its stable folded structure, which in turn dictates its function. Protein folding is a long-standing problem: by one commonly cited estimate, a protein chain has on the order of 10^143 possible conformations, yet real proteins fold quickly and reliably (Levinthal’s paradox). AlphaFold has made remarkable progress in this area.

Predicting a protein’s 3D structure from its sequence, with some degree of accuracy, is bound to accelerate our search for “unnatural” proteins. Because AlphaFold’s outputs are predictions, experimental methods such as X-ray crystallography remain the gold standard.

Beyond sequence and structure, we have function. Whether the target properties are therapeutic (affinity, immunogenicity, stability) or biomanufacturing-related (titer, rate, and yield), improving one often comes at the expense of another. This makes protein design a multi-objective optimization problem, and ML tools can help navigate the trade-offs.
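One common way to frame such trade-offs is the Pareto front: the set of candidates that no other candidate beats on every objective at once. The sketch below assumes every score is "higher is better" (a property like immunogenicity would need to be negated first). The variant names and scores are made up for illustration.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one.

    Assumes all objectives are 'higher is better'.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """candidates: dict of name -> tuple of scores. Returns the non-dominated names."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Hypothetical (affinity, stability) scores for four made-up variants.
variants = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.5, 0.5),
    "D": (0.2, 0.9),
}
print(pareto_front(variants))  # ['A', 'B', 'D']
```

Variant C drops out because B is better on both objectives. A, B, and D each represent a different trade-off, and choosing among them is a design decision rather than a computation.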

The best ML models will learn from experimental validation by being tightly integrated into iterative feedback loops with the wet lab. They will also need to handle the nuances of proteins, including their multi-state conformational space and their highly variable sequence lengths.