This week, continuing with the theme of biological foundation models, I want to write about the Evo 2 model that the Arc Institute released last week. As far as I can tell, though, the main difference between Evo 2 and its predecessor, Evo, is a set of technical improvements that let Evo 2 handle much longer sequences. And while those improvements are cool (maybe even worth writing about in the future), all the concepts I want to cover are already in the original model. So that’s what I’ll write about.
First, a quick recap: Large language models (LLMs) use a mechanism called attention to turn a sequence of “tokens” (words in natural language; nucleotides or amino acids in biological contexts) into embedding vectors whose dimensions/features encode the meaning of the sequence. These features usually aren’t interpretable on their own, but they can be used to predict other things, like the picture described by a piece of text or the distances between amino acids in a protein.
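To make the “sequence of tokens → embedding vectors” idea a bit more concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. Everything in it (the four-letter alphabet, the dimensions, the random weights) is made up for illustration; this is the general mechanism, not Evo’s actual architecture.

```python
# Toy self-attention: each position mixes in information from every other
# position, weighted by how similar their query and key vectors are.
import torch
import torch.nn.functional as F

vocab = {nt: i for i, nt in enumerate("ACGT")}   # hypothetical 4-letter alphabet
d_model = 8                                      # embedding width (arbitrary)

embed = torch.nn.Embedding(len(vocab), d_model)
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

tokens = torch.tensor([[vocab[c] for c in "GATTACA"]])   # shape (1, 7)
x = embed(tokens)                                        # (1, 7, d_model)

q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.transpose(-2, -1) / d_model**0.5          # (1, 7, 7) similarities
weights = F.softmax(scores, dim=-1)                      # attention weights
contextual_embeddings = weights @ v                      # (1, 7, d_model)
print(contextual_embeddings.shape)                       # torch.Size([1, 7, 8])
```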
To train an LLM, you take the errors in these predictions and apply an algorithm called backpropagation (which is just the chain rule from multivariable calculus, but that’s another post…) to update how the LLM calculates the embeddings. If you do this for long enough, and with enough data, the embeddings become very predictive.
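Here’s what that loop looks like in miniature: make a prediction, measure the error, and let backpropagation (PyTorch’s autograd, in this case) push gradients back into every parameter, including the embedding table itself. The tiny model and the fake labels are stand-ins, of course.

```python
# Minimal training loop: predict, compute the error, backpropagate, update.
import torch

model = torch.nn.Sequential(
    torch.nn.Embedding(4, 8),        # toy 4-letter alphabet -> 8-dim embeddings
    torch.nn.Flatten(),
    torch.nn.Linear(8 * 7, 1),       # predict one number per 7-token sequence
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

tokens = torch.randint(0, 4, (32, 7))   # fake batch of 32 sequences
labels = torch.randn(32, 1)             # fake regression targets

for step in range(100):
    preds = model(tokens)
    loss = torch.nn.functional.mse_loss(preds, labels)
    optimizer.zero_grad()
    loss.backward()      # chain rule: gradients flow back to every parameter,
    optimizer.step()     # including the embedding table
```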
This is exactly the trick that the Evo model uses, except that it does two important things that make it more general than other biological models, such as AlphaFold.
The first difference is that it predicts more than one thing. AlphaFold predicts protein structures (by first predicting the distances and relative orientations between amino acids). Evo does this too, but it also predicts the functions of proteins, loss of function from variants (by comparing the embedding vectors of the wild type and the mutant), and probably a few other things I’m missing.
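One simple way you might do that wild-type-versus-variant comparison: embed both sequences and measure how far apart the embeddings are. The “encoder” below is a random placeholder, and cosine similarity is just one reasonable choice of comparison; I don’t know the exact scoring Evo uses.

```python
# Score a variant by how much it shifts the sequence embedding.
import torch
import torch.nn.functional as F

encoder = torch.nn.Embedding(4, 16)          # stand-in for the real model
vocab = {nt: i for i, nt in enumerate("ACGT")}

def embed(seq):
    tokens = torch.tensor([vocab[c] for c in seq])
    return encoder(tokens).mean(dim=0)       # average over positions -> one vector

wild_type = embed("ATGGCCATT")
variant = embed("ATGGACATT")                 # single-nucleotide change

# Lower similarity -> the variant "means" something different to the model,
# which can serve as a proxy for loss of function.
score = F.cosine_similarity(wild_type, variant, dim=0)
print(score.item())
```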
So you have the core model that defines the embeddings, then multiple branches to secondary models defined by additional neural network layers. Each piece of training data has a label that fits one of the secondary branches. When you apply backpropagation, it updates the layers in that branch plus the layers in the core LLM. So if you do this with data for all the different branches, you get a single embedding that is predictive across all of those branches/tasks.
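Here’s a toy sketch of that shared-trunk, multiple-heads setup. The task names (structure_head, function_head), sizes, and shapes are all hypothetical; the point is just that a batch labeled for one task updates that task’s head plus the shared core.

```python
# One core model produces the embedding; each task gets its own small head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    def __init__(self, vocab_size=26, d_model=16, seq_len=10):
        super().__init__()
        # shared core: token embeddings plus a small encoder
        self.core = nn.Sequential(
            nn.Embedding(vocab_size, d_model),
            nn.Flatten(),
            nn.Linear(d_model * seq_len, d_model),
            nn.ReLU(),
        )
        # one head per task (names are made up for illustration)
        self.structure_head = nn.Linear(d_model, 3)   # e.g. some structural quantity
        self.function_head = nn.Linear(d_model, 5)    # e.g. a functional class

    def forward(self, tokens, task):
        h = self.core(tokens)                         # shared embedding
        if task == "structure":
            return self.structure_head(h)
        return self.function_head(h)

model = MultiTaskModel()
tokens = torch.randint(0, 26, (4, 10))

# A batch labeled for the "function" task updates that head plus the core.
logits = model(tokens, task="function")
loss = F.cross_entropy(logits, torch.randint(0, 5, (4,)))
loss.backward()   # gradients land in function_head and in the shared core
```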
The second improvement is that Evo makes these predictions for DNA and RNA as well as for proteins. How to do this isn’t immediately obvious, because DNA and RNA are sequences of nucleotides, not amino acids. An LLM can only work with one “alphabet” of tokens; you can’t just feed it sequences from a different alphabet.
Evo’s developers got around this issue by giving the model a larger alphabet, consisting of the 22 amino acids plus the four nucleotides, for a single 26-character alphabet. (Or maybe 27? I don’t know if they treat T and U as the same…)
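As a sketch of what a combined alphabet might look like as a tokenizer: put amino acids and nucleotides into one vocabulary so protein and nucleotide sequences share a single embedding table. I’m using the 20 standard amino acids and lowercasing the nucleotides here to dodge the fact that A, C, G, and T are also amino acid codes; I don’t know how the actual tokenizer handles that.

```python
# Hypothetical shared vocabulary covering amino acids and nucleotides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
NUCLEOTIDES = "acgu"                   # lowercase to keep them distinct here

vocab = {ch: i for i, ch in enumerate(AMINO_ACIDS + NUCLEOTIDES)}

def tokenize(seq):
    """Map a protein or nucleotide sequence onto the shared vocabulary."""
    return [vocab[ch] for ch in seq]

print(tokenize("MKTAY"))     # a protein fragment
print(tokenize("augcu"))     # an RNA fragment, same token space
```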
But that doesn’t solve the problem; it just creates a new one: since the training data is split between sequences that are entirely amino acids or entirely nucleotides, there’s a risk that the model will learn two separate embeddings in different parts of the embedding space.
So to fix this, the developers added another task to train the model on: alignment between DNA, RNA and proteins. This both gives them more data to train on and ensures that the embeddings all line up.
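I don’t know exactly how the alignment objective is implemented, but a standard way to pull paired embeddings into one shared space is a CLIP-style contrastive loss: matched (DNA, protein) pairs get pushed together, mismatched pairs get pushed apart. The encoders here are just random placeholder tensors.

```python
# CLIP-style contrastive alignment between two embedding sets.
import torch
import torch.nn.functional as F

batch, d_model = 8, 16
dna_embeddings = torch.randn(batch, d_model, requires_grad=True)      # stand-in encoder output
protein_embeddings = torch.randn(batch, d_model, requires_grad=True)  # stand-in encoder output

dna = F.normalize(dna_embeddings, dim=-1)
protein = F.normalize(protein_embeddings, dim=-1)

# similarity of every DNA sequence with every protein in the batch
logits = dna @ protein.T / 0.07          # 0.07 is a typical temperature choice
targets = torch.arange(batch)            # the i-th DNA pairs with the i-th protein

# symmetric cross-entropy: correct pairs should have the highest similarity
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
loss.backward()                          # nudges both sides toward one shared space
```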
If you combine those two changes with the ideas from GPT and AlphaFold that I wrote about over the last few weeks, you get a very high-level description of Evo, with lots of details left out or glossed over. But I think I’ve at least hit the important concepts.
If you like this kind of semi-technical, mostly conceptual discussion, let me know in the comments. And stay tuned for more next week.
I love how you break things down, Jesse! Curious to see what you find the most useful aspects of Evo2 -- functional genome/variant annotation?
This is wild. The idea of using a single embedding space for DNA, RNA and proteins feels like a step toward a universal biological language model. The workaround with the expanded alphabet is clever, kind of like forcing a multilingual AI to think in concepts rather than just words. Makes me wonder what else could be thrown into the mix.