Attention works for proteins too
Now that we’re finally ready to dig into biological foundation models, I want to start with what might’ve been the first one - certainly the first one to get widespread attention: AlphaFold. It’s a very complex model of a very complex subject, so I can’t cover anywhere close to the whole thing. But I want to address what I think is the most interesting part: How AlphaFold translates the idea of attention from transformers/LLMs into geometric information about proteins.
If you already know the basic idea behind protein folding, you can skip to the heading that says “The AlphaFold Part Starts Here”. Otherwise, here’s what you’ll need for the discussion below:
Proteins are long chains of amino acids, with hinges between them that let the chain rotate at each joint. This means the protein can flop around, so that amino acids that are far apart in the sequence can end up close together in space. Some pairs of amino acids attract each other while others repel. When a protein is in a cell, these attractions and repulsions hold it together in a mostly rigid shape. (The word “mostly” is there because, in practice, many proteins can flex in specific ways. But we’re going to ignore that.)
There are ways to work out these rigid structures empirically, but they’re extremely slow and expensive. So AlphaFold is one in a long history of attempts to predict protein structures from a combination of what physics tells us about the attractions and repulsions and what we’ve observed in other proteins.
Now, you’d think the physics should be enough to figure this out: just model the amino acids as if springs were pulling them together or pushing them apart, let the springs do their work, and see where the chain settles. But this doesn’t work, because the amino acids get in each other’s way. The simulated proteins tend to get stuck in “local minimum” configurations that aren’t anywhere close to what they’re supposed to look like.
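To make “local minimum” concrete, here’s a toy example of my own (nothing to do with proteins or AlphaFold): plain gradient descent on a one-dimensional energy curve with two wells. Start on the wrong side of the barrier and it settles into the shallower well, because going downhill can never carry it over the hump to the deeper one.

```python
# Toy 1D "energy landscape" with two wells: a shallow one near x = -1.5
# and a deeper one near x = +1.35. Purely illustrative.
def energy(x):
    return 0.25 * x**4 + 0.1 * x**3 - x**2 - 0.3 * x

def grad(x):
    return x**3 + 0.3 * x**2 - 2 * x - 0.3

x = -2.0  # start on the left side of the central barrier
for _ in range(500):
    x -= 0.01 * grad(x)  # plain gradient descent: it can only go downhill

print(f"settled at x = {x:.2f} with energy {energy(x):.2f}")
# -> settles in the shallow left well (energy ~ -0.87) even though the right
#    well is lower (~ -1.15); the descent can't climb the barrier in between.
```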
Also, technically, we don’t know whether the configurations we see in nature are actually the most efficient (“lowest energy”) configurations possible. Scientists assume that the stochastic jostling of the amino acids as the protein forms will explore all the possible configurations and settle on the most efficient one. But given that proteins are constructed one amino acid at a time, it’s entirely possible that they, too, get stuck in a local minimum in nature.
\/ \/ \/ The AlphaFold Part Starts Here \/ \/ \/
So the main problem AlphaFold needs to solve isn’t getting the configuration of the amino acids exactly right. If it can get close enough to the right configuration to get past the other local minima, the physics simulation will do the rest.
There are two more major insights that I think of as the key to AlphaFold. The first is that if you can estimate the distances between all pairs of amino acids, along with their relative orientations, you can reverse-engineer the overall structure from them. The second is that these pairwise distances and orientations look a lot like the pairwise relationships computed by the attention mechanism in a transformer, which I wrote about a couple of weeks ago.
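To see why pairwise distances are enough to pin down a structure, here’s a minimal sketch of that reverse-engineering step - my own toy example using classical multidimensional scaling, not AlphaFold’s actual structure module. Given a complete distance matrix for a handful of points, it recovers 3D coordinates whose pairwise distances match, which is as much as distances alone can determine (the result is only defined up to rotation, reflection, and translation):

```python
import numpy as np

rng = np.random.default_rng(0)
true_coords = rng.normal(size=(10, 3))            # 10 "residues" in 3D
D = np.linalg.norm(true_coords[:, None, :] - true_coords[None, :, :], axis=-1)

# Classical multidimensional scaling: turn squared distances into a Gram
# matrix, then read coordinates off its top three eigenvectors.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n               # centering matrix
B = -0.5 * J @ (D**2) @ J                          # Gram matrix of centered coords
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:3]                # three largest eigenvalues
recovered = eigvecs[:, top] * np.sqrt(eigvals[top])

# Check: the recovered pairwise distances match the originals (the coordinates
# themselves only match up to rotation/reflection/translation).
D_rec = np.linalg.norm(recovered[:, None, :] - recovered[None, :, :], axis=-1)
print(np.allclose(D, D_rec, atol=1e-6))            # True
```

AlphaFold doesn’t literally run MDS on its noisy predictions, but this is the sense in which pairwise geometry pins down the overall shape.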
In a natural language LLM, attention tells the model which words in a partial sentence matter most for predicting the next word. In other words, it’s a way of identifying relationships between pairs of words. Well, it turns out that the same general approach can identify relationships between pairs of amino acids in a protein - relationships that can then be translated into distances and orientations.
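As a reminder of what that pairwise object looks like, here’s bare-bones scaled dot-product attention applied to a stack of per-residue embeddings (the shapes and names are mine, not AlphaFold’s). The output is an L x L matrix with one weight for every ordered pair - exactly the kind of “relationship between pairs” the model builds on:

```python
import numpy as np

def attention_weights(X, Wq, Wk):
    """Scaled dot-product attention weights for a sequence of embeddings X (L, d)."""
    Q, K = X @ Wq, X @ Wk                        # project embeddings to queries/keys
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (L, L) pairwise similarity scores
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row

rng = np.random.default_rng(0)
L, d = 6, 8                                      # 6 "residues", 8-dim embeddings
X = rng.normal(size=(L, d))
A = attention_weights(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(A.shape)   # (6, 6): one weight for every ordered pair of residues
```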
In practice, the kind of attention that AlphaFold uses is very different from natural language attention. In fact, it introduces a type of attention called triangle attention that evaluates relationships between triples of amino acids. But the high-level mechanism is the same.
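Here’s a loose sketch of the triangle idea, modeled on the triangle multiplicative update from the AlphaFold 2 paper but stripped of gating, normalization, and everything else (the function and weight names are mine). The point is just the indexing: the feature for pair (i, j) is updated from the features for pairs (i, k) and (j, k), so every update runs over triangles of amino acids:

```python
import numpy as np

def triangle_update(Z, Wa, Wb, Wo):
    """Simplified triangle-style update on pair features Z of shape (L, L, c).

    The new value at (i, j) is built from pairs (i, k) and (j, k), summed over
    the third residue k - so every update looks at triangles of residues.
    (Loosely modeled on AlphaFold 2's triangle multiplicative update, minus
    the gating and normalization.)
    """
    A = Z @ Wa                              # (L, L, c) projected "left edges"
    B = Z @ Wb                              # (L, L, c) projected "right edges"
    T = np.einsum('ikc,jkc->ijc', A, B)     # combine (i,k) with (j,k) over k
    return T @ Wo                           # project back to the pair channels

rng = np.random.default_rng(0)
L, c = 5, 4                                 # 5 residues, 4 pair-feature channels
Z = rng.normal(size=(L, L, c))
W = [rng.normal(size=(c, c)) for _ in range(3)]
print(triangle_update(Z, *W).shape)         # (5, 5, 4)
```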
So the overall process for a sequence of amino acids looks roughly like this (there’s a toy code sketch after the list):
1. AlphaFold calculates embedding vectors for each amino acid, based on the sequence before and after it, analogous to the way an LLM’s embeddings capture the meaning and intent of a word in a sentence.
2. AlphaFold calculates the attention between pairs of amino acids based on these embeddings.
3. AlphaFold converts these attention values into estimates of the distance and relative orientation for each pair.
4. Because these calculations are all done by layers of a neural network, AlphaFold can train the whole model, all the way back to the embeddings, by backpropagation against the actual distances and relative orientations of proteins with known structures. Then, when it’s time to predict a new structure, it can use the outputs as a starting point for refinement.
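Putting the four steps together, here’s a toy end-to-end sketch in PyTorch. It’s my own invented miniature, not AlphaFold’s architecture: residue embeddings go through a small attention encoder, pair features are built from query/key projections, a linear head turns each pair feature into a distribution over distance bins, and the whole thing trains by backpropagation against binned distances from a “known” (here, random) structure.

```python
import torch
import torch.nn as nn

class TinyFold(nn.Module):
    """Toy pipeline: sequence -> embeddings -> pair features -> distance bins."""
    def __init__(self, n_aa=20, d=32, n_bins=16):
        super().__init__()
        self.embed = nn.Embedding(n_aa, d)                 # step 1: per-residue embeddings
        self.encoder = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.to_q = nn.Linear(d, d)
        self.to_k = nn.Linear(d, d)
        self.head = nn.Linear(d, n_bins)                   # step 3: pair -> distance bins

    def forward(self, seq):                                # seq: (1, L) residue indices
        x = self.encoder(self.embed(seq))                  # (1, L, d) contextual embeddings
        q, k = self.to_q(x), self.to_k(x)
        pair = q.unsqueeze(2) * k.unsqueeze(1)             # step 2: (1, L, L, d) pair features
        return self.head(pair)                             # (1, L, L, n_bins) distance-bin logits

# Fake training data: a random sequence and its "true" binned distance matrix.
L, n_bins = 12, 16
seq = torch.randint(0, 20, (1, L))
true_bins = torch.randint(0, n_bins, (1, L, L))

model = TinyFold(n_bins=n_bins)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(10):                                     # step 4: backprop end to end
    logits = model(seq)
    loss = nn.functional.cross_entropy(logits.reshape(-1, n_bins), true_bins.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")
```

The binned-distance (“distogram”) target is genuinely part of how AlphaFold is trained, but everything else here - the sizes, the pair-feature construction, the training loop - is drastically simplified for the sake of a readable example.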
So, again, I obviously simplified a lot of things and glossed over even more. But my goal was to give you a rough idea of how AlphaFold adapts the idea of attention from natural language models to predict protein structures.
I’ll dig into more biology foundation models next time, so stay tuned!