What if Alphafold could use Google?
My last couple of posts explored how the basic techniques of natural language LLMs can be repurposed to create models that make predictions about biological sequences - DNA, RNA and proteins. But it turns out there are a few tricks from natural language models that don’t translate to biological LLMs, or at least not directly. And this week I want to explore one of them: Retrieval Augmented Generation (RAG).
In general, RAG is any algorithm in which information is retrieved from a source and used to augment a generative model. (You probably guessed something like that from the name.)
The most common form of this in natural language LLMs is roughly what ChatGPT does when you see “Searching the web…” pop up while it’s thinking: Behind the scenes, it does a web search (Bing, not Google, for contract reasons), downloads the first few pages that come up, and then uses them to build the prompt it actually sends to the LLM. Something like:
“I did a web search and found the following results: [Insert all the pages]. Use this information to answer the following question: [Your original query]”
In other contexts, you can do the same thing with information from other sources, such as your company’s internal document store. You just need a way for the LLM framework to search them.
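In code, that whole pattern fits in a few lines. Here’s a minimal, toy sketch: the “embeddings” are just word counts, the documents and query are made up, and the final string would go to whatever LLM you’re using. A real setup would swap in a proper embedding model and vector store, but the retrieve-then-prompt logic is the same.

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a bag-of-words count vector. A real system would use
    # a learned embedding model, but the retrieval logic is the same.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    # Rank the documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def augmented_prompt(query, documents):
    # Same trick as the web-search version: stuff the retrieved text
    # into the prompt ahead of the user's actual question.
    context = "\n\n".join(retrieve(query, documents))
    return (
        f"I searched our documents and found the following:\n{context}\n\n"
        f"Use this information to answer the following question: {query}"
    )

docs = [
    "The quarterly report shows revenue grew 12% year over year.",
    "Our refund policy allows returns within 30 days of purchase.",
    "The cafeteria menu changes every Monday.",
]
print(augmented_prompt("What is our refund policy?", docs))
# The resulting string is what actually gets sent to the LLM.
```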
But what if, instead of a folder of documents, you had a collection of protein structures that weren’t publicly known? Could you use a trick like RAG to get more accurate structure predictions from Alphafold (or any other model)?
As far as I can tell, the answer is yes and no.
It’s “no” because you can’t just put all the sequences together into a longer sequence and expect something good to come of it. Because Alphafold can’t interpret natural language, there’s no way to say “Here’s some background information, here’s the actual protein” within the prompt.
There are, however, ways to do this outside the prompt. In fact, according to this paper - Retrieved Sequence Augmentation for Protein Representation Learning - Alphafold is already doing this.
It turns out I glossed over an important step in my post on Alphafold: Before feeding a sequence into the core model, Alphafold runs a Multiple Sequence Alignment (MSA) step, in which it takes the query sequence and searches big sequence databases for similar, evolutionarily related sequences. It then lines them all up so that corresponding positions sit in the same columns, and that stack of aligned sequences is what the core model actually reads alongside the query.
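To make the “aligned columns” idea concrete, here’s a toy example (made-up sequences, not real proteins). Each row is one retrieved sequence, “-” marks a gap inserted so corresponding positions line up, and the model gets to read down the columns as well as along the rows - for instance, to spot positions that never change across evolution.

```python
# A toy multiple sequence alignment (made-up sequences, not real proteins).
msa = [
    "MKT-AYIAKQR",   # query sequence
    "MKTQAYIAKQR",   # close relative
    "MRT-AYLAKQR",   # more distant relative
    "MKT-AFIAK-R",   # another relative
]

# One thing a model can read off an MSA: which columns are perfectly
# conserved (often structurally or functionally important positions).
for col in range(len(msa[0])):
    column = [row[col] for row in msa]
    conserved = "-" not in column and len(set(column)) == 1
    print(col, "".join(column), "conserved" if conserved else "")
```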
The paper I linked above argues that this is effectively RAG. So if you had some proprietary protein structures, you could potentially get better results by inserting their sequences at the MSA step.
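If your pipeline lets you supply a custom alignment file (some Alphafold front ends accept one in a FASTA/A3M-style text format), splicing in your own sequences could look something like the hypothetical sketch below. The file paths, sequence names and sequences are all made up, and a real pipeline would also need the added rows aligned to the query - this is just the shape of the idea.

```python
# Hypothetical sketch: splice in-house sequences into the MSA that a
# structure-prediction pipeline will consume. Assumes the pipeline accepts
# a custom alignment in a FASTA/A3M-style text format; paths, names and
# sequences below are made up.

proprietary = {
    "internal_variant_1": "MKTQAYIAKQR",
    "internal_variant_2": "MRTQAYLAKQR",
}

def append_to_msa(msa_path, extra_sequences, out_path):
    # Read the alignment produced by the normal MSA search...
    with open(msa_path) as f:
        msa_text = f.read()
    # ...and tack the in-house sequences onto the end, in the same
    # >name / sequence format. (A real pipeline would also need these
    # rows aligned to the query; this sketch skips that step.)
    extra = "".join(f">{name}\n{seq}\n" for name, seq in extra_sequences.items())
    with open(out_path, "w") as f:
        f.write(msa_text + extra)

# append_to_msa("query.a3m", proprietary, "query_augmented.a3m")
```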
But the paper also shows that you don’t actually need the alignment part. If you train your model by just feeding in all the retrieved sequences separately (glossing over lots of details), they claim you can get results that are just as accurate. It’s still not exactly the same as throwing everything into the prompt like ChatGPT does, because the model has to be specially trained. But it’s getting closer.
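As a rough illustration of “retrieval without alignment”: you score database sequences for similarity to the query, keep the best few, and hand them to the model as separate inputs - no columns, no gaps. The k-mer “embedding” here is a toy stand-in for the learned retriever the paper uses; none of this is their actual code.

```python
from collections import Counter
import math

def kmer_embed(seq, k=3):
    # Toy sequence "embedding": counts of overlapping k-mers. A learned
    # dense retriever would do this job in the paper's setup.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b):
    dot = sum(a[x] * b[x] for x in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_sequences(query, database, k=2):
    # Rank database sequences by similarity to the query; no alignment step.
    q = kmer_embed(query)
    return sorted(database, key=lambda s: similarity(q, kmer_embed(s)), reverse=True)[:k]

database = ["MKTQAYIAKQRLT", "MRTQAYLAKQRLS", "GGGSGGGSGGGS", "MKTAAFIAKERLT"]
query = "MKTQAYIAKQRLS"

# The retrieved sequences would be encoded separately and combined with
# the query by a model trained for exactly that - which is why this only
# works if the model is trained with retrieval in the loop.
print(retrieve_sequences(query, database))
```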
So I guess the answer to “What if Alphafold could use Google?” is that it already kind of does…