What is a foundation model?
*** Quick question before we start: Would you be interested in joining a peer group of 6-8 people from similar biotech data roles/levels? I’ve been wanting to organize something like this for a while, but I didn’t think I could get enough participants to make it happen. You can prove me wrong by filling out this form to express your interest. If I get enough signups, I’ll start forming groups. ***
So far, my discussion of AI/LLMs has focused on applying natural language LLMs (like the one behind ChatGPT) to the operations of a biotech data team. This week, I want to start looking at more general “foundation models.” I put that in quotes because the term still has a fuzzy definition. So in this post, I want to explore what it actually means.
Every definition I’ve seen lists a number of factors that are generally true for foundation models. Each list of factors is slightly different, and most lists of foundation models include models that violate one or two of the factors in any given definition. But let’s look at the most common ones.
The factor that seems to be present in every definition is that a foundation model can be used for a wide range of different tasks. Think of it as the foundation for other, more narrowly scoped models. For example, the model behind ChatGPT can answer all kinds of different questions. Technically, it answers them all by performing the single task of predicting the next word(s) in a string of text. But no need to nitpick, right? The point is that the one task it does is very general and has lots of applications.
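To make that concrete, here’s a minimal sketch (assuming the Hugging Face transformers package and the small gpt2 checkpoint, chosen only because it’s easy to download, not because it’s any good at these tasks) of how one next-word-prediction model gets pointed at very different tasks just by changing the prompt:

```python
from transformers import pipeline

# One model, one task (predict the next tokens), many applications.
generator = pipeline("text-generation", model="gpt2")

prompts = [
    "Translate to French: Hello, how are you?",    # translation-style prompt
    "Q: What is the capital of Japan?\nA:",        # question answering
    "Summarize in one sentence: Mitochondria produce ATP for the cell.",  # summarization
]

for prompt in prompts:
    result = generator(prompt, max_new_tokens=20, do_sample=False)
    print(result[0]["generated_text"])
```

A modern instruction-tuned model does this far better than gpt2 will, but the interface is the same: every “task” is framed as continuing a string of text.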
The next factor is that foundation models are generally trained using unsupervised or self-supervised learning. The more common supervised learning is where you split each data point into features and labels, then train the model to predict the labels from the features. This won’t work for a foundation model because supervised learning only teaches the model a single task (predicting those labels). Again, this gets a little fuzzy because, for example, ChatGPT’s model is still trained on a single task - predicting the next word(s) in a string of text - it’s just that the “label” (the next word) comes from the data itself rather than from a separate annotation.
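Here’s a minimal sketch (plain PyTorch, with made-up toy values) of how the training data looks in each setup. In the self-supervised case, the targets are just the input sequence shifted by one position, so no separate annotation step is needed:

```python
import torch

# Supervised learning: features and labels are separate things you had to provide,
# e.g. measurements plus hand-annotated classes.
features = torch.tensor([[5.1, 3.5], [6.2, 2.9]])
labels = torch.tensor([0, 1])

# Self-supervised next-token prediction: the "labels" are the same sequence
# shifted by one position, so they come for free from the raw data.
tokens = torch.tensor([[11, 42, 7, 19, 3]])  # a tokenized sentence (toy IDs)
inputs = tokens[:, :-1]    # the model sees:    11 42  7 19
targets = tokens[:, 1:]    # and must predict:  42  7 19  3
```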
After that, a factor you’ll see a lot is that foundation models are very large, both in terms of the number of trained parameters that make up the model and the volume of data that goes into training those parameters. (The two go hand in hand.) Hence the “large” in Large Language Model (LLM). Scale is generally necessary to get a model that can accomplish multiple tasks, but it seems like more of a correlation than a fundamental requirement for a foundation model.
And the final factor that you’ll see a lot is that foundation models tend to be based on transformers, a type of model architecture designed to operate on sequences, typically by predicting the next or missing term(s) in them. For a natural language LLM, it’s a sequence of words. But it could also be a sequence of amino acids, nucleotides, etc. So most of the biological foundation models that you see, for example on this list, are for DNA, RNA and proteins. But you’ll also see “foundation models” for things like single cell data (at the bottom of that list) which aren’t sequences, and which often use other model architectures besides transformers. So this is by far the fuzziest of the four factors.
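To show what “sequence” means outside of natural language, here’s a minimal sketch (plain PyTorch, with a toy vocabulary, an untrained encoder, and a hypothetical tokenize helper of my own invention) of treating a protein as a sequence of amino-acid tokens, the same way an LLM treats text as a sequence of word tokens:

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(protein: str) -> torch.Tensor:
    """Map each amino acid to an integer ID, just like a text tokenizer."""
    return torch.tensor([[vocab[aa] for aa in protein]])

embed = nn.Embedding(len(vocab), 64)           # residue -> 64-dim vector
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

tokens = tokenize("MKTAYIAKQR")                # a short toy sequence
embeddings = encoder(embed(tokens))            # one vector per residue
print(embeddings.shape)                        # torch.Size([1, 10, 64])
```

A real protein foundation model works on the same principle, just with trained weights and vastly more sequence data behind it.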
So, this gives us a rough idea of what counts as a foundation model, while leaving plenty of wiggle room. Ultimately, the factor that’s most important - both in terms of how the term is used and why people care - is the idea of being able to train one model for multiple tasks. In the next few posts, I’ll start to explore what this looks like for biology/biotech. Stay tuned!