*** Quick question before we start: Would you be interested in joining a peer group of 6-8 people from similar biotech data roles/levels? We would meet every 2-4 weeks (depending on the group’s preference) to discuss problems and questions that each group member is facing. I’ve gotten enough interest so far for a few groups, but I wanted to give everyone another chance to sign up by filling out this form to express your interest. ***
Last week, I wrote about four factors that tend to be in most definitions of a Large Language Model (LLM) or Large (Whatever) Model (LxM). Of the four, there’s one that’s included not because it follows from the definition, but because it was the key to making (most of) these models work, and it’s probably very relevant to biotech data. It also broke one of my core beliefs about how progress is made in machine learning. (More on that at the end.) So this week I wanted to explore that factor in detail: the use of unsupervised or semi-supervised learning instead of supervised learning.
Supervised learning includes things like sentiment analysis of text: You build a model that tells you whether a given sentence is saying something positive or negative about its subject. Independent of the code that might allow you to do this, you need something else to train and evaluate such a model: You need example sentences that a person has tagged as positive or negative.
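To make that concrete, here’s a minimal sketch of what supervised sentiment analysis looks like in practice, using scikit-learn; the sentences and labels are made up for illustration. The point is the last ingredient: every training sentence needs a label that a person had to supply.

```python
# A toy supervised sentiment classifier (hypothetical data, minimal sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

sentences = [
    "I loved this assay kit",                 # a person tagged this as positive
    "the reagents arrived expired",           # a person tagged this as negative
    "great support from the vendor",          # positive
    "results were impossible to reproduce",   # negative
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative -- the expensive part

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)       # cheap: just the text itself
model = LogisticRegression().fit(X, labels)   # training requires the labels

print(model.predict(vectorizer.transform(["the vendor support was great"])))
```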
Those tags are called labels, and compared to scraping the sentences themselves off of any site you find on the internet, creating the labels is very expensive. So historically, these labels have been the bottleneck to creating large training sets, making the size of training sets the bottleneck to better models.
The clearest example of this is image recognition (which is supervised learning). Convolutional Neural Networks (CNNs) became the de facto standard for this task after 2012. But CNNs were not new: their earliest precursors go back to Fukushima’s work in the late 1960s, and the modern convolutional architecture to his neocognitron in 1980. What had changed was the creation of ImageNet, a massive set of labeled images that made it possible to actually train them.
Based on this and other examples, and on my intuition for how ML training works, I always assumed that any future ML breakthroughs would come from new sources of (labeled) data, particularly in biotech, where readout technology is advancing so quickly.
LLMs broke this assumption because in this case, the data was already there. It was the model that was new.
The story actually starts with the word2vec model that was published in 2013. This model is trained to predict a word in a sentence from the words just before and after it. So it’s still taking inputs and making a prediction. But in practice, what this means is that you don’t need to label the sentences any more. Your training set is all the text you can find on the internet. Since we’re not adding additional labels, we call it unsupervised.
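Here’s a rough sketch of what that looks like, with a toy tokenizer and window size rather than word2vec’s actual preprocessing: the training pairs fall straight out of the raw text, with no human labeling required.

```python
# Building word2vec-style (context -> target) training pairs from unlabeled text.
text = "the model predicts a word from the words around it"
tokens = text.split()
window = 2  # number of words to look at on each side of the target

pairs = []
for i, target in enumerate(tokens):
    # The surrounding words are the input; the word itself is the "label" -- for free.
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs[:3]:
    print(context, "->", target)
```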
The big limitation of word2vec was that it could only look at a small number of words right before and after the word you care about. Without any information from outside that window, it wasn’t actually very good at predicting text. So this model was mostly used to train word embeddings, which is a whole different topic.
The breakthrough that made LLMs possible was a paper called Attention Is All You Need, which proposed the key idea: You can train the model itself to identify which words in a sentence are relevant to predicting the next word, and then use those words to make the prediction. This model is called a transformer, and it turned out to magically work. After running with this idea for five more years, we got ChatGPT.
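To give a flavor of what attention actually computes, here’s a minimal NumPy sketch of the scaled dot-product attention at the core of the transformer, stripped of the learned projections, multiple heads, and masking the real architecture uses: each word scores how relevant every other word is to it, then takes a weighted average of their representations.

```python
# Minimal single-head scaled dot-product attention (illustrative only).
import numpy as np

def attention(Q, K, V):
    # How relevant is each word (row of K) to each word (row of Q)?
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Softmax over positions turns scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each word's new representation is a weighted mix of the relevant words.
    return weights @ V

# Toy example: a "sentence" of 5 words, each an 8-dimensional vector.
x = np.random.randn(5, 8)
out = attention(x, x, x)   # self-attention: queries, keys, values from the same sentence
print(out.shape)           # (5, 8): one updated representation per word
```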
Now, you could argue that what attention/transformers/unsupervised learning did was to turn data that couldn’t be used for training into data that could be. So in some sense the data was still the bottleneck. But I don’t think it counts.
Maybe the real lesson is that while data is usually the bottleneck, a better model may be the thing that unlocks it.
Next time I’ll start exploring what this all has to do with biology.
LSTMs could be trained in a self-supervised way, just not efficiently. Transformers allowed training to be parallelized, which made it possible to scale up model size, and that scaling was the main breakthrough.
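A toy NumPy sketch of that point (not a real LSTM or transformer): a recurrent model has to walk through the sequence one step at a time, each step waiting on the previous hidden state, while attention handles every position in one batched matrix product that a GPU can parallelize.

```python
# Sequential recurrence vs. parallel attention (toy illustration).
import numpy as np

T, d = 512, 64                    # sequence length, hidden size
x = np.random.randn(T, d)
W = np.random.randn(d, d) * 0.01

# Recurrent-style: T dependent steps; step t cannot start until step t-1 finishes.
h = np.zeros(d)
for t in range(T):
    h = np.tanh(x[t] + h @ W)

# Attention-style: every position handled at once in a few matrix products.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ x                 # all T positions updated in one shot
```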