In the last few posts I’ve been exploring how different LLMs and related models work in biology and elsewhere. And while that’s cool and exciting and all, I wanted to take a step back this week and share a note of caution. Because as exciting as this all is, there is still a legitimate risk that drug discovery using these models could actually turn out worse than drug discovery without them.
Sure, these new generative models are far more accurate than any models that existed before. In some cases they’re making predictions that no previous model could make at all. But just because a model can beat a benchmark in an academic paper doesn’t mean it will make a difference in getting a drug to market faster, cheaper or with a lower failure rate.
In fact, more accurate models could actually increase the risk of failure in later stages, depending on what they’re accurate about. Because the drug discovery process is a sequence of increasingly accurate and expensive proxies, and over-fitting to proxies is much more dangerous than under-modeling them.
The only metrics we really care about in drug discovery are the ones that get measured in the Phase 3 clinical trial: large-scale safety and efficacy in humans. But to get there, we first have to measure things that are correlated with those things: in-vitro binding assays, in-cell functional assays, in-vivo animal models, etc. Even small-scale Phase 1 and 2 human studies are just proxies for Phase 3.
The benefit of these proxies is that they’re cheaper and faster (and don’t put human subjects at excessive risk). The drawback is that the correlation between the proxy and what you really want to measure can break down, particularly when there are factors that you didn’t account for.
So the problem for these LLMs and generative models is that they’re trained almost exclusively on data from early-stage proxies. In the last few posts, it was protein structures, impact on gene expression, cellular phenotypes, and a few others. These models need a lot of training data, and there simply haven’t been enough Phase 3 clinical trials to serve as a training set on their own.
Now, on their own, proxies aren’t a problem. In fact, they’re the basis for modern drug discovery. What makes proxies a problem is that ML models are really good at over-fitting. They will find any pattern in your data, whether it’s supposed to be there or not. And in particular, when you’re working with proxies, they’ll find a lot of patterns that are more about the proxy than the thing you care about.
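A toy simulation makes this failure mode concrete. Everything here is hypothetical, with made-up features and effect sizes: imagine each compound has one property that actually drives efficacy in humans, plus an assay-specific quirk that inflates the proxy readout but means nothing in Phase 3. A model fit to the proxy label will happily learn both, and its apparent accuracy collapses when scored against the true endpoint.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical compound features:
signal = rng.normal(size=n)    # property that actually drives efficacy in humans
artifact = rng.normal(size=n)  # assay-specific quirk, irrelevant in the clinic
X = np.column_stack([signal, artifact])

# The proxy (e.g. an in-vitro assay) responds to both; the true endpoint
# (Phase 3 efficacy) responds only to the real signal.
proxy = signal + artifact + 0.1 * rng.normal(size=n)
truth = signal + 0.1 * rng.normal(size=n)

# Fit an ordinary least-squares model on the proxy labels,
# as if the assay readout were the thing we cared about.
w, *_ = np.linalg.lstsq(X, proxy, rcond=None)
pred = X @ w  # the model learns to weight the artifact heavily

r_proxy = np.corrcoef(pred, proxy)[0, 1]
r_truth = np.corrcoef(pred, truth)[0, 1]
print(f"correlation with proxy label:  {r_proxy:.2f}")
print(f"correlation with true endpoint: {r_truth:.2f}")
```

The model looks nearly perfect on the proxy (correlation ~0.99) but much weaker on the true endpoint (~0.70 here), because roughly half of what it learned was the proxy-only component. A more flexible model with more proxy data would only chase that artifact harder, which is the sense in which accuracy on the proxy can be actively misleading.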
More classical approaches to analyzing data from these proxies involve a person whose years of experience give them a better chance of distinguishing real signals from spurious ones. When you take that person out of the loop, you increase the chances of wasting time and resources on patterns that won’t hold up.
So when folks in biotech are skeptical of the impact that these models will actually have on the bottom line, this may be what they have in mind. In practice, it will probably be a mixed bag of good and bad. But given that it takes on the order of a decade to get from early discovery to Phase 3, it’ll be a while before we know for sure.