Some of you may have noticed, as I’m going through different use cases for biotech software, that I spent one measly week on analysis, but I’m about to spend multiple weeks on data (and metadata) collection. Meanwhile, if you look at what everyone else is writing about, it’s LLMs and AlphaFold and generative models and so on, with very little about data collection. And I get it - those things are all much more interesting and flashy, and fun to think about. But the fact is, when it comes to what’s most important to an AI/data-driven biotech, it’s almost never the models. So this week, before we dig into data (and metadata) collection in all its gory details, I want to explain why I think data matters more than models.
Pretty much every ML/DS project I’ve been involved with, in both academia and industry, has worked more or less the same way: You set up the data, the labels, the train/test split, etc., and you select half a dozen models to compare to each other. One or two of the models, the ones you threw in as straw-man baselines, may not do so great. But the rest of the models, once you’ve done some hyperparameter tuning, all tend to land within a few percentage points of each other. Sometimes the gap is enough to legitimately say that one model is better than the others, but usually it’s too small to actually matter in real life. If you want to see a large-scale example of this, look at the leaderboard of almost any Kaggle competition. There are exceptions, of course, but this describes the vast majority of cases.
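To make that concrete, here’s a minimal sketch of this workflow, using scikit-learn and a synthetic dataset (both my choices for illustration, not anything from a real project):

```python
# A minimal sketch of the typical compare-several-models workflow,
# assuming a generic tabular classification task.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Set up the data, the labels, and the train/test split.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}

# The straw-man baseline lags badly; the real models typically land
# within a few percentage points of each other.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```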
These models are bumping up against an information-theoretic limit of the data itself. And there’s a name for what happens when you try to push past that limit: overfitting.
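Here’s a toy illustration of what that limit looks like. Suppose (my assumption for the example, not anything from a real dataset) that 20% of the labels are randomly flipped: no model can score much above 80% on held-out data, and an over-flexible model that pushes past 80% on the training set is just memorizing noise:

```python
# A toy illustration of an information-theoretic limit: 20% of labels are
# randomly flipped, so ~80% test accuracy is the ceiling for any model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 5))
y_true = (X[:, 0] > 0).astype(int)      # the learnable signal
flip = rng.random(4000) < 0.2           # 20% irreducible label noise
y = np.where(flip, 1 - y_true, y_true)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):  # a shallow tree vs. an unlimited-depth tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")

# Typical output: the shallow tree tests near 0.80 (the noise ceiling),
# while the unlimited tree trains near 1.00 but tests noticeably lower.
# That train/test gap is overfitting: fitting noise past the limit.
```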
Today, there is no shortage of ideas for how to make more complex and potentially more powerful models. The basic theory of neural networks defines a whole world of model architectures that’s still only partially explored. And every university with a CS department is full of folks with even more ideas. But until you try those ideas on actual data, they’re just ideas.
So there are essentially three ways you can make progress in AI/ML/DS:

1. When new data with better information-theoretic limits becomes available, you can try out new model ideas that would overfit on previously available datasets.
2. Once new models are validated, you can look for ways to apply them to more limited datasets to get a little closer to those information-theoretic limits.
3. You can generate/collect/identify new datasets with better information-theoretic limits.
Strategy (1) is the most fun, the easiest to get published, and probably the best-looking on your resume. But biotech data is much messier and more expensive than data in most other domains, particularly tech domains like optimizing ad revenue. So (1) is mostly going to happen in those other areas with the largest, most information-rich datasets. That means that most of the new models in biotech will come from strategy (2). We’re currently seeing this with all the “LLMs for biotech” discourse, and it will be interesting to see how it pans out. Maybe we haven’t yet hit the information-theoretic limits of the available biotech datasets. We’ll find out soon enough.
But that leaves us with strategy (3), which is not only the hardest, most expensive, and least intellectually stimulating option, but also the least glamorous, and thus the hardest to publish on. So for individuals, this isn’t a very appealing strategy. But for an organization, say a biotech startup that wants to create a “defensible moat” (as the VCs like to say), this should be the most appealing option for one simple reason: Ideas are cheap and plentiful. You shouldn’t bother trying to prevent others from stealing your ideas because if they’re good ideas, other people will come up with them independently. Model architectures, until they’re validated on data, are just ideas, so the same applies to them.
Therefore, strategies (1) and (2) won’t give your startup a moat. Only (3) will. But you can’t leave it all to the biologists because they don’t think about data the way data scientists do. You need someone who understands both what it means for data to be more information-rich (even if they don’t use those exact words) and the practical requirements (such as metadata) that will make the data usable. In other words, you still need data scientists for strategy (3). But they should be spending their time designing datasets instead of models.
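What might “designing datasets” look like in practice? Here’s one small, hypothetical example: deciding up front which metadata every measurement must carry before it’s allowed into the dataset. All of the field names here are invented for illustration:

```python
# A hypothetical sketch of dataset design: every assay measurement must
# carry the metadata that keeps it usable downstream. Field names are
# invented for illustration, not from any real schema.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class AssayRecord:
    sample_id: str          # links back to the physical sample
    batch_id: str           # lets models control for batch effects
    instrument_id: str      # instrument drift is a real confounder
    protocol_version: str   # protocols change; results must stay comparable
    collected_on: date      # collection time is a common hidden covariate
    operator: str           # who ran the assay
    measurement: float      # the value itself, last and least

    def __post_init__(self):
        # Reject records missing the metadata that makes them usable;
        # an unusable measurement is worse than no measurement.
        for value in (self.sample_id, self.batch_id, self.protocol_version):
            if not value:
                raise ValueError("required metadata is missing")
```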