As part of the new direction that I’ve been taking this newsletter in, I’ve been exploring a bunch of different scientific applications of AI in pharma and biotech. In the process, I’ve noticed that there are three roughly distinct categories of models that can often be used to solve the same problem. So this week I’m going to describe what they are, and look at a couple of examples of how they can be applied.
In between writing these posts, I’m developing a resource to help pharma and biotech leaders quickly assess the risk and ROI of opportunities to apply GenAI in drug discovery. If you’d like to be among the first to check it out when it’s ready, fill out this form.
Here are the three categories:
Mechanistic: These are models that try to directly model what’s happening with the biology. They may use experimental data to define or tune the model, but typically don’t use experimental data collected for the immediate question.
Black box: These are models that look for correlations in data, often from one or a small number of experiments designed to address the immediate question. They typically don’t model the biology or use external data in a significant way.
Knowledge: These are models that directly use curated information extracted from structured resources like ChEMBL, or from unstructured sources like research papers. They may model biological mechanisms to some extent, but it’s usually very limited. And they rarely use experimental data.
So first, I want to be clear that this is a general pattern, not a hard and fast rule. There are definitely exceptions. But I think as a broad pattern, it’s a useful categorization.
One example where all three kinds of models can be used to solve the same problem is target identification: Given a disease indication, find a list of proteins that can potentially be regulated to cure or mitigate the disease.
The obvious place to start is with a Knowledge model: Find every protein that has been associated with the disease in a published academic paper, research study, etc. There are structured resources that will help with this, or you can have an LLM read every paper on PubMed. Either way, you’re looking for direct associations that other teams have carefully identified and validated. It’s barely a model, but for the purposes of this post, it counts.
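As a toy sketch of that idea, the Knowledge approach can be as simple as counting independent sources behind each protein–disease link. The records below are made up for illustration; in practice they would come from a curated resource like ChEMBL or Open Targets, or from LLM-based extraction over PubMed:

```python
# A minimal sketch of a Knowledge "model": rank proteins by how many
# curated sources associate them with a disease. The records are
# hypothetical stand-ins for a real curated resource.
from collections import Counter

associations = [  # (protein, disease, source) -- illustrative only
    ("TNF",  "rheumatoid arthritis", "paper:A"),
    ("TNF",  "rheumatoid arthritis", "paper:B"),
    ("IL6",  "rheumatoid arthritis", "paper:C"),
    ("JAK1", "psoriasis",            "paper:D"),
]

def rank_targets(disease, records):
    """Count independent sources supporting each protein-disease link."""
    counts = Counter(p for p, d, _ in records if d == disease)
    return counts.most_common()

print(rank_targets("rheumatoid arthritis", associations))
# [('TNF', 2), ('IL6', 1)]
```

It's barely more than a filter and a count, which is exactly the point: the value is in the curation, not the math.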
The next thing to try is a Mechanistic model of signaling pathways: How proteins and genes regulate each other. By directly modeling the mechanism by which one protein regulates a gene which regulates another and another, you can find proteins or genes that indirectly regulate the one you care about. You might start from the proteins that you found with the Knowledge model, or you might start somewhere else. Either way, you’re directly modeling the biological mechanism of protein regulation.
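The core of that mechanistic reasoning can be sketched as a graph walk. Here the regulation edges are entirely hypothetical, not a real pathway; the point is only the shape of the computation:

```python
# Sketch of the mechanistic idea: represent a signaling pathway as a
# directed graph (regulator -> target) and walk it backwards to find
# proteins that indirectly regulate a disease-associated gene.
from collections import deque

regulates = {            # regulator -> genes/proteins it regulates (hypothetical)
    "A": ["B"],
    "B": ["C"],
    "D": ["C"],
    "C": ["E"],
}

def upstream_regulators(target, edges):
    """BFS over reversed edges: every node with a regulatory path to `target`."""
    reverse = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        for reg in reverse.get(node, []):
            if reg not in seen:
                seen.add(reg)
                queue.append(reg)
    return seen

print(sorted(upstream_regulators("E", regulates)))  # ['A', 'B', 'C', 'D']
```

Real pathway models layer on direction of effect, edge weights, and dynamics, but the backbone is still explicit mechanism: who regulates whom.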
Finally, for a Black box approach, you might use a population genetics model to identify proteins that increase or decrease disease prevalence when they’re knocked out by gene variants. These genome-wide association studies (GWAS) are essentially just linear models. Or you might run an RNA-seq experiment on a model of the disease state to directly measure which genes are up- or down-regulated when the disease is present. Either way, you’re looking for correlations without directly modeling the underlying biology.
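The "just a linear model" claim is worth making concrete. A minimal sketch with synthetic numbers: regress a phenotype score on minor-allele counts and keep variants with a strong slope. Real GWAS adds covariates, logistic models for case/control data, and multiple-testing correction:

```python
# GWAS at its core: for each variant, fit a linear model of phenotype
# vs. allele count. All numbers below are synthetic.
def slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

genotypes = [0, 1, 2, 0, 2, 1]               # minor-allele counts per subject
phenotype = [0.1, 0.9, 2.1, 0.0, 1.8, 1.1]   # disease severity score

beta = slope(genotypes, phenotype)
print(round(beta, 2))  # effect size per extra copy of the allele
```

Note there's no biology anywhere in that code: the variant is just a column of numbers that happens to correlate with the phenotype.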
We can see a similar split in generative chemistry, where most approaches are either Mechanistic or Black box. Axiom, which I wrote about a few weeks ago, uses a Black box model to predict toxicity by looking for a specific phenotype. The model doesn’t know why that’s the right phenotype. It just knows that it correlates with toxicity in a large dataset of past assays.
Meanwhile, Schrödinger, arguably the biggest player in this space, mostly uses Mechanistic models of how small molecules bind to a target protein. It can tell you exactly how your molecule will bind to individual proteins that cause toxicity. But it can’t measure the indirect effects that can still show up in Axiom’s model.
Knowledge models don’t make a lot of sense directly for anything generative, since by definition they’re only looking at things that were previously studied. But Axiom did use a Knowledge model to build the training dataset for their Black box model, by identifying a large library of compounds that previously passed or failed human tox.
Since everyone’s excited about LLMs these days, I should mention that anything involving an LLM is probably a Knowledge model. You shouldn’t trust an LLM to reason about a mechanistic model of biology (or anything else for that matter). And I definitely wouldn’t trust them to do black box statistics. But they’re pretty good at extracting data from unstructured information.
What’s interesting about these three categories is that while there are ways to use different kinds of models together, combining them tends to be a post-processing step rather than a fundamental integration.
For target identification, you can merge the lists that come from the three different models, and maybe rank them based on how many models predicted each protein. You can do the same thing for predicting toxicity of generated compounds. But I can’t think of any examples that merge different kinds of models earlier in the process.
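That post-processing merge is simple enough to sketch in a few lines. The target lists here are hypothetical; each model gets one vote per protein:

```python
# Post-processing merge: combine target lists from the three model types
# and rank proteins by how many models nominated them. Lists are made up.
from collections import Counter

knowledge   = ["TNF", "IL6", "JAK1"]
mechanistic = ["TNF", "MAPK1", "IL6"]
black_box   = ["TNF", "STAT3"]

votes = Counter()
for hits in (knowledge, mechanistic, black_box):
    votes.update(set(hits))  # one vote per model, not per occurrence

ranked = votes.most_common()
print(ranked[:2])  # [('TNF', 3), ('IL6', 2)]
```

The simplicity is the tell: the models only ever meet at the very end, as lists of names.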
I don’t know how significant that is, but it seems like a missed opportunity.
Either way, when you’re assessing AI/ML applications in pharma and biotech, I think this categorization can be a useful way to understand a given approach’s capabilities and limitations.
Thanks for reading Scaling Biotech!
In between writing these posts, I’m developing a resource to help pharma and biotech leaders quickly assess the risk and ROI of opportunities to apply GenAI in drug discovery. If you’d like to be among the first to check it out when it’s ready, fill out the form at merelogic.net.