Can AI solve drug repurposing?
Nature Medicine recently published an article called A foundation model for clinician-centered drug repurposing that caught my attention because its use of the term “foundation model” is slightly outside the usual definition. In particular, it isn’t a Large Language Model (LLM), nor does it even use a transformer. But it does use some interesting techniques, including some of the same pieces as LLMs. So this week, I want to explain what the model, called TxGNN, does and how it works. My goal is to make this accessible for less technical readers.
In between writing these posts, I’m developing a resource I call AI Opportunity Cards to help pharma and biotech leaders quickly assess the risk and ROI of opportunities to apply AI in drug discovery. I just published a prototype, which you can check out here.
The authors of the paper claim that TxGNN can predict which rare diseases a compound will treat (an indication) or make worse (a contraindication) with significantly higher accuracy than benchmark models, even if there are no known treatments for the disease. As with any model like this, there’s an important question about how well this can practically fit into a drug development program. (That’s why I created AI Opportunity Cards.) But I’m going to gloss over that for this post and focus on how the model works.
The training data for the model is a knowledge graph called the Precision Medicine Knowledge Graph (PrimeKG) which is available to download. A knowledge graph is essentially a curated database of concepts (nodes) and links between the concepts. For example, this one includes concepts like disease, drug, protein, gene, etc. Links come from a pre-defined set of relationship types, such as which proteins regulate which genes and, more importantly for our context, which drugs treat which diseases.
The TxGNN model uses this knowledge graph to create an embedding of the concepts/nodes that encodes the relationships between them based on their positions. This general approach is called a Graph Neural Network (GNN), though GNN is a very general concept and TxGNN makes specific decisions about how to apply it.
The idea of an embedding is that we pick some high-dimensional vector space and assign a vector to each concept in the knowledge graph. There are two ways to think about this, both of which are useful (if slightly wrong):
The first is the numeric approach in which we think of each vector as a sequence of numbers, separated by commas. We can think of each slot between the commas as representing some underlying idea that the model identifies during training. If a concept/node has a value in that slot, then it’s somehow related to the idea. The bigger the number, the more it’s related.
One slot might mean “this concept has something to do with the liver” or “this concept affects metabolism” but really each slot means something much more complex and unknowable. It’s just nice to pretend.
The second way of thinking about the embedding vectors is geometrically. These are very high dimensional vectors and as humans we can only visualize up to three. (Never trust someone who says they can visualize higher dimensions.) So we’re going to think of the vectors as being in lower dimensions - two or three - for the sake of intuition. Again, it’s nice to pretend.
For every disease, drug, protein, gene, etc. the model assigns it a string of numbers separated by commas and we pretend that it’s a point in the plane or in three-dimensional space. Not so bad, right?
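To make both views concrete, here’s a toy version in code. The concept names and numbers are entirely made up for illustration; a real model like TxGNN assigns each node a vector with hundreds of dimensions.

```python
# A toy embedding: each concept gets a short tuple of numbers (the numeric view).
# Two dimensions instead of hundreds, purely so we can pretend it's a point
# in the plane (the geometric view). All values here are hypothetical.
embedding = {
    "aspirin":      (0.8, 0.1),
    "inflammation": (0.7, 0.3),
    "TP53":         (-0.2, 0.9),
}

# Numeric view: the second slot of "TP53" holds the value 0.9.
slot_value = embedding["TP53"][1]

# Geometric view: "aspirin" is the point (0.8, 0.1) in the plane.
point = embedding["aspirin"]
```

That’s really all an embedding is at this level: a lookup table from concepts to vectors, which training will gradually adjust.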
Next up: the relationships.
The model encodes the relationships between concepts as geometric transformations, something like a rotation in our two- or three-dimensional vector space. Say the vector representing a protein lives somewhere in the plane and the relationship “protein regulates gene” is a 20 degree rotation. The protein “expects” any gene it regulates to be a 20 degree rotation away from the vector that represents it.
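Here’s that rotation written out as a minimal sketch. The 20 degree angle and the 2-D vectors are toy choices for intuition, not TxGNN’s actual parameterization (which learns relation-specific transformations in a high-dimensional space).

```python
import math

def rotate(vec, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle."""
    theta = math.radians(degrees)
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Hypothetical protein embedding, sitting on the x-axis for simplicity.
protein = (1.0, 0.0)

# Where the protein "expects" any gene it regulates to be:
expected_gene = rotate(protein, 20)
```

Every relationship type gets its own transformation, so “drug treats disease” would be a different rotation than “protein regulates gene.”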
Unfortunately, any gene that the protein regulates can’t be exactly there because it’s also regulated by other proteins that expect it to be somewhere else. Plus there are all the other relationships that define other places where the gene is “expected” to be.
So the model handles this by averaging all the places where the gene (or any other concept) is “expected” to be, then trying to put it there. I say “try” because the gene is also telling other concepts/vectors where they should be, so if the model moves the gene, it has to move everything else. The training process is basically a matter of finding the best compromise possible, given these constraints.
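The averaging step itself is simple enough to sketch directly. The expected positions below are made-up toy numbers standing in for “where each relationship says the gene should be.”

```python
def average_position(expected_positions):
    """Average a list of 2-D points coordinate by coordinate."""
    n = len(expected_positions)
    return tuple(sum(coords) / n for coords in zip(*expected_positions))

# Three relationships each "expect" the gene somewhere slightly different
# (hypothetical values for illustration).
expected = [(0.9, 0.3), (1.1, 0.1), (1.0, 0.2)]

# The compromise position: the average of all the expectations.
gene = average_position(expected)  # approximately (1.0, 0.2)
```

In the real training process this happens for every node at once, over and over, which is why moving one vector ripples through all the others.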
The point of all this is that if you can get all the concepts to be close to where the other concepts expect them to be, then you can start to predict relationships that aren’t explicitly represented in the graph. If you know that rotating a protein 20 degrees will bring you to a point that’s pretty close to any gene it regulates, and you find genes near there that you didn’t know about, you might be tempted to predict that it regulates those genes too.
Well, recall that one of the relationships in the knowledge graph was “drug treats disease.” So you can use this idea to predict other diseases that any given drug will treat.
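Putting the pieces together, the prediction step boils down to: rotate the drug’s vector by the “treats” transformation, then look for disease vectors near the result. Everything here (the disease names, the embeddings, the angle, the distance threshold) is hypothetical, just to show the shape of the idea.

```python
import math

def rotate(vec, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle."""
    theta = math.radians(degrees)
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

TREATS_DEGREES = 20  # toy stand-in for the learned "treats" transformation

# Hypothetical trained embeddings.
drug = (1.0, 0.0)
diseases = {
    "disease_A": rotate((1.0, 0.0), 22),  # close to where "treats" points
    "disease_B": (-0.5, -0.8),            # far away
}

def predicted_indications(drug_vec, candidates, threshold=0.3):
    """Return candidate diseases whose vectors sit near the point
    the 'treats' rotation maps the drug to."""
    expected = rotate(drug_vec, TREATS_DEGREES)
    return sorted(
        name for name, vec in candidates.items()
        if math.dist(expected, vec) < threshold
    )

predictions = predicted_indications(drug, diseases)  # ["disease_A"]
```

Here disease_A sits close to where the “treats” rotation lands, so it comes back as a predicted indication even though no explicit “drug treats disease_A” link exists in the graph.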
There are reasons to be optimistic that this might work, beyond just the fact that AI seems to be magic. Recall that the vectors are meant to encode different complex and unknowable ideas. The reason AI seems magical is that when you force a model to figure out embeddings based on meaningful constraints, it tends to do so by secretly assigning meaningful (if still unknowable) ideas to those vector slots. And those ideas secretly power the predictions.
And sure enough, that seems to be what happens with TxGNN.
What I find even more interesting, though, is that TxGNN seems relatively simple for what it could be, at least compared to LLMs. For example, instead of encoding relationships with linear transformations, they could’ve used a multi-layer neural network. And they don’t employ anything like attention, which is the thing that makes LLMs work. So it seems like there’s still a lot of potential.
Of course, this is just an academic paper and there’s a big gap between improving prediction accuracy and getting novel drugs to market. But it seems promising enough to pay attention to. Watch this space.
Thanks for reading Scaling Biotech!