Can you trust AI to clean your data?
*** I have another Sponsored Highlight at the bottom of this week’s post, this time for Sphinx Bio, a tool that has been pushing the boundaries of AI for biotech. I recently hosted a webinar with Sphinx’s Nicholas Larus-Stone where we discussed this. You can find the recording here. Or to learn more about what they do, read the highlight below. ***
Over the last few weeks, I’ve been exploring different use cases of LLMs in early discovery biotech based on the following framework: I’m looking for things LLMs can do that are hard for non-expert users to do, but relatively easy for non-experts to sanity check. And given where LLMs are today, the sanity check is the more important of the two.
Two posts back, I explored what this looks like for data analysis/computational biology. This week, I want to explore a use case that fits this model even better: Cleaning up and consistently reformatting data.
The reason this one fits the model so well is that a lot of data cleaning is enforcing relatively simple rules: Rename certain columns. Map values from one vocabulary to another. Pivot or un-pivot a column or two.
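To make "relatively simple rules" concrete, here is a minimal sketch of those three operations in plain Python. The column names, vocabulary and sample values are made up for illustration, and the helper functions are hypothetical stand-ins for whatever conventional tool you already use (a dataframe library would do the same job in fewer lines).

```python
def rename_columns(rows, mapping):
    """Return new rows with column names replaced according to mapping."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]


def map_values(rows, column, vocabulary):
    """Translate one column's values; unknown values pass through unchanged."""
    return [
        {**row, column: vocabulary.get(row[column], row[column])}
        for row in rows
    ]


def unpivot(rows, id_columns, value_columns, var_name, value_name):
    """Turn wide value columns into one (name, value) pair per output row."""
    out = []
    for row in rows:
        for col in value_columns:
            record = {k: row[k] for k in id_columns}
            record[var_name] = col
            record[value_name] = row[col]
            out.append(record)
    return out


# Illustrative input only: two samples, two replicate columns.
raw = [
    {"Sample ID": "S1", "Tx": "veh", "rep1": 0.91, "rep2": 0.88},
    {"Sample ID": "S2", "Tx": "cmpd A", "rep1": 0.42, "rep2": 0.47},
]

renamed = rename_columns(raw, {"Sample ID": "sample_id", "Tx": "treatment"})
mapped = map_values(renamed, "treatment", {"veh": "vehicle", "cmpd A": "compound_a"})
tidy = unpivot(mapped, ["sample_id", "treatment"], ["rep1", "rep2"], "replicate", "signal")

for record in tidy:
    print(record)
```

Each function is a few lines and does exactly the same thing every time it runs. None of this is hard to write; the work is in knowing which mapping and which columns to hand it.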
The hard part of this is looking at a dataset and deciding which of these operations to apply. In fact, that isn’t even necessarily that hard for a single dataset - it’s just tedious. And when you have lots of datasets - say a decade’s worth of data in arbitrary formats that your company has been stashing in random folders in SharePoint because no one felt empowered to define a consistent process - well, then it becomes overwhelmingly difficult.
So when you bring a hammer like an LLM to a nail like this one, the point isn’t to have the LLM apply the reformatting. You want those transformations to be 100% consistent, repeatable and reliable - three things that LLMs are not. Plus, sanity checking the work would require scanning the entire dataset, which kind of defeats the purpose.
On the other hand, there are already 100% consistent, repeatable and reliable conventional tools for applying the reformatting transformations. What these tools don’t do is help you decide what transformations to apply.
And that’s exactly where LLMs make sense: If an LLM can scan the datasets and propose simple rules for how they can be transformed into a consistent format, it’s relatively easy for a person to sanity check just the rules, then use conventional tools to apply them.
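One way to picture the review step is sketched below. It assumes the LLM is asked to return its proposed rules as a small JSON list. The rule format, the operation names and the `check_proposal` and `describe` helpers are all hypothetical, not the format of any particular tool. The code never asks the model to touch the data: it only checks that each rule uses a known operation and refers to columns that exist, then prints the rules in plain English for a person to approve.

```python
import json

# Hypothetical whitelist: the only operations the downstream tool will run.
ALLOWED_OPS = {"rename", "map_values", "unpivot"}


def referenced_columns(rule):
    """List the existing columns a rule claims to operate on."""
    if rule["op"] == "rename":
        return list(rule.get("mapping", {}))
    if rule["op"] == "map_values":
        return [rule.get("column")]
    return list(rule.get("value_columns", []))


def check_proposal(proposal_json, columns):
    """Parse proposed rules and return (rules, problems) for review."""
    rules = json.loads(proposal_json)
    known = set(columns)
    problems = []
    for i, rule in enumerate(rules):
        op = rule.get("op")
        if op not in ALLOWED_OPS:
            problems.append(f"rule {i}: unsupported op {op!r}")
            continue
        missing = [c for c in referenced_columns(rule) if c not in known]
        if missing:
            problems.append(f"rule {i}: unknown columns {missing}")
        if op == "rename":
            # Later rules see the new names, so track them as we go.
            for old, new in rule.get("mapping", {}).items():
                known.discard(old)
                known.add(new)
    return rules, problems


def describe(rule):
    """Render one rule as a sentence a non-expert can sanity check."""
    op = rule.get("op")
    if op == "rename":
        pairs = ", ".join(f"{old} -> {new}" for old, new in rule["mapping"].items())
        return f"Rename columns: {pairs}"
    if op == "map_values":
        pairs = ", ".join(f"{old} -> {new}" for old, new in rule["vocabulary"].items())
        return f"In column {rule['column']}, replace values: {pairs}"
    if op == "unpivot":
        return f"Stack columns {', '.join(rule['value_columns'])} into one column"
    return f"UNSUPPORTED: {op}"


# Illustrative proposal, written by hand to stand in for model output.
proposal = """
[
  {"op": "rename", "mapping": {"Sample ID": "sample_id", "Tx": "treatment"}},
  {"op": "map_values", "column": "treatment", "vocabulary": {"veh": "vehicle"}},
  {"op": "drop_outliers", "threshold": 3}
]
"""

rules, problems = check_proposal(proposal, ["Sample ID", "Tx", "rep1", "rep2"])
for rule in rules:
    print(describe(rule))
for problem in problems:
    print("PROBLEM:", problem)
```

The reviewer reads three short lines instead of scanning every row, and anything outside the whitelist (here, the invented `drop_outliers` rule) is flagged rather than run. Only approved rules would then be handed to the deterministic tool.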
The LLM does what it’s good at, and leaves what it’s bad at to something else.
I think this is a good model for many applications of LLMs: The LLM creates an intermediate structured *something* that a non-expert user can review, then pass to (more reliable) conventional tools to act on. There are a number of things that fit this pattern. I’ll explore some that I’m aware of in upcoming posts. (I’m sure there are a lot more I haven’t thought of.) Then, I’ll move on to explore other models for LLMs and generative AI in biotech.
Sponsored Highlight: Sphinx Bio
AI is making previously intractable problems tractable and unreasonably time-consuming tasks routine across biotech. Two weeks ago, I sat down with Nicholas Larus-Stone from Sphinx Bio to explore how their AI agent, Metis, is putting these capabilities at their users’ fingertips.
Sphinx allows biotech teams to radically improve the way they work by applying AI in three key areas:
Managing semi-structured data: Metis proposes transformations that will turn new datasets into a pre-defined, standard form ready for analysis.
Clearing the analysis bottleneck: Metis helps non-computational biologists define analysis and visualizations in a structured form that they, or the computational team, can review, modify and reuse.
Aggregating diverse datasets: Metis automatically detects datasets with similar schemas and merges them, allowing teams to easily leverage data across multiple experiments.
Sphinx’s Metis does more than just give teams their time back: It allows them to dive deeper into the data and focus on the things that actually move the needle. To learn more, check out sphinxbio.com.