Can you trust AI to clean your data?

Dec 04, 2024

*** I have another Sponsored Highlight at the bottom of this week’s posts, this time for Sphinx Bio, a tool that has been pushing the boundaries of AI for biotech. I recently hosted a webinar with Sphinx’s Nicholas Larus-Stone where we discussed this. You can find the recording here. Or to learn more about what they do, read the highlight below. ***

Over the last few weeks, I’ve been exploring different use cases of LLMs in early discovery biotech based on the following framework: I’m looking for things LLMs can do that are hard for non-expert users to do, but relatively easy for non-experts to sanity check. And given where LLMs are today, the sanity check is the more important of the two.

Two posts back, I explored what this looks like for data analysis/computational biology. This week, I want to explore a use case that fits this model even better: Cleaning up and consistently reformatting data.

The reason this one fits the model so well is that a lot of data cleaning is enforcing relatively simple rules: Rename certain columns. Map values from one vocabulary to another. Pivot or un-pivot a column or two.
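To make those three rules concrete, here is a minimal sketch in plain Python. The tiny dataset, the column names and the compound vocabulary are all made up for illustration, and the helper functions are hypothetical stand-ins for whatever conventional tool you would actually use.

```python
# Hypothetical assay export; column names and values are invented for illustration.
rows = [
    {"Sample ID": "S1", "Cmpd": "A", "conc_1uM": 0.91, "conc_10uM": 0.42},
    {"Sample ID": "S2", "Cmpd": "B", "conc_1uM": 0.88, "conc_10uM": 0.35},
]

def rename_columns(rows, mapping):
    """Rename columns per mapping; columns not in the mapping keep their names."""
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

def map_values(rows, column, vocabulary):
    """Translate values in one column; values not in the vocabulary pass through."""
    return [{**row, column: vocabulary.get(row[column], row[column])} for row in rows]

def unpivot(rows, id_columns, var_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    out = []
    for row in rows:
        ids = {k: row[k] for k in id_columns}
        for k, v in row.items():
            if k not in id_columns:
                out.append({**ids, var_name: k, value_name: v})
    return out

cleaned = rename_columns(rows, {"Sample ID": "sample_id", "Cmpd": "compound"})
cleaned = map_values(cleaned, "compound", {"A": "aspirin", "B": "ibuprofen"})
cleaned = unpivot(cleaned, ["sample_id", "compound"], "condition", "signal")
```

Each function is a few lines and behaves the same way every time it runs. None of this is hard to write; the tedious part is working out which mapping and which id columns each dataset needs.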

The hard part of this is looking at a dataset and deciding which of these operations to apply. In fact, that isn’t even necessarily that hard for a single dataset - it’s just tedious. And when you have lots of datasets - say a decade’s worth of data in arbitrary formats that your company has been stashing in random folders in SharePoint because no one felt empowered to define a consistent process - well, then it becomes overwhelmingly difficult.

So when you bring a hammer like an LLM to a nail like this one, the point isn’t to have the LLM apply the reformatting. You want those transformations to be 100% consistent, repeatable and reliable - three things that LLMs are not. Plus, sanity checking the work would require scanning the entire dataset, which kind of defeats the purpose.

On the other hand, there are already 100% consistent, repeatable and reliable conventional tools for applying the reformatting transformations. What these tools don’t do is help you decide what transformations to apply.

And that’s exactly where LLMs make sense: If an LLM can scan the datasets and propose simple rules for how they can be transformed into a consistent format, it’s relatively easy for a person to sanity check just the rules, then use conventional tools to apply them.
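As a rough sketch of that division of labor, the LLM's output could be a short list of rules in JSON that a person reads, a validator checks against the dataset, and ordinary code applies. The rule schema here (`op`, `from`, `to`, `column`, `mapping`) is a hypothetical format invented for this example, not a standard, and the data is made up.

```python
import json

# A rule plan as an LLM might propose it. The schema is an assumption
# made for this sketch; a real plan could look quite different.
proposed = json.loads("""
[
  {"op": "rename", "from": "Cmpd", "to": "compound"},
  {"op": "map_values", "column": "compound",
   "mapping": {"A": "aspirin", "B": "ibuprofen"}}
]
""")

# Only these operations, with these required fields, are accepted.
ALLOWED_OPS = {"rename": {"from", "to"}, "map_values": {"column", "mapping"}}

def validate(rules, columns):
    """Return a list of problems for the reviewer; an empty list means well-formed."""
    problems = []
    columns = set(columns)
    for i, rule in enumerate(rules):
        op = rule.get("op")
        if op not in ALLOWED_OPS:
            problems.append(f"rule {i}: unknown op {op!r}")
            continue
        missing = ALLOWED_OPS[op] - set(rule)
        if missing:
            problems.append(f"rule {i}: missing fields {sorted(missing)}")
            continue
        source = rule["from"] if op == "rename" else rule["column"]
        if source not in columns:
            problems.append(f"rule {i}: column {source!r} not in dataset")
        elif op == "rename":
            # Track the rename so later rules can refer to the new name.
            columns.discard(rule["from"])
            columns.add(rule["to"])
    return problems

def apply_rules(rows, rules):
    """Apply validated rules in order, deterministically."""
    for rule in rules:
        if rule["op"] == "rename":
            old, new = rule["from"], rule["to"]
            rows = [{(new if k == old else k): v for k, v in r.items()} for r in rows]
        elif rule["op"] == "map_values":
            col, mapping = rule["column"], rule["mapping"]
            rows = [{**r, col: mapping.get(r[col], r[col])} for r in rows]
    return rows

rows = [{"Cmpd": "A", "signal": 0.91}, {"Cmpd": "B", "signal": 0.88}]
problems = validate(proposed, rows[0].keys())
result = apply_rules(rows, proposed) if not problems else rows
```

The reviewer only has to read the two rules, not scan every row. A plan that uses an operation outside the allowed set, or names a column that doesn't exist, is flagged before any data is touched.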

The LLM does what it’s good at, and leaves what it’s bad at to something else.

I think this is a good model for many applications of LLMs: The LLM creates an intermediate structured *something* that a non-expert user can review, then pass to (more reliable) conventional tools to act on. There are a number of things that fit this pattern. I’ll explore some that I’m aware of in upcoming posts. (I’m sure there are a lot more I haven’t thought of.) Then, I’ll move on to explore other models for LLMs and generative AI in biotech.

Scaling Biotech

Sponsored Highlight: Sphinx Bio