Can AI structure the data for you?
*** Two quick notes:
I’m hosting a third webinar, this time with Harry Rickerby from Briefly Bio, on January 16, 2025 at 2pm EST. We’ll be talking about “Data you can trust: How Briefly helps biotech labs generate consistent, reproducible data.” You can sign up here.
This Thursday, I’m doing a fireside chat with the folks from Invert at a meetup in Cambridge, MA. If you’re in the Boston area, come say “hi”. You can register here.
***
Happy new year! I want to kick off 2025 by diving right back into likely use cases and impacts of LLMs on early discovery biotech data. In particular, this week I want to explore how LLMs can make it easier to capture and structure data that would otherwise be effectively lost.
By “effectively lost,” I mean information that is recorded in a form where someone could theoretically use it, but only through great time and effort. Think emails, free text lab notes, decisions captured in slide decks...
The great thing about LLMs is that, in theory at least, they have the time and energy to make this information usable. But there are two very different conclusions you can draw from this, and I think only one of them is right.
The first possible conclusion is that we don’t need to worry about carefully structuring data any more. Just throw it all into a big pile and let the LLM sort it out. Forget databases, or even spreadsheets. Let 2025 be the year of stream of consciousness data capture.
But that is very much NOT the conclusion that I want you to draw. And the reason is simple: LLMs are good at interpreting unstructured information, but they’re not perfect. Remember, my rule is that LLMs should only be used in situations where a person can verify their work. If they’re sorting through and interpreting gigabytes of unstructured data, that just isn’t possible.
So the second conclusion you can draw, and the one I want you to draw, is almost the opposite of the first: We should be using LLMs to collect more structured data and less unstructured data.
The theory has always been that it’s easier to clean data upstream (where it’s collected) than downstream (where it’s used). Upstream you have more context, and you can focus on a small amount of data at a time. Downstream there’s too much data to spend much time on any one item, and you have no way to interpret ambiguous or conflicting entries.
In practice, however, cleaning (or structuring) data upstream is really hard because the folks collecting the data usually have bigger things on their minds than translating the complex ground truth into a clean relational schema. Wet lab biologists trying to keep their cells alive. Screening teams hoping the robot won’t flip another plate. You’re lucky to get free text.
LLMs change the situation because they can do the work of translating the complex ground truth into a structured form while the person who knows the ground truth is still there to verify it. In fact, if you play your cards right, you can get the LLM to actually make their job easier. LLMs are even comfortable with schemas that allow the kind of flexibility the science needs but are too complex for a person to deal with by hand. Think JSON instead of CSV.
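To make this concrete, here’s a minimal sketch of what verification-in-the-loop might look like. Everything here is illustrative: the field names, the schema, and the idea that a model has already turned a free-text lab note into JSON are all assumptions, not a real lab system.

```python
import json

# Illustrative schema for a structured lab-note record. The nested
# "observations" list is the kind of flexibility JSON allows but a
# flat CSV column can't easily express.
SCHEMA_REQUIRED = {"cell_line": str, "passage": int, "observations": list}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes.
    Anything flagged here would go back to the scientist for review
    while the ground truth is still fresh in their mind."""
    problems = []
    for field, expected_type in SCHEMA_REQUIRED.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems

# Pretend this JSON came back from a model asked to structure the
# free-text note "HEK293 p12, ~80% confluent at 9:30".
llm_output = json.loads(
    '{"cell_line": "HEK293", "passage": 12,'
    ' "observations": [{"time": "09:30", "note": "80% confluent"}]}'
)
print(validate_record(llm_output))  # [] means clean; anything else gets shown to the scientist
```

The point of the sketch is where the check happens: at capture time, with the person who wrote the note still in the loop, rather than months later during analysis.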
So I think the right model for involving AI in data capture isn’t to hope the LLM can sort things out when it’s time for analysis (though that’s better than nothing). It’s to integrate LLMs into the places where data is captured, to make sure it’s captured right.
This strategy still faces operational and social headwinds, just as any attempt to improve upstream data collection does. But at least it has a real technical advantage over the tools we had a few years ago.