*** A quick note: I’m hosting a webinar tomorrow, with Harry Rickerby from Briefly Bio, on January 16, 2025 at 2pm EST. We’ll be talking about Data you can trust: How Briefly helps biotech labs generate consistent, reproducible data. You can sign up here. ***
In last week’s post, I argued that LLMs/AI actually give us more reason to capture source data, including lab metadata such as protocols and process logs, in a structured form rather than the more common free text (or nothing). This week, I want to dig into why capturing structured data at the source is so difficult. In particular, I want to explore an idea that feels kind of like the bias/variance tradeoff in statistics, though that may be more vibes than an actual connection.
Here’s my claim: The more detail your schema captures, the more ambiguity there will be about how to capture it.
The rest of the post explains what I mean and why I think this happens.
For the purposes of this week’s post, I’ll use the word “schema” to mean a set of rules and conventions for capturing information/data. If you’re capturing it in one or more tables, then the schema is the set of columns, the data types of those columns, the rules for what’s allowed in each column, etc. If you’re working with a nested data format like JSON, it’s the fields and subfields, the rules for how they’re formatted, etc.
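To make that concrete, here’s a minimal sketch (with invented field names) of how the same hypothetical assay record might be captured under a flat, table-style schema versus a nested, JSON-style one:

```python
# Hypothetical qPCR assay record; all field names are illustrative.

# Flat, table-style schema: one row per replicate, fixed columns.
flat_rows = [
    {"assay_id": "A-001", "assay_type": "qPCR", "replicate": 1, "ct_value": 21.3},
    {"assay_id": "A-001", "assay_type": "qPCR", "replicate": 2, "ct_value": 21.5},
    {"assay_id": "A-001", "assay_type": "qPCR", "replicate": 3, "ct_value": 21.4},
]

# Nested, JSON-style schema: one document per assay, replicates as a subfield.
nested_record = {
    "assay_id": "A-001",
    "assay_type": "qPCR",
    "replicates": [
        {"replicate": 1, "ct_value": 21.3},
        {"replicate": 2, "ct_value": 21.5},
        {"replicate": 3, "ct_value": 21.4},
    ],
}
```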
The fun thing about schemas is that if you think about them long enough, it starts to feel like there’s a fundamental inevitability to them. Like if you could understand the underlying, abstract nature of the information you want to capture, a natural schema would present itself, and anyone else with a reasonable understanding of database design would come up with more or less the same design. At least, that’s always been my intuition, and it seems to be what the folks who invented database normalization were trying to codify.
But to understand the underlying, abstract nature of the information, we have to write down assumptions and put things in clear categories. And it turns out, the real world will always find a way to break your assumptions and categories.
For example, to create an assay/experiment schema, you start with a list of the types of experiments that your lab runs, and create rules for capturing the metadata for each type. The lab team tells you that the QPCR assay always includes three technical replicates, so you bake that into the QPCR experiment table. Then the next day, they tell you they ran an experiment with four replicates. And the day after that, they tell you about a new assay that combines elements of QPCR and dPCR. Then it’s more time points, new measurements… you get the idea.
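As a sketch of how those assumptions get baked in (again with made-up field names): if the experiment table hard-codes exactly three replicate columns, a four-replicate run simply has nowhere to go.

```python
# Hypothetical qPCR experiment table that hard-codes three technical replicates.
QPCR_COLUMNS = ["experiment_id", "sample_id", "rep1_ct", "rep2_ct", "rep3_ct"]

def to_row(experiment_id: str, sample_id: str, ct_values: list[float]) -> dict:
    if len(ct_values) != 3:
        # The real world breaking the schema's assumption.
        raise ValueError(f"schema expects exactly 3 replicates, got {len(ct_values)}")
    return dict(zip(QPCR_COLUMNS, [experiment_id, sample_id, *ct_values]))

to_row("EXP-042", "S-17", [21.3, 21.5, 21.4])          # fits the schema
# to_row("EXP-043", "S-18", [21.3, 21.5, 21.4, 21.6])  # fails: no column for replicate 4
```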
As you add to the schema to try to capture all these special cases and new details, you run into a new problem: Sometimes that new assay gets classified as a QPCR assay and sometimes it goes in the new category. With all the new fields you added to handle extra replicates and time points, there are now multiple ways to encode that information (and the scientists are using all of them). And of course, there’s a whole new list of corner cases that you still have to deal with.
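To see the ambiguity concretely (still with invented fields): suppose the schema grows an extra_replicates field alongside the original three replicate columns. The same four-replicate run can now be recorded in more than one perfectly defensible way.

```python
# The same four-replicate qPCR run, encoded two equally reasonable ways
# against an expanded (hypothetical) schema.

# Encoding A: first three values in the legacy columns, the rest in the new field.
run_encoding_a = {
    "experiment_id": "EXP-043",
    "assay_type": "qPCR",
    "rep1_ct": 21.3, "rep2_ct": 21.5, "rep3_ct": 21.4,
    "extra_replicates": [21.6],
}

# Encoding B: leave the legacy columns empty and put everything in the new field.
run_encoding_b = {
    "experiment_id": "EXP-043",
    "assay_type": "qPCR",
    "rep1_ct": None, "rep2_ct": None, "rep3_ct": None,
    "extra_replicates": [21.3, 21.5, 21.4, 21.6],
}
```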
So we’re left with this tradeoff: Either keep things simple so that there’s always a single, canonical way to encode (limited) information into the schema, or capture more of the details of reality and accept that there will often be multiple, equally reasonable, ways to encode the same information.
Keeping things simple has the benefit that it takes the pressure off whoever’s entering the data to make decisions about how to encode it. Bench scientists don’t want to make mistakes, so if there’s ambiguity they often just don’t enter the data at all. It also makes it less likely that whoever’s analyzing the data will misinterpret different encodings.
But with LLMs, I think things swing back towards more complete schemas, since LLMs can help manage more complex encoding and can help interpret the data for downstream analysis. They can manage the ambiguity on both ends, and they make the structured data much more valuable than unstructured information.
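Here’s a rough sketch of what that could look like on the data-entry side. Everything here is assumed for illustration: call_llm is a stand-in for whatever model API you actually use, and the schema hint is invented.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; wire this up to your provider of choice."""
    raise NotImplementedError

# Hypothetical canonical schema the model is asked to target.
SCHEMA_HINT = """
Return JSON with fields: experiment_id (string), assay_type (string),
replicate_ct_values (list of numbers), notes (string).
Put ALL replicate values in replicate_ct_values, in run order.
"""

def normalize_lab_note(lab_note: str) -> dict:
    """Ask the model to map a free-text lab note onto one canonical encoding."""
    prompt = f"{SCHEMA_HINT}\n\nLab note:\n{lab_note}"
    return json.loads(call_llm(prompt))

# Example (needs a working call_llm):
# normalize_lab_note("qPCR on sample S-18, four technical reps: 21.3, 21.5, 21.4, 21.6")
```

The same idea can work in reverse at analysis time: given the schema and its conventions, a model can help figure out which of the multiple encodings a given record is using.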
In other words, LLMs don’t make the tradeoff go away. But they do change the calculus.
I like how this post also touches on a big issue in data capture in the Life Sciences: tribal knowledge. Different domains in science view things in slightly different ways, so while a biologist might think metadata field A means one thing, a chemist might view it as something different. There’s also the issue in schema design when the person who initially develops it leaves the company and someone new comes in who doesn’t understand why the schema was designed the way it was.
I think that LLMs can definitely help here, but it will be interesting to see how different domain experts prime LLMs in different ways. Grounding LLMs with clearly defined rules and documentation certainly helps, but depending on how the data is captured, an out-of-the-box model might extract and organize it differently depending on which group in the company generated it.