Metadata schemas are brittle
*** This week’s post has a Sponsored Highlight, for Briefly Bio, a tool that helps biotech lab teams expedite protocol handoffs and capture detailed lab metadata. I recently hosted a webinar with Briefly’s Harry Rickerby where we discussed this. You can find the recording here. Or to learn more about what they do, read the highlight below. ***
Last week, I argued that while it often feels like you should be able to create a schema that completely models any given body of information, in practice it’s effectively impossible to understand (in advance) the information that comes from a modern biotech lab well enough to actually do that. This week, just to drive the point home, I thought I’d walk through an example.
Imagine you’ve found yourself responsible for getting experiment metadata from the lab to the data science team at a small but fast-growing biotech startup. We’ll put aside plate maps, compound registries and a bunch of other complications to keep this really simple: the lab runs a single assay, and its readout is just a number between 0 and 1.
To run the assay, a bench scientist puts a certain concentration of a certain compound into a petri dish, waits a certain incubation time, then puts the dish into an instrument. The instrument makes a reading and generates a file named after the experiment ID. The file contains a single number: the readout.
The readout files somehow make their way to the data science team, but to interpret them, the data scientists need to know the compound, concentration and incubation time for each experiment. And sure, they could look that up in the ELN (electronic lab notebook). But there are ten different lab teams running this assay, and each one has its own way of entering that information in the ELN, which they aren’t even consistent about. The data scientists don’t have time to poke around the ELN for that metadata.
So you, being a resourceful and scrappy young data engineer, create a Google sheet with four columns: Experiment ID, Compound ID, Concentration (micromolar) and Incubation Time (hours). You somehow convince all ten bench teams to fill in their data, and the data science team writes a script that joins this with the data from readout files. Problem solved.
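The join script could be only a few lines. Here’s a minimal sketch using Python’s standard library; the experiment IDs, compound IDs and exact CSV layout are hypothetical stand-ins, not details from the story above:

```python
# Hypothetical sketch of the data science team's join script.
# All IDs, values and column layouts here are illustrative assumptions.
import csv
import io

# Metadata as the bench teams might enter it in the Google sheet.
metadata_csv = io.StringIO(
    "Experiment ID,Compound ID,Concentration (micromolar),Incubation Time (hours)\n"
    "EXP001,CMP042,10,24\n"
    "EXP002,CMP042,1,24\n"
)
metadata = {row["Experiment ID"]: row for row in csv.DictReader(metadata_csv)}

# Readouts parsed from the instrument files: the filename gives the
# experiment ID, the file contents give the number.
readouts = {"EXP001": 0.82, "EXP002": 0.17}

# Join: attach to each readout the metadata needed to interpret it.
combined = [
    {"Experiment ID": exp_id, "Readout": value, **metadata[exp_id]}
    for exp_id, value in readouts.items()
]
for row in combined:
    print(row["Experiment ID"], row["Readout"], row["Compound ID"])
```

The whole scheme works precisely because every experiment fits the same four-column shape; the trouble starts when that stops being true.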
Then, a week later, you find out that the bench teams are going to start using a different cell line for some of the experiments. No problem - just add a cell line column. But over the next few weeks, as this keeps happening, you start to get nervous about the number of columns this previously simple table has accumulated.
Then one day, someone from the lab mentions they’re going to start running experiments where they’ll add the compound at two different time points. So should you just add another column? If they add more time points later, you don’t want to keep adding columns. And what if they add different concentrations at each time point? Different compounds?
The “proper” way to model this would be to create additional tables for treatments and time courses, and all sorts of things like that. (Normalization.) But that’s not going to work with a Google sheet, at least not in a way that the bench teams would actually fill out. And since you don’t know which directions of complication the bench teams will need, it could end up being overkill anyway.
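For a sense of what that normalization would mean, here’s a sketch (with hypothetical table and column names) of splitting the one wide table into an experiments table and a treatments table, one row per treatment event:

```python
# Hypothetical normalized layout. Instead of one column per detail,
# each treatment event becomes its own row in a separate table.
experiments = [
    {"experiment_id": "EXP003", "cell_line": "HEK293", "readout": 0.61},
]

# Any number of compounds, concentrations and time points fit here
# as extra rows, with no new columns needed.
treatments = [
    {"experiment_id": "EXP003", "compound_id": "CMP042",
     "concentration_um": 10, "added_at_hours": 0},
    {"experiment_id": "EXP003", "compound_id": "CMP042",
     "concentration_um": 5, "added_at_hours": 24},
]
```

This shape handles the two-time-point experiment gracefully, but notice what it asks of a bench scientist: filling in a second, cross-referenced table mid-experiment, which is exactly what a spreadsheet workflow makes painful.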
So this is the problem that I described last week: if you make the schema complex enough to consistently capture details for all the different possibilities, it becomes too complex for a person who’s in the middle of running an experiment to actually fill out.
And as bad as this is, remember that we started out with as simple an example as possible. Real labs with plate maps, registries and a dozen other complications are much, much worse.
The potential for LLMs to address this problem is two-fold: First, they can help bench teams translate their intuitive understanding of the experiment into a more complex and flexible schema. And second, they can help the data science teams interpret metadata that is in a structured, but not necessarily consistent, form, allowing the “schema” to be more flexible while remaining (slightly) less complex.
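To make the second point concrete, here’s a sketch of what “structured but not necessarily consistent” metadata might look like: two teams record the same kind of experiment in different shapes, and a small reader reconciles them after the fact. Both record shapes are hypothetical.

```python
# Hypothetical metadata records in two different (but both structured)
# shapes, as two lab teams might produce them.
records = [
    # One team records a single treatment as flat fields.
    {"experiment_id": "EXP001", "compound_id": "CMP042",
     "concentration_um": 10, "incubation_hours": 24},
    # Another team records an explicit list of treatment events.
    {"experiment_id": "EXP003",
     "treatments": [
         {"compound_id": "CMP042", "concentration_um": 10, "added_at_hours": 0},
         {"compound_id": "CMP042", "concentration_um": 5, "added_at_hours": 24},
     ]},
]

def treatment_events(record):
    """Normalize either record shape into a list of treatment events."""
    if "treatments" in record:
        return record["treatments"]
    # Flat records describe one treatment added at time zero.
    return [{"compound_id": record["compound_id"],
             "concentration_um": record["concentration_um"],
             "added_at_hours": 0}]

for rec in records:
    print(rec["experiment_id"], len(treatment_events(rec)))
```

Today a data engineer writes that reconciling function by hand for every shape that shows up; the promise is that an LLM could do much of that interpretation, so the bench teams don’t all have to converge on one rigid schema up front.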
We’re not there yet, but to me that direction seems very promising.
Sponsored Highlight: Briefly Bio
Last week, I sat down with Harry Rickerby from Briefly Bio to explore how they’re using LLMs to help scientists capture structured lab protocols and logs. Briefly allows scientists to write free text protocols, which it converts to structured instructions. They can then add details, create checklists for the lab, and record changes that they made for each experiment.
We discussed three scenarios where capturing this data in a structured form makes a huge difference:
Protocol handoffs: When someone new joins the team, or another team needs to start running an assay you developed, training them to run the assay in a consistent way that produces comparable data can be a huge time commitment. Having detailed protocol documentation slashes that time and ensures more reliable results.
Experiment reproducibility: Often, the most important experiment that your lab does turns out to be one of dozens or hundreds of similar experiments that were indistinguishable at the time you ran them. So when you ran this experiment, you probably weren’t paying close attention. When you go back and run it again to verify the result, having a detailed log of what you did could be the difference between a quick verification and months spent trying to reproduce a magic number.
Metadata capture: Like I described above, capturing lab protocols and logs, aka metadata, in a pre-defined tabular schema is complex and brittle. Capturing it in a structured, but flexible form allows data scientists to determine a structure that works after the fact.
This addresses a lot of the problems that I like to write about on this newsletter, and I’m excited to see more labs using Briefly.