Metadata schemas are brittle
*** This week’s post has a Sponsored Highlight, for Briefly Bio, a tool that helps biotech lab teams expedite protocol handoffs and capture detailed lab metadata. I recently hosted a webinar with Briefly’s Harry Rickerby where we discussed this. You can find the recording here. Or to learn more about what they do, read the highlight below. ***
Last week, I argued that while it often feels like you should be able to create a schema that completely models any given body of information, in practice it’s effectively impossible to understand (in advance) the information that comes from a modern biotech lab well enough to actually do that. This week, just to drive the point home, I thought I’d walk through an example.
Imagine you’ve found yourself responsible for getting experiment metadata from the lab to the data science team at a small but fast-growing biotech startup. We’ll put aside plate maps, compound registries and a bunch of other complications to keep this really simple: the lab runs a single assay, and its readout is just a number between 0 and 1.
To run the assay, a bench scientist puts a certain concentration of a certain compound into a petri dish, waits a certain incubation time, then puts the dish into an instrument. The instrument makes a reading and generates a file named after the experiment ID. The file contains a single number: the readout.
The readout files somehow make their way to the data science team, but to interpret them, the data scientists need to know the compound, concentration and incubation time for each experiment. And sure, they could look that up in the ELN (electronic lab notebook). But there are ten different lab teams running this assay, and each one has its own way of entering that information in the ELN, which they aren’t even consistent about. The data scientists don’t have time to poke around the ELN for that metadata.
So you, being a resourceful and scrappy young data engineer, create a Google sheet with four columns: Experiment ID, Compound ID, Concentration (micromolar) and Incubation Time (hours). You somehow convince all ten bench teams to fill in their data, and the data science team writes a script that joins this with the data from readout files. Problem solved.
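The join script could be only a few lines. Here’s a minimal sketch using Python’s standard library; the experiment IDs, compound IDs and exact CSV layout are hypothetical stand-ins, not details from the story above:

```python
# Hypothetical sketch of the data science team's join script.
# All IDs, values and column layouts here are illustrative assumptions.
import csv
import io

# Metadata as the bench teams might enter it in the Google sheet.
metadata_csv = io.StringIO(
    "Experiment ID,Compound ID,Concentration (micromolar),Incubation Time (hours)\n"
    "EXP001,CMP042,10,24\n"
    "EXP002,CMP042,1,24\n"
)
metadata = {row["Experiment ID"]: row for row in csv.DictReader(metadata_csv)}

# Readouts parsed from the instrument files: the filename gives the
# experiment ID, the file contents give the number.
readouts = {"EXP001": 0.82, "EXP002": 0.17}

# Join: attach to each readout the metadata needed to interpret it.
combined = [
    {"Experiment ID": exp_id, "Readout": value, **metadata[exp_id]}
    for exp_id, value in readouts.items()
]
for row in combined:
    print(row["Experiment ID"], row["Readout"], row["Compound ID"])
```

The whole scheme works precisely because every experiment fits the same four-column shape; the trouble starts when that stops being true.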
Then, a week later, you find out that the bench teams are going to start using a different cell line for some of the experiments. No problem - just add a cell line column. But over the next few weeks, as this keeps happening, you start to get nervous about the number of columns this previously simple table has accumulated.
Then one day, someone from the lab mentions they’re going to start running experiments where they’ll add the compound at two different time points. So should you just add another column? If they add more time points later, you don’t want to keep adding columns. And what if they add different concentrations at each time point? Different compounds?
The “proper” way to model this would be to create additional tables for treatments and time courses, and all sorts of things like that. (Normalization.) But that’s not going to work with a Google sheet, at least not in a way that the bench teams would actually fill out. And since you don’t know which directions of complication the bench teams will need, it could end up being overkill anyway.
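For a sense of what that normalization would mean, here’s a sketch (with hypothetical table and column names) of splitting the one wide table into an experiments table and a treatments table, one row per treatment event:

```python
# Hypothetical normalized layout. Instead of one column per detail,
# each treatment event becomes its own row in a separate table.
experiments = [
    {"experiment_id": "EXP003", "cell_line": "HEK293", "readout": 0.61},
]

# Any number of compounds, concentrations and time points fit here
# as extra rows, with no new columns needed.
treatments = [
    {"experiment_id": "EXP003", "compound_id": "CMP042",
     "concentration_um": 10, "added_at_hours": 0},
    {"experiment_id": "EXP003", "compound_id": "CMP042",
     "concentration_um": 5, "added_at_hours": 24},
]
```

This shape handles the two-time-point experiment gracefully, but notice what it asks of a bench scientist: filling in a second, cross-referenced table mid-experiment, which is exactly what a spreadsheet workflow makes painful.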
So this is the problem that I described last week: if you make the schema complex enough to consistently capture details for all the different possibilities, it becomes too complex for a person who’s in the middle of running an experiment to actually fill out.
And as bad as this is, remember that we started out with as simple an example as possible. Real labs with plate maps, registries and a dozen other complications are much, much worse.
The potential for LLMs to address this problem is two-fold: First, they can help bench teams translate their intuitive understanding of the experiment into a more complex and flexible schema. And second, they can help the data science teams interpret metadata that is in a structured, but not necessarily consistent, form, allowing the “schema” to be more flexible while remaining (slightly) less complex.
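To make the second point concrete, here’s a sketch of what “structured but not necessarily consistent” metadata might look like: two teams record the same kind of experiment in different shapes, and a small reader reconciles them after the fact. Both record shapes are hypothetical.

```python
# Hypothetical metadata records in two different (but both structured)
# shapes, as two lab teams might produce them.
records = [
    # One team records a single treatment as flat fields.
    {"experiment_id": "EXP001", "compound_id": "CMP042",
     "concentration_um": 10, "incubation_hours": 24},
    # Another team records an explicit list of treatment events.
    {"experiment_id": "EXP003",
     "treatments": [
         {"compound_id": "CMP042", "concentration_um": 10, "added_at_hours": 0},
         {"compound_id": "CMP042", "concentration_um": 5, "added_at_hours": 24},
     ]},
]

def treatment_events(record):
    """Normalize either record shape into a list of treatment events."""
    if "treatments" in record:
        return record["treatments"]
    # Flat records describe one treatment added at time zero.
    return [{"compound_id": record["compound_id"],
             "concentration_um": record["concentration_um"],
             "added_at_hours": 0}]

for rec in records:
    print(rec["experiment_id"], len(treatment_events(rec)))
```

Today a data engineer writes that reconciling function by hand for every shape that shows up; the promise is that an LLM could do much of that interpretation, so the bench teams don’t all have to converge on one rigid schema up front.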
We’re not there yet, but to me that direction seems very promising.
Sponsored Highlight: Briefly Bio
Last week, I sat down with Harry Rickerby from Briefly Bio to explore how they’re using LLMs to help scientists capture structured lab protocols and logs. Briefly allows scientists to write free text protocols, which it converts to structured instructions. They can then add details, create checklists for the lab, and record changes that they made for each experiment.
We discussed three scenarios where capturing this data in a structured form makes a huge difference:
Protocol handoffs: When someone new joins the team, or another team needs to start running an assay you developed, training them to run the assay in a consistent way that produces comparable data can be a huge time commitment. Having detailed protocol documentation slashes that time and ensures more reliable results.
Experiment reproducibility: Often, the most important experiment that your lab does turns out to be one of dozens or hundreds of similar experiments that were indistinguishable at the time you ran them. So when you ran this experiment, you probably weren’t paying close attention. When you go back and run it again to verify the result, having a detailed log of what you did could be the difference between a quick verification and months spent trying to reproduce a magic number.
Metadata capture: Like I described above, capturing lab protocols and logs, aka metadata, in a pre-defined tabular schema is complex and brittle. Capturing it in a structured, but flexible form allows data scientists to determine a structure that works after the fact.
This addresses a lot of the problems that I like to write about on this newsletter, and I’m excited to see more labs using Briefly.