It matters which details matter (to you).

Oct 04, 2023

Now that we’ve gotten through doing analysis and transferring instrument data, we’re finally at my favorite part: collecting metadata. You can think about a typical biology experiment as two parts: 1) You set up a biological system, whether it’s in a dish, in cells, or in an animal, with carefully chosen conditions. 2) You measure something about this system. The things that you measure in Step 2 become the data. The information about how you set up the system in Step 1 is the metadata.

But here’s the thing I’ve always found most frustrating about how most labs collect metadata: In theory, you should start recording metadata as soon as you start planning an experiment. I mean, the whole point of planning the experiment is to decide which conditions you’re fixing and how you’ll vary the rest. That’s the metadata. And yet most lab systems are set up to capture this metadata - in the form of plate maps and sample sheets - relatively late in the process. So, what gives?

My working theory is that it has to do with two different views of an experiment, each of which emphasizes different types of details. When I, as someone on the data side of things, think about metadata and experiment design, I think in terms of a static view of the experiment: here are the conditions, here are the outcomes. When a biologist, on the other hand, thinks about experiment design, they’re much more likely to think in terms of a process view: here are the steps I took to create the set of fixed and varying conditions, and then to measure the outcome. Each view is the best way for the person holding it to do their job.

So fine, we have two different views of the same information. We should be able to just write a script that translates from one to the other, right?

The problem is that the level of detail and the required flexibility/consistency are completely different between the two views. For example, in the static view any two cell lines are interchangeable. It’s just a name in a column. In the process view, every cell line requires a slightly (or not so slightly) different set of protocols to keep it growing and healthy to the end of the experiment. Capturing that complex protocol is the kind of thing that makes most developers get over their aversion to free-form text.

So now, converting between the process view and the static view means extracting that one field out of a very complex protocol description (which is probably mostly free-form text). Good luck automating that. (Yes, yes, LLMs… but would you trust them?) A biologist will have no problem doing it, but it involves switching their thinking to the static view. It makes sense that they put this off to the end.

This may be tangential, but it’s also interesting that this complexity makes the experiment planning process kind of confusing from the data scientist’s point of view. If cell lines are interchangeable, you should just pick one at the beginning of the planning process. But if each one requires a completely bespoke and finicky process, you shouldn’t select it until you know how it will fit in with the rest of the experiment process. That’s near the end.

So, this is not to say that a technical solution is impossible. It just requires a form of the process view that is structured enough to be reliably translated to the static view. But creating that structure while also keeping the flexibility required by the realities of biology is… a lot. Which is why there isn’t a silver bullet solution today.
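To make that a little more concrete, here’s a minimal sketch of what a "structured enough" process view might look like, and how it could be flattened into the static view. Everything in it is an assumption for illustration: the `Step` schema, the field names, the `to_static_view` helper, and the example wells and compounds are made up, not taken from any real tool. The free-form `notes` field stands in for the finicky protocol detail that the translation deliberately ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step in a hypothetical structured process view."""
    action: str                      # e.g. "seed" or "treat"
    wells: list[str]                 # wells this step applies to
    params: dict[str, object] = field(default_factory=dict)  # structured conditions
    notes: str = ""                  # free-form protocol detail, not translated


def to_static_view(steps: list[Step]) -> dict[str, dict[str, object]]:
    """Flatten ordered steps into one row of conditions per well."""
    rows: dict[str, dict[str, object]] = {}
    for step in steps:
        for well in step.wells:
            row = rows.setdefault(well, {})
            for key, value in step.params.items():
                # Two steps disagreeing about the same condition is an error,
                # not something to resolve silently.
                if key in row and row[key] != value:
                    raise ValueError(
                        f"conflicting values for {key!r} in well {well}"
                    )
                row[key] = value
    return rows


# Illustrative protocol; the compound names are placeholders.
protocol = [
    Step("seed", ["A1", "A2"], {"cell_line": "HEK293"},
         notes="Passage at 80% confluence; handle gently."),
    Step("treat", ["A1"], {"compound": "drug_x", "dose_um": 1.0}),
    Step("treat", ["A2"], {"compound": "vehicle", "dose_um": 0.0}),
]

sample_sheet = to_static_view(protocol)
for well, conditions in sample_sheet.items():
    print(well, conditions)
```

The translation is the easy part. The hard part is getting the cell line into `params` in the first place instead of leaving it buried in `notes`, and that’s exactly where the structure vs. flexibility tradeoff bites.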

Automation software is a good start. By definition, it has to allow scientists to write down the process view in a way that is both sufficiently detailed and sufficiently structured for a robot to do the work. It seems like there’s been a lot of progress here, but each available option still only captures certain steps or certain types of experiments. (To achieve the structure/consistency, they have to be limited in flexibility.) I’m also closely watching what they’re doing at Briefly Bio, which is trying to capture structured protocols independent of automation.

So I’m optimistic that we’ll get there. But I have no idea when.

Scaling Biotech is brought to you by Merelogic - an independent consulting firm that helps early stage biotech startups get their data under control so they can focus on the science.
