In biotech, data is only half the story.
As we continue working backwards through the stages and use cases that software for biotech needs to support, we’re getting into the neighborhood of one of my favorite topics/borderline obsessions, namely metadata. In the last few posts, I covered finding and accessing data. If this were any other industry, the step before that would be data collection. But in biotech, we find ourselves in a unique situation where metadata - the data about the data - not only plays a role as important as the data itself, but is collected separately from it. In the next few posts I’m going to cover how the two are collected. But first, we need to discuss how they’re merged back into a single, usable dataset.
Let’s start by exploring the difference between the two. By data, I mean readouts from an instrument such as a digital microscope, a sequencer, a flow cytometer, etc. The values in these readouts are associated with samples, but the instrument or assay doesn’t know anything about where the samples came from or how they were prepared, so the readout won’t tell you that. Instead it refers to them by a sample id, a well location, a barcode, or something like that.
When you analyze this data you don’t care about barcodes and sample ids, at least not directly. You care about the differences in sample preparation - treatments, concentrations, time points - all the information that the readout doesn’t know about. That’s the metadata.
From a technical perspective, replacing those ids with metadata should be pretty easy. In fact, there’s a name for it. If the instrument gives you a table with a “sample id” or a “well id” or a “barcode” column, and you have a metadata table with a corresponding column, lining those two tables up along those columns is called a join. The resulting table maps the sample preparation to the readout. Done.
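Here’s a minimal sketch of that join, assuming pandas and hypothetical column names (“well_id”, “intensity”, “treatment”) standing in for whatever your instrument and metadata tables actually use:

```python
import pandas as pd

# Readout from the instrument: values keyed only by a well id.
# The instrument knows nothing about what was in each well.
readout = pd.DataFrame({
    "well_id": ["A1", "A2", "B1", "B2"],
    "intensity": [0.91, 0.12, 0.88, 0.15],
})

# Metadata recorded at sample prep time: what actually went into each well.
metadata = pd.DataFrame({
    "well_id": ["A1", "A2", "B1", "B2"],
    "treatment": ["drug", "control", "drug", "control"],
    "concentration_um": [10.0, 0.0, 10.0, 0.0],
})

# The join: line the two tables up along the shared id column.
# The result maps sample preparation to the readout.
merged = readout.merge(metadata, on="well_id", how="left")
print(merged)
```

A left join keeps every readout row even if its metadata is missing, which, as we’re about to see, is the case that actually matters.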
And yet.
The problem is that when someone in the lab prepared those samples, they weren’t thinking about how you’re going to join the metadata to the data. They were thinking about how to get the experiment done on time. They were thinking about whether the instruments would work. They were thinking about a thousand tiny details that could make everything go wrong. So they put in the minimum possible effort to collect metadata because anything more would have been an irresponsible distraction.
Now you have the readout, but where’s the metadata table? That’s a problem, and the opportunity to solve it passed a few days or weeks ago when your colleague started the experiment. (These experiments are slow.) To make that join possible for the next experiment, you have to make it so that the minimum effort they exert to collect metadata in the future will get you the detail and consistency you need today. And that means some combination of changing what they consider the minimum possible effort and giving them tools so that minimum effort gets you what you need. That’s much harder than doing a table join. But it’s a topic for another day.
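What does it look like when the metadata falls short? A quick sketch, again assuming pandas and hypothetical column names: an outer-style join with an indicator column flags the readout rows that no metadata row accounts for.

```python
import pandas as pd

readout = pd.DataFrame({
    "well_id": ["A1", "A2", "B1"],
    "intensity": [0.91, 0.12, 0.88],
})

# Incomplete metadata: nobody recorded what went into B1.
metadata = pd.DataFrame({
    "well_id": ["A1", "A2"],
    "treatment": ["drug", "control"],
})

# indicator=True adds a "_merge" column saying which table each row came from.
merged = readout.merge(metadata, on="well_id", how="left", indicator=True)

# Rows marked "left_only" are readouts with no matching metadata.
missing = merged[merged["_merge"] == "left_only"]
print(missing["well_id"].tolist())  # → ['B1']
```

The join itself is trivial; the hard part is that no amount of code after the fact can fill in what B1 actually contained.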