Automating data pre-processing requires (you guessed it) metadata
Recently, on this journey backwards through the use cases that biotech software needs to support, I discussed custom/interactive analysis. But once you're seeing the same kinds of experiments, readouts, and analyses over and over again, you'll probably want to start automating the early/pre-processing stages of that analysis - evaluating data quality, running primary analysis, etc. This feels like a natural extension of the data transfer stage (which I'll cover next week). But it turns out the biggest bottleneck usually isn't the data but rather, following the recurring theme of this newsletter, the metadata. And that's what I want to discuss this week.
As a technical problem, automating analysis is - well, I've learned never to say that a technical problem is "easy" - but it's at least a well-explored problem with multiple established, accessible technical solutions. If you're using AWS, it's Lambda triggers and Batch jobs writing to S3 or your database of choice. If you're using one of the other cloud platforms that I don't know as well, there are equivalent services. If you're using on-premises compute… uhm… you probably shouldn't be? But even then, there are technical solutions.
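To make the AWS version concrete, here's a minimal sketch of the trigger pattern: a Lambda function fired by an S3 put event that submits an AWS Batch job pointing at the new file. The queue and job definition names are hypothetical placeholders, and the event-parsing helper is my own illustration, not part of any established pipeline.

```python
def job_parameters_from_event(event):
    """Extract bucket/key from an S3 put event and build Batch submit_job kwargs."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return {
        # Batch job names allow letters, digits, hyphens, underscores
        "jobName": key.replace("/", "-").replace(".", "-"),
        "jobQueue": "preprocess-queue",       # hypothetical queue name
        "jobDefinition": "fastq-preprocess",  # hypothetical job definition
        "containerOverrides": {
            "command": ["preprocess.sh", f"s3://{bucket}/{key}"],
        },
    }


def handler(event, context):
    """Lambda entry point: hand the incoming file off to a Batch job."""
    import boto3  # imported lazily so the module loads without AWS deps installed

    batch = boto3.client("batch")
    return batch.submit_job(**job_parameters_from_event(event))
```

The pure helper keeps the AWS call at the edge, which also makes the event parsing easy to unit-test without credentials.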
So, once you have the technical framework in place, the next step is to figure out the process that your data scientists/computational biologists/etc. are currently doing manually so that you can write scripts that do the same thing. And nine times out of ten, the first step they do is to track down a sample sheet or a plate map or some equivalent, and clean it up into a consistent form.
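That "clean it up into a consistent form" step is exactly the kind of thing the script version ends up doing first. A minimal sketch, assuming a CSV sample sheet and an invented synonym table (the header variants here are illustrative, not a standard):

```python
import csv
import io

# Hypothetical mapping from header variants we've seen to canonical names
CANONICAL = {
    "samplename": "sample_id",
    "sampleid": "sample_id",
    "wellposition": "well",
    "well": "well",
    "treatment": "condition",
    "condition": "condition",
}


def normalize_sample_sheet(text):
    """Parse a CSV sample sheet, mapping messy headers to canonical names."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        clean = {}
        for col, value in row.items():
            # Collapse case, spaces, and underscores before the synonym lookup
            key = col.strip().lower().replace(" ", "").replace("_", "")
            clean[CANONICAL.get(key, key)] = value.strip()
        rows.append(clean)
    return rows
```

In practice the synonym table grows with every new instrument and collaborator, which is precisely why people end up doing this step by hand.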
That’s right - even these very early, very simple steps require metadata. To create a quality report on your sequencing data, you’ll want to compare housekeeping genes across samples, which means you need to know what the samples are. To extract features from your high-content imaging (HCI) data, you need to know which channels are which so you know how to segment them. The list goes on.
So, maybe that means the first step in the automation should be asking ChatGPT to find and clean up the sample sheet. And sure, ChatGPT has an API that you could call. But even if you trusted the results, I don’t think we’re there yet.
Assuming you don’t want a large language model solving your metadata problem, you need the metadata to already be in a consistent place and form by the time data comes out of the instrument and triggers your automated pipeline. And in theory that should be doable - someone has known that information for days or weeks by then. They knew it before they started the cell culture and picked up the first pipette. It’s just been stored in slide decks and meeting notes, or maybe an Excel file on someone’s laptop.
All this means that the technical solution isn’t enough to start automating these early stages of analysis. You need to flip how your team manages metadata. Instead of someone waiting until the instrument spits out data to start pulling all the pieces together, you need to do that before the instrument even starts running, ideally long before. Start registering experiments in a central, consistent place from the start. Then when the data hits, the first stage of your automated pipeline is to look up the experiment. After that, the rest is… well, not easy, but doable.
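What that first "look up the experiment" stage could look like, as a minimal sketch: the registry here is a plain dict standing in for whatever database or LIMS you register experiments into, and the run-id-prefix convention is an assumption for illustration.

```python
# Stand-in for a real experiment registry (a database, LIMS, etc.)
REGISTRY = {
    "run42": {"assay": "rnaseq", "samples": ["s1", "s2"], "owner": "alice"},
}


def lookup_experiment(s3_key, registry=REGISTRY):
    """First pipeline stage: map incoming data to its registered experiment."""
    # Assumes object keys are prefixed by run id, e.g. "run42/sample.fastq.gz"
    run_id = s3_key.split("/", 1)[0]
    experiment = registry.get(run_id)
    if experiment is None:
        # Fail fast: unregistered data should alert a human, not get guessed at
        raise LookupError(f"no registered experiment for run {run_id!r}")
    return experiment
```

The important design choice is the failure mode: if the experiment wasn't registered up front, the pipeline stops immediately instead of producing an unlabeled analysis someone has to untangle later.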