Tracking analysis starts with process but ends with tooling
This week, it’s back to my series of posts on the kinds of functionality that a data-focused biotech needs from their systems. Last time I wrote about gathering all the evidence that you've collected into a narrative that allows you to make and document the decisions that go into an IND. This week I want to explore how that evidence gets there in the first place - that is, what your data scientists or bench scientists do when they're done with their analysis and want to share it.
I think it’s worth splitting these functional requirements into two buckets. The first is consistency: we need to ensure that the outputs reliably end up in a place where we can find them, following organizational conventions so that we will know what they are later. The second is provenance: we need to be able to trace every output to the experiment or algorithm that produced it. Following a common theme on this newsletter, one of these is more of a process problem, while the other is more technical.
For enforcement, you can train users to follow conventions, but there will always be typos, new users who haven't learned the conventions yet, and old users who have forgotten. So it's best to automate as much of this as you can. If your users are writing code or using internally built tools, you can program in conventions that fit the specifics of your pipeline. But for off-the-shelf tools, it’s hard for a vendor to build in enough customizability to cover those specifics. So vendors tend to build in flexibility instead. This is the best option under the circumstances, but it means you’re back to training users.
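To make the automation idea concrete, here's a minimal sketch of what a programmed-in convention check might look like. The naming convention itself is entirely hypothetical - yours will reflect your own pipeline - but the point is that a gate like this can run at upload time and reject a bad name, instead of relying on training alone:

```python
import re

# Hypothetical convention: <project>_<assay>_<YYYY-MM-DD>_<slug>.csv
# (illustrative only - substitute whatever your organization uses)
OUTPUT_NAME = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<assay>[a-z0-9]+)_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"(?P<slug>[a-z0-9-]+)\.csv$"
)

def check_output_name(filename: str) -> bool:
    """Return True if the file name follows the convention."""
    return OUTPUT_NAME.match(filename) is not None
```

A check this small catches exactly the failure modes above - typos, users who never learned the convention, users who forgot - because the machine applies it every time, uniformly.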
When it comes to provenance, the goal is that someone else in the future will be able to review what was done, check that it produced the results that were recorded, and (ideally) apply the same analysis to future datasets. Today, there are standard ways to track analysis, namely version control, and an abundance of ways to track data. These are separate systems because data and logic change in fundamentally different ways and at different times. So provenance is a problem of coordinating these two systems.
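One minimal way to coordinate the two systems is to write a small provenance record alongside every output, pairing a content hash of the data with the revision of the code that produced it (e.g. a git commit). All names and fields here are illustrative, not a prescription:

```python
import hashlib
import json

def sha256_bytes(data: bytes) -> str:
    """Content hash identifying an exact version of a dataset."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(output_name: str, output_bytes: bytes,
                      code_version: str, input_hashes: list) -> str:
    """Bundle what a future reviewer would need: the output's hash,
    the code revision that produced it, and the hashes of its inputs."""
    record = {
        "output": output_name,
        "output_sha256": sha256_bytes(output_bytes),
        "code_version": code_version,   # e.g. a git commit SHA
        "input_sha256": sorted(input_hashes),
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

With records like this, someone in the future can check out `code_version`, verify the input hashes match the datasets on hand, rerun the analysis, and confirm the output hash - which is exactly the review-and-reproduce goal described above.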
Here, off-the-shelf tools actually have an advantage over custom solutions. Once you’ve solved the enforcement problem, tracking provenance becomes more or less the same at every organization. The complexity is much more about technical problems, which means a software vendor can deal with the headache once and solve the problem for all their customers.
So when it comes to tracking data and analysis, the build vs buy decision is more or less a wash. Building makes it much easier to automate enforcement, while buying makes it much easier to track provenance. The decision ultimately depends on who’s doing the analysis and what tools they’re using. But more on that next time.