Biotech startups are the craft brewers of data generation
A few weeks ago on this newsletter I kicked off a series of posts about how biotechs can build systems that let their data scientists and computational biologists focus on high-priority, high-value work. (To read an early version, click here.) So far, I’ve covered what these systems need to do and why efforts to build them often fail. Now I want to start getting into how those requirements and failure modes shape what these systems should look like and how we should go about building them.
This week I’m going to start with the first of the requirements: allowing teams to work reproducibly, so that every analysis can be verified and every insight can be traced back to its source. And to avoid the failure modes that stem from only thinking about the technical layer of the system, I’m going to start with the process layer. But before we can get into that, we need to talk about what makes biotech data different from the data that most tech industry tools were built for.
Most of the databases and other data infrastructure that you hear about were built for big tech companies, or for small tech companies that wanted to become big tech companies. So they’re built for the kinds of user-focused data that these companies care about: as users interact with their apps, they generate data in consistent forms that needs to be captured and organized. Every day you get more records of the same handful of types about different users and different interactions. The streams of data are all related to each other by the users that created them, the groups or projects they were working on, etc. You have a limited number of these streams, but each one is expected to grow indefinitely.
In biotech, on the other hand, data comes in from individual experiments which are related to each other, but in more complicated ways that often aren’t captured in a structured form. Sometimes you’ll do the “same” experiment with a few things changed, but this mostly comes later in a startup’s lifetime. So the data from each experiment is effectively a hand-crafted, artisanal, small batch of data.
And when I say small, I mean both technically and in terms of supply and demand. Generating this data is expensive, so there are always tough decisions about what to include in each experiment and what to leave out. These are the equivalent of small-run, individually numbered collector’s editions of whatever you want to use as your metaphor. And sure, some of these datasets can be hundreds of gigabytes, particularly sequencing data and high-content imaging. But let’s be honest: today’s cloud platforms wouldn’t even break a sweat on any of that. There are a few things that are legitimately big, like population-level whole genome sequencing, but very few startups are doing that. The data you have is small.
Because the data comes in these small batches, the processing and analysis of this data usually takes the form of a series of discrete steps, sprinkled with exploratory analysis and visualizations to verify the validity of both the data and the analysis. So each experiment leads to multiple discrete datasets, each of which is a processed and concentrated form of the last one.
So, working reproducibly means being able to find, track, and (if necessary) recreate each data package starting from the instrument data.
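To make "find, track, and recreate" a little more concrete, here is a minimal Python sketch of one way to record where each data package came from. It is an illustration under assumptions, not a recommended design: the names `DataPackage`, `Registry`, `fingerprint`, and the example step names are all hypothetical, and a real system would persist these records somewhere rather than hold them in memory.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(data: bytes) -> str:
    """Content hash, so a package can be identified by what it contains."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class DataPackage:
    # All field names are hypothetical, chosen for this sketch only.
    name: str
    content_hash: str
    step: str            # the step that produced it ("instrument" for raw data)
    parents: tuple = ()  # content hashes of the packages it was derived from


class Registry:
    """In-memory record of packages and the packages they were built from."""

    def __init__(self):
        self._packages = {}

    def register(self, name, data, step, parents=()):
        pkg = DataPackage(
            name=name,
            content_hash=fingerprint(data),
            step=step,
            parents=tuple(p.content_hash for p in parents),
        )
        self._packages[pkg.content_hash] = pkg
        return pkg

    def lineage(self, pkg):
        """Return every ancestor of pkg, ordered from instrument data to pkg."""
        seen, order = set(), []

        def visit(p):
            if p.content_hash in seen:
                return
            seen.add(p.content_hash)
            for parent_hash in p.parents:
                visit(self._packages[parent_hash])
            order.append(p)

        visit(pkg)
        return order


# Example: one experiment yields three successively more processed packages.
registry = Registry()
raw = registry.register(
    "plate_reader_raw", b"well,signal\nA1,0.91\nA2,0.13\n", step="instrument"
)
normalized = registry.register(
    "normalized", b"well,signal\nA1,1.00\nA2,0.14\n", step="normalize", parents=[raw]
)
summary = registry.register(
    "hit_summary", b"well\nA1\n", step="call_hits", parents=[normalized]
)

trace = [(p.name, p.step) for p in registry.lineage(summary)]
print(trace)
```

Identifying packages by a hash of their contents means the trail still holds if a file gets renamed or moved, and the recorded steps give you the order in which to rerun things if a package ever needs to be recreated.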
The processes, conventions and tools needed to do this aren’t unreasonably complex. But they need to be deliberately planned, and the tools from the tech industry, on their own, aren’t going to do the job. So next week I’ll begin to plot out the conceptual components of a system that makes this possible, then look at options for the technical components to support them. Stay tuned!
Can’t wait for next week? If you want to start building a system that will unblock your biotech data team today, check out Merelogic’s System Design Program. Within 2-6 weeks, you’ll have a functioning system that stops your problems from compounding and lets your data team focus on what they were hired for. The longer you wait, the harder it will be. Get started today!