Data infrastructure is like a joke - it's all about timing
When I started writing this newsletter a few years ago, most of my experience in Biotech was with series B/C startups, so I mostly wrote about what startups at that stage need. In fact, the word "scaling" in the name refers to a major concern at series B that is only a distant thought at the A, seed and pre-seed stages. But having more recently spent more time with these earlier-stage startups, I have a better understanding of how these needs evolve, and this week I want to discuss the different needs at different stages when it comes to infrastructure, data and particularly metadata.
I like to think of it in terms of four stages defined by the primary concern at the time. Each time you hit an inflection point that takes you to the next stage, the concern doesn't go away, you just layer a new concern on top, kind of like a Maslow hierarchy. These four stages correspond roughly to pre-seed, seed, A and B, but of course it all depends on the particular startup (and what investors are asking for this week).
The stages are: Reliability, Reuse, Consistency and Efficiency. Let's go through each one.
Reliability
Every biotech starts with a hypothesis that is risky enough that no one else has been willing to test it. So for the first phase, your goal is to test that hypothesis to decide if it's worth investing more time and money (yours and your investors’) in the science. If the answer is no, you're going to throw everything away. So you really want to avoid long-term investments like data infrastructure. Luckily, if the answer is yes, you're going to run confirming experiments with more accurate assays on anything you discover (and still throw away the original data), so that's OK.
This also applies to things like assay development that you do at any stage, where you're going to use the data once, then never look at it again. The data needs to be reliable because you're going to make some expensive decisions with it. But beyond that it just needs to be fast and cheap.
Reuse
Once you've made the go decision, your next goal will be to show that those early experiments weren’t a fluke. This is often around when you raise a seed round. It's also when you start generating data at a scale and reliability where you don't want to ever have to run the same experiment again. And even if you don't think you'll want to look at that data again in the future, someone you haven't hired yet probably will.
I have spent weeks of my time at series B startups chasing down hints and rumors about datasets from six months ago before accidentally stumbling onto the one person who knew where they were. (In one case, this was an offshore contractor who we happened to start working with again.)
You don't know what they're going to want to do with it, or what form they'll expect it in. Your assays and protocols are still evolving anyway, so they're going to need to do some cleanup no matter what. All that matters is that the data (and metadata) is in a place they'll be able to access, and that they know where to look for it. This could be as simple as a shared spreadsheet with a URL or a file path, as long as it's the only such spreadsheet, it's consistently updated, and everyone knows to start there.
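To make the "one shared spreadsheet" idea concrete, here's a minimal sketch of what such a dataset index amounts to in code. Everything here is illustrative: the column names, the dataset names, and the storage locations are all made up, and in practice this would live in an actual shared spreadsheet or CSV rather than a Python list.

```python
# Hypothetical minimal dataset index: one record per dataset, with just
# enough metadata to answer "where is the data from that experiment?"
# All field names and values below are illustrative, not a standard.

def register_dataset(index, name, assay, date, location, owner):
    """Append one dataset record to the shared index."""
    index.append({
        "dataset_name": name,
        "assay": assay,
        "date": date,
        "location": location,  # URL or file path where the data lives
        "owner": owner,        # the person to ask about it
    })

def find_datasets(index, assay):
    """Return the recorded location of every dataset for a given assay."""
    return [r["location"] for r in index if r["assay"] == assay]

index = []
register_dataset(index, "plate-007", "qPCR", "2023-04-02",
                 "s3://example-bucket/qpcr/plate-007/", "dana")
register_dataset(index, "plate-008", "qPCR", "2023-04-09",
                 "s3://example-bucket/qpcr/plate-008/", "dana")
print(find_datasets(index, "qPCR"))
```

The point isn't the implementation; it's that the index is a single, well-known place that maps experiments to locations and owners, so no one has to chase rumors six months later.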
Consistency
Once you've shown that those early results are reproducible and you have an idea of what assays and experiments you’ll run to explore further, you’ll also want to start comparing results between different runs of the same assay. At first, this may only be a handful of runs that only come in every few weeks, so automating is going to take more effort than it will save you. But you need to make sure that the data ends up in a consistent enough form that you’re not spending half your analysis time fighting with data issues. This tends to coincide roughly with raising a Series A, but again it depends on the startup.
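"Consistent enough" can be as simple as a short checklist you run against each new batch of data before analysis. Here's a hedged sketch of that idea; the expected columns and failure conditions are invented for illustration and would depend entirely on your assay.

```python
# A lightweight consistency check of the kind that pays off at this
# stage: before analyzing a new assay run, confirm it has the same
# columns and basic completeness as every previous run. The column
# names here are hypothetical.
EXPECTED_COLUMNS = {"sample_id", "concentration_nM", "readout"}

def check_run(rows):
    """Return a list of human-readable problems; an empty list means
    the run is consistent enough to analyze."""
    problems = []
    for i, row in enumerate(rows):
        missing = EXPECTED_COLUMNS - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
        elif row["readout"] in ("", None):
            problems.append(f"row {i}: empty readout")
    return problems

good = [{"sample_id": "S1", "concentration_nM": 10, "readout": 0.42}]
bad = [{"sample_id": "S2", "readout": ""}]
print(check_run(good))  # []
print(check_run(bad))
```

At a handful of runs per month, running a check like this by hand (or semi-manually) is fine; the win is that every run that passes it can be compared against every other run without a cleanup step first.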
Efficiency
At some point, the rate at which you’re running experiments and generating datasets will get to the point where doing some work to automate the process will save you more time and effort than it will cost. This scale-up tends to happen around Series B. And when you look at where this inefficiency happens, you’ll notice that a lot of it has to do with cleaning up metadata.
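A lot of that metadata cleanup is mechanical, which is exactly why it's worth automating at this stage. Here's a small sketch of one such step, normalizing field names and date formats that drift across months of hand-entered records. The alias table and formats are assumptions for illustration, not a real schema.

```python
# Hypothetical automated metadata cleanup: map the field-name variants
# that accumulate in hand-entered records to canonical names, and
# normalize common date formats to ISO 8601. All names are illustrative.
from datetime import datetime

ALIASES = {
    "Sample ID": "sample_id",
    "sample": "sample_id",
    "Date": "run_date",
    "run date": "run_date",
}

def normalize_record(raw):
    """Rename known field variants to canonical names, pass the rest
    through unchanged, and normalize the run date if present."""
    out = {ALIASES.get(key, key): value for key, value in raw.items()}
    if "run_date" in out:
        for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
            try:
                out["run_date"] = (
                    datetime.strptime(out["run_date"], fmt).date().isoformat()
                )
                break
            except ValueError:
                continue  # try the next known format
    return out

print(normalize_record({"Sample ID": "S1", "Date": "4/2/2023"}))
```

Each individual fix is trivial; the efficiency gain comes from running hundreds of them automatically on every incoming dataset instead of rediscovering them one analysis at a time.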
At each of these stages, investing too much in data infrastructure is a waste, but not investing enough will make it much harder to transition through the next inflection point. The key is to plan and build for these transitions so that both your processes and tooling are ready to be updated when you are. I’ve written previously about building infrastructure in a way that allows this, and I’ll return to this idea regularly as we continue to explore the software that supports biotech research in upcoming posts.
Scaling Biotech is brought to you by Merelogic - an independent consulting firm that helps early stage biotech startups get their data under control at every inflection point so they can focus on the science.