The 3 things data infrastructure needs to do
So, right on the heels of last week’s newsletter opening where I joked about writing listicles, it seems that this week is actually going to be a listicle of sorts. Though in my defense, I think you should need more than three things to qualify as a listicle. But I’ll let you be the judge.
First some context
Based on my first few projects through Merelogic, I’ve put together an approach to building biotech data infrastructure that I believe can consistently eliminate technical obstacles and free up data science/comp bio teams to focus on what’s important. In addition to creating services to help startups with this through Merelogic, I’m working on a long-form guide to the process that I’m planning to release publicly in the next few months.
In the meantime, I’m planning to write about most of what’s in this guide little by little in these weekly posts. But I’m also looking for early readers of the guide itself, who can give me feedback to make it even better. If you’re leading a data team at a biotech startup and want to get an early look at the guide, fill out this form (or contact me through other means) and I’ll send you a copy.
Now the listicle
A large part of building data infrastructure for a biotech startup is knowing what not to build. When you’re building the tracks ahead of a moving train, everything feels urgent. Everything is a priority, which means nothing is a priority. So the first step is to decide what matters and what doesn’t. Below are three things that you absolutely need to get right. And if you can get these right, much of the rest will follow.
This list is based on my earlier post on what startups need at each stage of development. So depending on where you are today, you may only need the first one or two of these.
Your data infrastructure needs to enable your data team to:
1. Work reproducibly
Your system needs to ensure that every analysis can be verified and every insight can be traced back to its source. So you need a way to not just catalog the data but also keep track of where it came from and how it was produced. It should be possible for anyone on the team to rerun an earlier analysis, verify that the output is the same, then modify it or apply it to other datasets. I’ve mentioned managing data like code previously, and that’s a large part of it.
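To make "traced back to its source" a bit more concrete, here is a minimal sketch of a provenance record, assuming each analysis is a plain Python function that takes raw bytes plus parameters. The names `run_with_provenance`, `is_reproducible`, and `mean_signal` are made up for illustration and don't refer to any particular tool.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the exact bytes given."""
    return hashlib.sha256(data).hexdigest()


def run_with_provenance(analysis, input_bytes: bytes, params: dict) -> dict:
    """Run an analysis and record what is needed to trace and rerun it."""
    output = analysis(input_bytes, **params)
    return {
        "analysis": analysis.__name__,
        "input_sha256": content_hash(input_bytes),
        "params": params,
        # Serialize with sorted keys so equal outputs always hash the same.
        "output_sha256": content_hash(json.dumps(output, sort_keys=True).encode()),
    }


def is_reproducible(record: dict, analysis, input_bytes: bytes) -> bool:
    """Rerun with the recorded parameters and compare against the record."""
    return run_with_provenance(analysis, input_bytes, record["params"]) == record


def mean_signal(raw: bytes, scale: float = 1.0) -> dict:
    """Hypothetical analysis: scaled mean of comma-separated readings."""
    values = [float(x) for x in raw.decode().split(",")]
    return {"mean": scale * sum(values) / len(values)}


record = run_with_provenance(mean_signal, b"1.0,2.0,3.0", {"scale": 2.0})
print(is_reproducible(record, mean_signal, b"1.0,2.0,3.0"))
```

A real system would also record the code version (for example a commit hash) and where the input is stored, but the idea is the same: anyone holding the record can rerun the analysis and check that nothing changed.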
2. Drive decision making
Because biology data is noisy, you usually need to combine multiple sources of information to find real signal among the noise. So most of the significant decisions that your scientists make require combining multiple datasets. In other words, to drive decision making, your data systems must allow you to merge data from multiple assays and sources so your data team can share the right information at the right time, in a form that the entire organization can leverage.
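As a rough illustration of merging data from multiple assays, the sketch below joins per-assay rows into one row per sample. It assumes every assay reports a shared `sample_id` field. The assay names and measurement fields are invented for the example.

```python
from collections import defaultdict


def merge_assays(assays: dict, key: str = "sample_id") -> list:
    """Combine rows from several assays into one row per sample.

    `assays` maps an assay name to a list of row dicts. Each measurement
    is prefixed with its assay name so its origin stays visible.
    """
    merged = defaultdict(dict)
    for assay_name, rows in assays.items():
        for row in rows:
            sample = row[key]
            for field, value in row.items():
                if field != key:
                    merged[sample][f"{assay_name}.{field}"] = value
    # Samples missing from an assay simply lack that assay's columns.
    return [{key: sample, **merged[sample]} for sample in sorted(merged)]


# Hypothetical data from two assays run on overlapping samples.
assays = {
    "viability": [
        {"sample_id": "S1", "pct_alive": 0.9},
        {"sample_id": "S2", "pct_alive": 0.4},
    ],
    "expression": [{"sample_id": "S1", "gene_x": 12.5}],
}
for row in merge_assays(assays):
    print(row)
```

In practice you would more likely do this with a dataframe library or a database join. The part that matters is the shared, consistently named sample identifier, which is what makes the merge possible in the first place.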
3. Scale
Because you’re a startup, you expect everything you do to grow, and for the growth to only accelerate. That means 1. and 2. are only going to get harder. To scale, you need to address the bottlenecks, and if you’ve been following this newsletter, you probably know what I’m going to say next: The bottleneck is almost always metadata. So for your system to scale (when your startup gets to that point) you need robust systems for getting data and metadata from their sources (your lab, CROs, public datasets, etc.) into a form where your data team can work reproducibly and drive decision making.
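One way to keep metadata from becoming the bottleneck is to check it at the point of ingestion rather than during analysis. Here is a minimal sketch of that idea. The required fields and the list of allowed sources are assumptions chosen for illustration, not a recommended schema.

```python
# Hypothetical schema: adjust to whatever your team actually needs.
REQUIRED_FIELDS = {"sample_id", "assay", "source", "collected_on"}
ALLOWED_SOURCES = {"lab", "cro", "public"}


def validate_metadata(record: dict) -> list:
    """Return a list of human-readable problems (empty if the record is fine)."""
    problems = [
        f"missing field: {field}"
        for field in sorted(REQUIRED_FIELDS - record.keys())
    ]
    source = record.get("source")
    if source is not None and source not in ALLOWED_SOURCES:
        problems.append(f"unknown source: {source}")
    return problems


def ingest(records: list) -> tuple:
    """Split incoming records into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_metadata(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


incoming = [
    {"sample_id": "S1", "assay": "viability", "source": "lab",
     "collected_on": "2024-01-15"},
    {"sample_id": "S2", "source": "vendor"},
]
accepted, rejected = ingest(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Rejecting a record with a clear reason at ingestion time is much cheaper than discovering the gap months later, when someone tries to combine that dataset with another.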
Obviously there’s a lot of nuance and complexity to all of these. And there’s a big gap between saying that these are the priorities and actually building the system that supports them. But as you’ll see in the coming weeks, you can use these three priorities to organize the overall system based on which functionality it supports. And if a component doesn’t support any of them, it doesn’t need to be in your system.
But that will have to wait. Stay tuned for more!