Can you manage data like code?
*** A quick plug: The data team at ElevenTx just published a blog post about how they replaced their ELN with Notion and I love everything about it. It’s a very thoughtful and detailed post, and even if you’re not shopping for a new ELN, it does a great job of exploring what an ELN is actually for and how it should fit into a biotech. ***
Based on the number of new subscribers I got in the last week, I think my post about directory structures hit a nerve with a few people. So I figured I'd keep the momentum going and write about a related idea: the models teams use to think about shared data, and thus how they design the infrastructure and processes around it. There are a few different ways that folks approach this, and while I usually try to take an objective, pros-and-cons approach to decisions like this, in this case I think there’s one that’s objectively better. So I’m going to break form a bit and tell you - or at least some of you - that you’re doing it wrong.
There are three models I've seen, which I'm going to call shared hard drive, backup, and version control. Here they are:
Shared Hard Drive
In this model, you deploy a networked storage device that you can mount to multiple machines, and everyone on the team uses it as if it were an extension of their own hard drive. Or, at least, an extension of their hard drive that other people can access and change when they're not paying attention, or while they're in the middle of reading a file.
The most common implementation I've seen is AWS EFS, but you can also do this with on-prem NAS drives or the managed equivalents on other clouds (Google Cloud Filestore, Azure Files). I've also heard of teams attempting to mount S3 buckets directly, and some SaaS services going to great lengths to make this possible. But I've never gotten the sense that it really worked.
As you might've guessed, I don't like this approach. The shared drive is halfway between a local hard drive and a file store, and it usually ends up being the worst of both rather than the best: slower than a local drive, so you end up copying data to local disk a lot of the time, but also a lot more expensive than other options, so you don't want to store larger datasets on it long-term. I wouldn't do this one.
Backup
In this model, you store the data in a distributed file store (AWS S3, Google Cloud Storage, Azure Blob Storage), downloading it to your local hard drive when you want to use it, then uploading the results when you're done. So you get the speed of local files while you're using the data, and when you're not, you can store it relatively cheaply. Best of both, right?
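In practice, this round-trip is simple enough to script. Here's a minimal sketch that just builds the relevant `aws s3 sync` commands (the bucket and dataset names are made up for illustration; adapt them to your own conventions):

```python
# Sketch of the "backup" workflow: pull a dataset down to local disk
# before analysis, push results back up afterwards. The bucket name
# and path layout here are hypothetical.

BUCKET = "my-team-data"  # hypothetical bucket name

def pull_cmd(dataset: str, local_dir: str) -> list[str]:
    """Command to download a dataset from the shared file store."""
    return ["aws", "s3", "sync", f"s3://{BUCKET}/{dataset}", local_dir]

def push_cmd(local_dir: str, dataset: str) -> list[str]:
    """Command to upload results back to the shared file store."""
    return ["aws", "s3", "sync", local_dir, f"s3://{BUCKET}/{dataset}"]

# To actually run these, hand them to subprocess.run(cmd, check=True).
print(" ".join(pull_cmd("scrna/run-042", "./data/run-042")))
print(" ".join(push_cmd("./results/run-042", "scrna/run-042/results")))
```

`aws s3 sync` only copies files that have changed, which keeps repeated pulls and pushes cheap.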
Except there's still something missing here. If everyone on the team is just using the file store like a backup drive, or as a way to transfer files (a glorified thumb drive), then it's going to turn into a mess. What you really need is a more deliberate approach to keeping the shared store organized, and here's where we can learn from the mistakes of our software engineer ancestors.
Version Control
Back in the old days, it was common to treat source code the way many biotech teams treat data today: Drop it somewhere convenient on your hard drive, occasionally backing it up or emailing it to your teammates to work on. Version control systems like git were created to fix a lot of the issues this created, and have now become standard. So why not do this for data, too?
No, I'm not suggesting you store all your data on GitHub/GitLab. In fact, I'm begging you not to. Data doesn't change the way code does, so version control proper isn't the right approach. But there are tools out there designed for data, which let you sync data between your local drive and a shared file store using push/pull dynamics similar to git. (Quilt is a particularly popular one for biotechs.) These tools often take care of tracking metadata too, so you don't have to rely on directory structures.
Whether you use one of these off-the-shelf tools or put together your own system of scripts and shared processes, the most important thing is to be deliberate and consistent. Those who do not learn from history are doomed to repeat it.
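If you do roll your own system of scripts, the bookkeeping matters more than the transfer itself: record what you pushed so teammates can verify what they pull. Here's a toy sketch using only the standard library; the manifest format is invented for illustration, and it's a crude version of what tools like Quilt handle for you:

```python
# Toy sketch of roll-your-own data bookkeeping: before pushing a
# directory to the shared store, write a manifest recording each
# file's checksum, so teammates can verify what they pull.
# The manifest format here is invented for illustration.
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Checksum one file in chunks (datasets can be large)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(data_dir: Path) -> dict:
    """Record a checksum for every file under data_dir, then save
    the manifest alongside the data so it gets pushed with it."""
    manifest = {
        str(p.relative_to(data_dir)): file_sha256(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file() and p.name != "manifest.json"
    }
    (data_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

On pull, recomputing the same checksums tells you whether your local copy matches what was pushed, which is most of what you need to keep a shared store trustworthy.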
If you could use some help being deliberate and consistent with how you manage data, check out my consulting company, Merelogic, where I work with biotech startups to design processes, conventions and digital tools to break through the tech problems that come between them and the science.