The three core questions every biotech data team must answer
Every biotech data team will at some point face the question “Where is the data, and where did it come from?” Sometimes it’s when a new team member joins and can’t wait to get started. More often it’s when a team member leaves, and the team suddenly realizes their bus factor was 1. The data is scattered across hard drives and shared drives, and everyone on the team uses a slightly different collection of tools and libraries. Finding data is a matter of asking the right person, and figuring out how to reproduce it is a matter of archaeology.
This is the kind of situation that gets harder to address the longer you wait. And, in fact, addressing it is mostly a matter of making three key decisions (maybe easier said than done?) and documenting what you decided. In this post, I’ll go over what these decisions are and the most common options to choose from.
This is an exercise you can go through with your team over a series of meetings. But if you want a little help, you can also apply for my Biotech Data Tools Clinic where I’ll walk your whole team through the process. (I was previously calling this the Data Reliability Workshop but I think that name was too ambiguous.) While I’m rolling out this new program, I’m offering a 40% discount for anyone who applies by November 22. Check it out and sign up today!
Here are the three questions:
1. What compute environments will you use?
I’ll break this down into two more specific sub-questions:
1a) Where will you run analysis, scripts, etc.?
Options include laptops, cloud-based VMs (EC2, GCE, etc.), serverless systems, and hosted notebook SaaS products. You’ll probably want different options for different situations: interactive analysis vs. developing software vs. automated runs. Depending on your situation, some options will be better than others. Some will be a toss-up. What matters is that you limit your team to a small number, because that will help with the second sub-question:
1b) How will you make sure members of your team can reproduce each others’ work?
This is partially a matter of deciding which tools/languages/libraries to use. (Python or R? TensorFlow or PyTorch?) But it’s also a matter of deciding how you’ll ensure that everyone sets up their environment consistently. This could be a lightweight approach like an instructions doc or a setup script, or something more production-grade like Docker images or AMIs. If you use a hosted notebook SaaS product, it will have its own way of doing this, but you may still need to make decisions about how to use it: which libraries, which naming schemes, and so on.
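As an example of the lightweight end of that spectrum, here’s a minimal sketch of a version-pinning check in Python. The package names and versions below are hypothetical placeholders; the point is that the team’s agreed-upon environment lives in one place that a script can verify.

```python
# Minimal sketch: verify the local environment against the team's pinned
# versions. The package names and versions here are hypothetical examples.
from importlib.metadata import version, PackageNotFoundError

PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
}

def check_environment() -> list[str]:
    """Return a list of mismatches between installed and pinned versions."""
    problems = []
    for package, expected in PINNED.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            problems.append(f"{package}: not installed (want {expected})")
            continue
        if installed != expected:
            problems.append(f"{package}: have {installed}, want {expected}")
    return problems

if __name__ == "__main__":
    issues = check_environment()
    if issues:
        raise SystemExit("Environment drift detected:\n" + "\n".join(issues))
    print("Environment matches the pinned versions.")
```

A script like this won’t give you Docker-level guarantees, but it makes drift visible, which is most of the battle at small scale.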
Migrating from an imperfect option to a better one is generally easier than waiting and starting from scratch. A clunky install script will at least tell you what needs to go into the Docker image, sparing you the archaeology expedition. So just make a decision. You can always revisit it later if you need to.
2. How will you store (and later find) data?
Again, this can be broken down into a number of sub-questions:
2a) What kind of shared storage will you use?
Options include cloud buckets, Egnyte, SharePoint, Google Drive, etc. Many of these can be accessed both through a graphical interface and through an API, but they’re typically optimized for one or the other. The option that’s best for you will depend on which matters more to members of your team. Some SaaS platforms come with their own option for shared storage, which answers this question, but not necessarily the rest.
2b) How will you transfer this data to local drives or access it remotely?
If your storage solution has an API, you can use scripts or a CLI. But will you design a custom shared library, or let everyone write their own? Will you use something off the shelf like Quilt?
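To make the shared-library option concrete, here’s a minimal sketch in Python, assuming your storage is an S3-style cloud bucket. The bucket name is made up; the boto3 calls are the standard ones.

```python
# Minimal sketch of a team-wide data access helper for an S3-style bucket.
# The bucket name is a hypothetical placeholder.
import boto3

BUCKET = "acme-bio-shared-data"
s3 = boto3.client("s3")

def fetch(key: str, local_path: str) -> str:
    """Download one object from the shared bucket to a local file."""
    s3.download_file(BUCKET, key, local_path)
    return local_path

def publish(local_path: str, key: str) -> str:
    """Upload a local file to the shared bucket under an agreed key."""
    s3.upload_file(local_path, BUCKET, key)
    return f"s3://{BUCKET}/{key}"
```

The value of a wrapper this thin isn’t the code; it’s that everyone fetches and publishes data the same way, through one module you can later instrument or swap out.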
2c) How will you organize data in the shared drive?
By date? By project? By data type? By a unique, non-interpretable hash linked to an external database, because putting metadata in path names is bad? These are all valid options (even that last one) as long as you’re consistent about it. The best one will depend on your science and your pipeline, so you’ll have to come up with your own answers no matter what.
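One trick that makes any scheme easier to stick to: route all path construction through a single helper function, so the convention lives in code instead of in everyone’s heads. A minimal sketch, using a hypothetical project/assay/date scheme:

```python
# Minimal sketch: one function owns the storage layout convention.
# The project/assay/date scheme is a hypothetical example.
from datetime import date

def dataset_prefix(project: str, assay: str, run_date: date) -> str:
    """Build the agreed-upon storage prefix for a dataset."""
    return f"{project}/{assay}/{run_date:%Y-%m-%d}"

# dataset_prefix("tumor-lines", "rnaseq", date(2024, 11, 5))
# -> "tumor-lines/rnaseq/2024-11-05"
```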
2d) How will you keep track of what data you have?
It’s tempting to answer that you’ll scroll through folder names, but that only works if you did a really good job on 2c, and it never works for long. You’re going to want something that acts more like a database. A shared spreadsheet works well enough at a small scale. A database works better at larger scales. But that spreadsheet is infinitely better than nothing.
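If (when) you outgrow the spreadsheet, the step up doesn’t have to be heavyweight. Here’s a minimal sketch of a single-table SQLite catalog; the columns are hypothetical, so pick ones that match your science.

```python
# Minimal sketch: a single-table SQLite catalog of datasets.
# The column names are hypothetical examples.
import sqlite3

def open_catalog(path: str = "data_catalog.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS datasets (
               prefix  TEXT PRIMARY KEY,  -- location in shared storage
               project TEXT NOT NULL,
               assay   TEXT NOT NULL,
               created TEXT NOT NULL,     -- ISO date string
               notes   TEXT
           )"""
    )
    return conn

def register(conn, prefix, project, assay, created, notes=""):
    """Record a dataset so teammates can find it without folder-scrolling."""
    conn.execute(
        "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?, ?)",
        (prefix, project, assay, created, notes),
    )
    conn.commit()
```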
3. How will you track the analysis and scripts you’ve run?
This one is essentially about version control, so the answer will probably involve git in some way or another. There’s the eternal question of GitHub vs. GitLab, but that ends up being less consequential. The hard part is deciding how you’ll divide the code into repos, what conventions you’ll use for organizing those repos, and what processes you’ll follow for managing them, such as code reviews. (You are doing code reviews, right?… Right?)
Different parts of your code base will have different characteristics when it comes to how fast the code changes, how you run it, and how reusable it needs to be. There are many different kinds of analysis, each with its own requirements for managing (and reviewing!) the code. That often means different repos, but sometimes not. What matters most is that you and your team are consistent about it.
Conclusion
The thing about these three questions is that deciding on answers and documenting them will get you a long way before you even start doing the technical work. Sure, some of the options will require you to build or deploy some infrastructure. But most of the benefit comes from having consistent practices across the team and a document that spells them out for the new folks.
The sooner you get started the better. (And if you need help getting started, I’ve got you covered.)