Running an analysis job is more than just running analysis
*** A quick ad: I’ve talked to a number of data science/comp bio leaders who have gotten their data system to a good place, but want to make sure they’re ready for whatever’s next. So I created an evaluation rubric based on the framework I’ve been writing about here and I’m using it to roll out a System Evaluation program. Sign up and I’ll use the rubric to create a report that tells you exactly what your team can do today to be ready for tomorrow. ***
Alright, we’re in the midst of a series exploring how biotech startups can build systems that allow their data scientists/computational biologists to focus on the high priority/high value work. Last week, I described the typical nature of data in biotech - small/discrete batches of data that move through a series of processing steps before becoming ready for interpretation and decision making. My goal for this and the next few posts is to begin building a picture of the system components that are needed to support the processes around this data.
But before we get there, we need to understand the processes in a bit more detail. In fact, what you’re about to read may seem like an unnecessarily pedantic level of detail. But I think this is necessary to understand the complexity involved. Otherwise, it’s easy to leave important details out of your system design that will come back and bite you later.
For the Reproducibility module, there’s one main process that we need to understand, which is transforming/analyzing one or more datasets to generate a new dataset. This may mean reproducing a previously run analysis, or (more often) it could be running a novel analysis. I’ll define a single template that accounts for both options.
The Process
I’m going to break this into eleven steps (I told you this would get a bit pedantic), where steps 0-5 set things up, step 6 is the analysis, and steps 7-10 cover saving/registering things and cleaning up.
Step 0: Define the goals and approach
Maybe this doesn’t feel like it should be part of the process - it happens before the process even starts, right? That’s why it’s step 0 instead of 1. Usually this is some kind of informal discussion or an email. But there are a couple of cases where it’s important to deliberately consider this step:
The first is when you automate the analysis. In this case, Step 0 means explicitly defining the trigger that kicks off the process and the logic for how the trigger translates to parameters, source datasets, etc.
The second is when you’re reproducing an earlier analysis, in which case this is where you look up what was done in that earlier analysis. To make this possible, any information that you need to set up the analysis in steps 1-5 needs to have been recorded in steps 7-10 of the original run.
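To make the automated case concrete, here’s a minimal sketch of what Step 0 might produce: a function that turns a trigger event into an analysis plan. Every name and field here is hypothetical; the point is that the goals, parameters, and source datasets get written down explicitly rather than living in someone’s head.

```python
# Hypothetical sketch: translating a trigger event into an analysis plan.
# None of these field names come from a real system; they're placeholders.

def plan_from_trigger(event: dict) -> dict:
    """Turn a 'new batch of data arrived' event into an explicit analysis plan."""
    return {
        "goal": "QC + differential expression for the new batch",
        "pipeline": "rnaseq_de",
        "pipeline_version": "1.4.2",          # pin the version up front
        "source_datasets": [event["batch_id"]],
        "parameters": {
            "reference_genome": "GRCh38",
            "min_read_count": 10,
        },
    }

plan = plan_from_trigger({"batch_id": "BATCH-2024-017"})
print(plan)
```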
Step 1: Load the appropriate compute environment
The compute environment could be anything from your laptop to a virtual machine managed by Airflow running on Kubernetes in AWS/GCP/Azure/etc. In the latter case, this step includes whatever process you need to spin that up. But in both cases, the trickier part is making sure that the right libraries are installed, directory structures are in place, environment variables are set, etc. Again, there’s a range of ways to do this, from a (detailed) document in Sharepoint/Drive/Notion/etc. that describes the process for a standardized environment, to a Docker image. What matters is that it allows you to record what you did so that someone can reproduce it later.
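As one illustration, here’s a minimal sketch of checking a local Python environment against a pinned spec before a run. The package names and versions are placeholders, not a recommendation; the point is that the spec is written down and checkable.

```python
# Hypothetical sketch: verify the environment matches a pinned spec before running.
# The pinned package list below is made up for illustration.
from importlib.metadata import version, PackageNotFoundError

PINNED = {"pandas": "2.2.2", "numpy": "1.26.4", "scikit-learn": "1.4.2"}

def check_environment(pinned: dict) -> list:
    """Return a list of mismatches between installed and pinned package versions."""
    problems = []
    for pkg, wanted in pinned.items():
        try:
            installed = version(pkg)
        except PackageNotFoundError:
            problems.append(f"{pkg}: not installed (want {wanted})")
            continue
        if installed != wanted:
            problems.append(f"{pkg}: installed {installed}, want {wanted}")
    return problems

if __name__ == "__main__":
    for problem in check_environment(PINNED):
        print("MISMATCH:", problem)
```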
Step 2: Prepare the script/notebook/config for the analysis
This might mean anything from opening a Jupyter notebook and writing some code to creating a config for a pre-defined script. If it involves custom code, there’ll probably be some exploration and iteration, in which case this will get entangled with the next few steps. But this is conceptually separate, even if not temporally separate.
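If you’re going the config-for-a-pre-defined-script route, the config itself can be as simple as a small file saved alongside the run. A hypothetical sketch (the script name and every field are made up for illustration):

```python
# Hypothetical sketch: a config for a pre-defined analysis script, saved with the
# run so it can be replayed later. Field names are placeholders.
import json

config = {
    "analysis": "variant_calling",
    "input_dataset": "BATCH-2024-017",
    "parameters": {"min_quality": 30, "caller": "gatk"},
    "output_name": "BATCH-2024-017-variants",
}

with open("run_config.json", "w") as f:
    json.dump(config, f, indent=2)

# The pre-defined script would then be pointed at this file, e.g.:
#   python run_variant_calling.py --config run_config.json
```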
Step 3: Identify and find source datasets
How you do this depends on how you defined the goals and approach in Step 0. But you need to do it one way or another. Nothing controversial here.
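If your datasets live in some kind of index or catalog, this step might be a simple metadata query. A hypothetical sketch, assuming the index is just a JSON file (a real system might use a database or a LIMS API instead):

```python
# Hypothetical sketch: finding source datasets by querying a metadata index.
import json

def find_datasets(index_path: str, **criteria) -> list:
    """Return dataset records whose metadata matches all of the given criteria."""
    with open(index_path) as f:
        index = json.load(f)
    return [d for d in index if all(d.get(k) == v for k, v in criteria.items())]

# e.g. find_datasets("dataset_index.json", assay="rnaseq", project="PRJ-007")
```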
Step 4: Copy data and metadata to the local environment
Conceptually very simple, particularly once you have Step 3 sorted out. But if you’re worried about reproducibility, there can be some tricky technical details. In fact, I wrote a whole post about it.
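One of those tricky details is tying the run to exact file contents rather than to whatever happens to be sitting at a path today. A minimal sketch, assuming a made-up manifest format that lists each file and its checksum:

```python
# Hypothetical sketch: copy a dataset's files locally and verify checksums, so the
# run depends on exact file contents. The manifest format is made up for illustration.
import hashlib
import json
import shutil
from pathlib import Path

def fetch_dataset(manifest_path: str, dest: str) -> None:
    """Copy files listed in a manifest into dest and check their sha256 digests."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    manifest = json.loads(Path(manifest_path).read_text())
    for entry in manifest["files"]:
        src = Path(entry["path"])
        target = dest_dir / src.name
        shutil.copy2(src, target)
        digest = hashlib.sha256(target.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            raise ValueError(f"Checksum mismatch for {src.name}")
```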
Step 5: Create a local dataset
Depending on how you’re tracking your datasets, this might just mean creating a local directory, or it could involve something more. Shouldn’t be hard, but we can’t forget about it.
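A hypothetical sketch of the minimal version: a directory plus a small metadata stub, so the outputs have a home before the analysis starts. All names are placeholders.

```python
# Hypothetical sketch: create a local home for the new dataset with a metadata stub.
import json
from datetime import datetime, timezone
from pathlib import Path

def create_local_dataset(name: str, root: str = "datasets") -> Path:
    """Create a directory for the new dataset and drop a small metadata stub in it."""
    ds_dir = Path(root) / name
    ds_dir.mkdir(parents=True, exist_ok=False)
    stub = {
        "name": name,
        "created": datetime.now(timezone.utc).isoformat(),
        "status": "in_progress",
    }
    (ds_dir / "dataset.json").write_text(json.dumps(stub, indent=2))
    return ds_dir
```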
Step 6: Run the transformation/analysis
This is the step most people think of when they think about doing analysis.
Step 7: Push the new data to the data store
Essentially the reverse of Step 4. If you have that covered, you’re probably good with this one. Just make sure you know where you’re supposed to copy it to.
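A minimal sketch, assuming the data store is just a mounted path (swap in your object store’s upload call if that’s what you use; the names here are made up):

```python
# Hypothetical sketch: push the finished local dataset to the shared data store.
import shutil
from pathlib import Path

def push_dataset(local_dir: str, store_root: str) -> Path:
    """Copy the local dataset directory into the shared store, refusing to overwrite."""
    local = Path(local_dir)
    target = Path(store_root) / local.name
    if target.exists():
        raise FileExistsError(f"{target} already exists; refusing to overwrite")
    shutil.copytree(local, target)
    return target
```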
Step 8: Register the dataset metadata so you can find it again
If you have a good system for managing your data like code, this and Step 7 should really be one step. But if you don’t, it’s really easy to forget about this one. So this is here to remind you.
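A hypothetical sketch of doing Steps 7 and 8 together, assuming a JSON-file index (a real system might write to a database or catalog service instead; every field name is a placeholder):

```python
# Hypothetical sketch: register the new dataset in a searchable index right after
# the push, so registration can't be forgotten.
import json
from pathlib import Path

def register_dataset(index_path: str, record: dict) -> None:
    """Append a dataset record to a JSON index, creating the index if needed."""
    index_file = Path(index_path)
    index = json.loads(index_file.read_text()) if index_file.exists() else []
    index.append(record)
    index_file.write_text(json.dumps(index, indent=2))

# e.g. register_dataset("dataset_index.json",
#                       {"name": "BATCH-2024-017-variants",
#                        "project": "PRJ-007",
#                        "location": "/data/store/BATCH-2024-017-variants"})
```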
Step 9: Register the process metadata so that the process is reproducible
This is one that even more people forget about. As I mentioned above, the goal is to allow you to re-create steps 1-6 at some point in the future. And even though you could probably do it blindfolded right now, in three months you will have zero recollection of any of this. So Step 9 is a favor to future you. But you should approach it as if future you is someone with zero context. (Because you will be.) Any information that a person with zero context would need to recreate Steps 1-6 needs to be recorded in a place where a person with zero context will be able to find it.
Make sure your notebook is in version control, in a place you’ll be able to find it. Make sure you recorded what libraries you installed in Step 1. Make sure you can find the input data again.
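A hypothetical sketch of what capturing that might look like, assuming the code lives in git and the environment is a Python one. The field names are placeholders; the point is that code version, environment, inputs, and parameters all end up in one findable record.

```python
# Hypothetical sketch: record enough process metadata to recreate Steps 1-6 later.
import json
import subprocess
import sys
from datetime import datetime, timezone

def capture_process_metadata(config: dict, input_datasets: list, output_dataset: str) -> dict:
    """Collect the code version, environment, inputs, and parameters for this run."""
    commit = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True).stdout.strip()
    packages = subprocess.run([sys.executable, "-m", "pip", "freeze"],
                              capture_output=True, text=True).stdout.splitlines()
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "code_commit": commit,
        "environment": packages,
        "inputs": input_datasets,
        "output": output_dataset,
        "parameters": config,
    }

# Write it next to the dataset metadata so future-you can actually find it:
# with open("process_metadata.json", "w") as f:
#     json.dump(capture_process_metadata(config, ["BATCH-2024-017"],
#                                        "BATCH-2024-017-variants"), f, indent=2)
```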
Step 10: Clean up the compute environment
If you did this on your laptop, this may be a no-op - you can probably keep those folders around. If you did this in the cloud, you may want to turn off your VM before you give your finance guy a heart attack. This step doesn’t always matter, but when it does you shouldn’t forget about it.
Conclusion
So, this turned out a lot longer than I like to make these posts. But as I noted above, my goal was to be thorough, at the risk of being pedantic. Plus I made a promise to keep this newsletter mundane. If you made it through all that, well done. Next week I’ll use the steps here to identify the conceptual components that you’ll need to support with technical components to make this process work. Stay tuned!