Don't build a wall between exploratory and canned analysis

Aug 02, 2023

A couple weeks ago, I argued that supporting analysis within a biotech isn’t a monolithic problem - the specifics and the context vary along a number of parameters, and I discussed four of them. For most of these, you’re going to pick one option and stick with it, at least for each team in your organization. But there’s one dimension where you not only need to support the whole spectrum, but also a transition across the project’s lifetime: the degree to which the analysis is exploratory vs canned.

The first time you try to answer a particular type of question, most of the analysis is going to be exploratory - figuring out what data to use, what kind of signal is in the data, and even understanding what the question really is. The next time you need to answer a similar question, you’ll probably still need to do some exploratory analysis, but it will be much less - mostly you’ll start from what you did last time and make the appropriate tweaks and changes. The next time you’ll make even fewer changes, and so on.

If you’re a keyboard-first, code-writing type, you’ll probably be doing all this in Jupyter notebooks or something similar. For each iteration, you make a copy of the last notebook and edit it as appropriate. If, on the other hand, you’re of the mouse-first, minimal-code variety, you’re probably doing this in something like Spotfire, or maybe even Excel. Again, you make a copy of your spreadsheet/workbook each time and tweak away. Either one lets you work quickly, transitioning between exploration and analysis, with the flexibility to answer exactly the question you’re interested in.

This is all well and good until you’ve gone through this a handful, or maybe even a few dozen, times. At some point, you start to notice that copying and updating the notebook is more work than figuring out what tweaks to make. But more importantly, you start losing track of what tweaks you made on previous iterations, which means you can’t reliably compare the different results. Suddenly, the flexibility that made this approach so appealing is starting to look like a liability.

You’ve now shifted to a different point in the grid of analysis use cases, transitioning from exploration-heavy analysis to mostly-canned analysis. The good news is there are once again tools that will let you build repeatable, parameterizable analyses - a crank that you can turn to produce reliable, traceable results. The question is whether you’ll be able to reuse the work that you did in the exploratory phase, or whether you’ll have to re-implement it from scratch. If there’s too much separation between how you manage exploratory analysis and how you implement canned analysis - if you’ve unintentionally built a wall between them - then you’re going to waste a lot of effort re-implementing it.
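To make the "crank" idea concrete, here is a minimal sketch in Python of one way a notebook's hand-edited tweaks might be pulled into explicit parameters, so the exploratory code gets reused rather than re-implemented. Everything in it is an assumption made for illustration: the `AnalysisParams` fields, the `run_analysis` function, and the toy filter-and-normalize logic are hypothetical stand-ins, not a reference to any particular tool.

```python
from dataclasses import dataclass, asdict
from statistics import mean
import json


@dataclass(frozen=True)
class AnalysisParams:
    # Hypothetical knobs that would otherwise be edited by hand
    # in each copy of the notebook.
    min_signal: float = 0.0
    normalize: bool = True


def run_analysis(values, params):
    """Toy analysis: filter, optionally normalize, then summarize."""
    # Keep only measurements at or above the threshold.
    kept = [v for v in values if v >= params.min_signal]
    # Optionally scale by the largest kept value (skipped if it is zero).
    if params.normalize and kept and max(kept) != 0:
        top = max(kept)
        kept = [v / top for v in kept]
    # Store the parameters alongside the result so each run is traceable.
    return {
        "params": asdict(params),
        "n_kept": len(kept),
        "mean": mean(kept) if kept else None,
    }


result = run_analysis([0.2, 1.0, 2.0, 4.0], AnalysisParams(min_signal=1.0))
print(json.dumps(result, sort_keys=True))
```

Because each result carries the parameters that produced it, two runs can be compared without having to remember which tweaks went into which copy. The same function can still be called from a notebook during exploration, which is the point: the wall never gets built.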

In other words, it isn’t enough to independently support both exploratory and canned analysis. You also need to plan for and support the transition from one to the other. There are tools out there that do this to varying degrees, though there’s also plenty of room for improvement. However you do it, your overall system needs to support this process.

Scaling Biotech
