Supporting analysis isn't one problem - it's 16(ish)
Let’s keep going with my series on the kinds of functionality that biotech organizations need when it comes to data, working backwards from filing an IND. Last time I wrote about storing and tracing results from analysis, so this week I want to explore the step where you actually do the analysis.
Supporting (and doing) data analysis within a biotech isn’t a single problem - it can take many different shapes, depending on many different factors. So this week, I want to go through some of those factors and explain why they’re important. Here are the questions that define them:
Is the analysis custom or generic?
If you’re running the primary analysis on bulk RNA-seq or flow cytometry data, you’re not going to write the analysis from scratch. There are multiple existing options for scripts, some of them open source. But for the downstream stages of analysis, and for many other types of data, there probably isn’t an existing tool that matches your novel science, and that requires a different set of tools. This often varies across phases of the same experiment: earlier stages are more likely to be generic, while later ones are almost always custom.
Is the analysis exploratory or canned?
By “canned”, I mean analysis that follows a pattern or workflow that you’ve done repeatedly in the past. You know exactly what to do and how it will go. You may even have a script that does it while you get a coffee. That’s a very different situation than when it’s the first time you’ve seen this kind of data. So the same tools are unlikely to work for both. Most organizations will need to handle both kinds, even if the exploratory analysis is just part of developing the canned analysis.
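To make the distinction concrete, a “canned” analysis usually ends up as a short script where every step is frozen in place, and only the input data varies run to run. Here’s a minimal sketch of that shape - the counts-per-million normalization and the cutoff value are hypothetical placeholders, not any particular assay’s real parameters:

```python
# Hypothetical "canned" analysis: every step is fixed in advance,
# so the only thing that varies between runs is the data itself.

def run_canned_analysis(counts: dict[str, float], cutoff: float = 10.0) -> list[str]:
    """Normalize raw counts to counts-per-million, then return the
    samples that pass a fixed threshold -- same steps, every run."""
    total = sum(counts.values())
    cpm = {sample: c / total * 1_000_000 for sample, c in counts.items()}
    return sorted(s for s, v in cpm.items() if v >= cutoff)

# Exploratory analysis, by contrast, is where you'd still be deciding
# interactively *whether* CPM and this cutoff are even the right steps.
hits = run_canned_analysis({"s1": 5.0, "s2": 500.0, "s3": 0.001})
```

The defining feature isn’t the specific math - it’s that nothing in the function changes between runs, so it can execute unattended while you get that coffee.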
Was the analyst involved in designing and/or carrying out the experiment?
Or, in other words: how hard will the people who did the experiment need to work to communicate the experimental design and context? If the analyst was involved in the experiment, or actually ran it, they already know most of what they need. But when bench teams work with a separate data team, all of that context has to be deliberately communicated. Here, many organizations set one or the other as the expectation, and enforce it with headcount: If there are enough analysts to manage the data as it’s generated, they will expect to be involved in every analysis project. If there aren’t, the bench teams will learn to just do it themselves.
Does the analyst prefer to write code or to click around a UI?
The importance of this should be obvious, since it determines whether you build a UI or a code library. But it also affects which problems are straightforward and which are difficult, as we started to see last time. This is highly, but not entirely, correlated with how involved the analyst was in the experiment: Bench scientists who end up doing analysis tend to use UI-based tools, though there are plenty of exceptions, and the number grows every year. Over the next 5+ years, I expect to see many more folks who are equally comfortable in a wet lab or a Jupyter notebook, which means more analysis done in code rather than in a UI.
Before you start designing tools and processes to support data analysis within your biotech, you need to pick what you want to build for in terms of these four factors. Since each factor splits roughly two ways, that’s sixteen(ish) combinations - hence the title. In the upcoming weeks, I’ll go into more detail about what this looks like.