Reader question: Pipeline frameworks for NGS and non-NGS workflows
*** Before we get started, a few quick plugs:
The deadline to fill out the Bits in Bio member survey has been extended into early December, so if you haven’t filled it out yet, there’s still time. It only takes a couple minutes and will help us make BiB into a community that supports your needs.
I recently appeared on the Data in Biotech podcast to talk about all the usual topics. Check out my episode here, then listen to all their other great episodes.
***
Natalie Ma recently sent me a really interesting question about workflow frameworks, and since she happens to be a subscriber to the newsletter, I thought this might be an excuse to start having occasional posts where I answer reader questions. If you have a question or a topic that you’d like me to write about, send it to scalingbiotech@substack.com and (assuming it fits with my usual themes) I’ll write an answer in an upcoming post. (You can choose to be named or anonymous. Thank you, Natalie, for agreeing to let me use your name and your question.)
Here’s the question (slightly paraphrased): How should biotech teams think about bioinformatics pipeline/workflow frameworks, particularly Nextflow, as they scale?
It just so happens that most of what I know about bioinformatics came from trying to migrate a bulk RNA-seq workflow from a collection of shell scripts into Luigi, a Python-based workflow framework. (Emphasis on “trying”.) The advice that I ignored before starting this project was to just use an off-the-shelf workflow that had already been implemented in Nextflow. If I could do it over again, I would take the advice. But I think it’s worth exploring why I initially ignored it before getting into why that was probably a bad idea. (I also asked this question on the Bits in Bio #embedded-data-teams Slack channel to make sure I wasn’t too far off the mark. Head over there to see what everyone else said.)
I took on this ill-fated project early in my biotech career for a startup that was training ML models using RNA sequencing data. We wanted to automate the primary analysis pipeline, then extend that workflow to include some of the early stages of ML feature definition and training. So it seemed to make sense to run everything with the same framework. And when it came to the decision between writing all the ML workflows in Groovy for Nextflow or rewriting the bioinformatics workflow in a Python-based framework, Python won.
In hindsight, I think the better question to ask was whether having a single framework end-to-end is more important than having the appropriate framework for each piece of the pipeline. Nextflow is optimized for orchestrating long-running, high-memory, monolithic pre-built binaries: the framework provides fault tolerance but largely treats each step as a black box. Frameworks like Airflow, Prefect, and Luigi are designed for running custom code in parallel. These frameworks don’t just run black boxes - they package your code for you, removing a lot of the development overhead and leading to a more integrated result.
It’s not that hard to have a pipeline in one framework trigger a follow-on workflow in a different framework. Yes, it adds complexity and overhead, as does maintaining the infrastructure for each framework. But writing a workflow in a framework that wasn’t designed for it arguably creates much more. It’s a trade-off.
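The hand-off can be as simple as one Python task shelling out to `nextflow run` and picking up the results directory when it returns. Here’s a hedged sketch: `nf-core/rnaseq` is a real community pipeline (with real `--input`/`--outdir` parameters), but the helper functions are hypothetical names of my own.

```python
# Sketch of a cross-framework hand-off: a step in a Python orchestrator
# (Airflow/Prefect/Luigi-style) launches a Nextflow pipeline as a
# subprocess, then downstream Python steps consume its output directory.
import subprocess


def nextflow_command(pipeline: str, params: dict, profile: str = "docker") -> list:
    """Build the `nextflow run` argv list; kept pure so it's easy to test."""
    cmd = ["nextflow", "run", pipeline, "-profile", profile]
    for key, value in sorted(params.items()):
        cmd.extend([f"--{key}", str(value)])
    return cmd


def run_pipeline(pipeline: str, params: dict) -> None:
    """Invoke Nextflow; check=True raises if the pipeline exits nonzero,
    so the orchestrator's own retry/failure handling kicks in."""
    subprocess.run(nextflow_command(pipeline, params), check=True)


if __name__ == "__main__":
    # e.g. run the nf-core bulk RNA-seq pipeline, then continue with the
    # ML feature steps in the Python framework once it returns.
    run_pipeline("nf-core/rnaseq",
                 {"input": "samplesheet.csv", "outdir": "results"})
```

The overhead lives at this one seam (and in keeping both frameworks’ infrastructure running), rather than being smeared across every bioinformatics step you’d otherwise have to reimplement.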
So in the end there isn’t a simple answer - which route will lead to less overhead and fewer headaches depends on a lot of factors. But I can tell you from personal experience that it’s easy to underestimate the effort required to implement (and more importantly debug) a bioinformatics workflow in a new framework. Think twice before you decide it’s worth it.