This week, continuing with the theme of problems that many biotechs face at specific points in their evolution, I’ve got one that feels deceptively easy and therefore often catches teams off guard. Whether we’re talking about small molecules, biologics or something else, most biotechs will eventually end up with a collection of things that they want to screen with different assays, then pick some to move to the next stage. In drug discovery, these stages are often called hits or leads or something along those lines. So I’m going to call the place where you collect and review this data a hit dashboard.
On its surface, this shouldn’t be that hard. A hit dashboard is just a table, right? A row for each molecule/sequence/etc. and a column for each assay/readout. There are high-tech and low-tech ways to create and share/display a table like this. You’ll probably want it to be sortable. Maybe some images of molecular structures. But at the end of the day, it shouldn’t be that hard.
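To make that concrete, here's a minimal sketch of the "just a table" version in code. Every name here is made up, and it assumes each assay's results are already in a tidy per-compound table:

```python
# A minimal sketch of the "just a table" version. All compound IDs,
# column names, and values here are hypothetical.
import pandas as pd

# One tidy table per assay: one row per compound, one readout column.
binding = pd.DataFrame({"compound_id": ["C-001", "C-002"], "kd_nm": [12.5, 340.0]})
potency = pd.DataFrame({"compound_id": ["C-001", "C-003"], "ic50_um": [0.8, 5.2]})

# The hit dashboard: one row per compound, one column per assay readout.
# An outer merge keeps compounds that are missing from some assays.
dashboard = binding.merge(potency, on="compound_id", how="outer")

# "Sortable" is the easy part once the table exists.
print(dashboard.sort_values("ic50_um"))
```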
And yet, it still manages to trip up many biotech data teams.
Of course, the hard part isn’t making the table. The hard part is getting the data into the table. But even there it can be deceptively difficult. Tracking down a single dataset may not be that hard, but the whole point is to look at multiple assays, and thus multiple datasets.
These different datasets were collected at different times from different kinds of experiments, and most importantly by different people. Each person was focused on collecting data to answer a single specific question. Now you want to use their data to help answer a (slightly) different question. And that’s where it gets tricky.
The first layer of the problem is where they put the data. If you’re lucky it’s in a shared drive somewhere. If you’re slightly less lucky, it’s on their laptop and they remember which folder it’s in. If you’re even less lucky, you may have to copy/paste from a table in a slide deck. So if you want to automate this hit dashboard, you’re going to have to somehow get the data into a consistent place.
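If the files do live on a shared drive, even this first step can be scripted. A hedged sketch, with hypothetical paths and a hypothetical file-naming convention:

```python
# A sketch of step one: sweep scattered result files into one place.
# Assumes results eventually land as CSVs under a shared drive; the
# mount point, glob pattern, and destination are all hypothetical.
from pathlib import Path
import shutil

SHARED_DRIVE = Path("/mnt/shared")          # hypothetical mount point
STAGING = Path("/data/hit_dashboard/raw")   # one consistent landing zone
STAGING.mkdir(parents=True, exist_ok=True)

for src in SHARED_DRIVE.rglob("*_results.csv"):
    # Prefix with the parent folder so files from different teams don't collide.
    dest = STAGING / f"{src.parent.name}__{src.name}"
    if not dest.exists():
        shutil.copy2(src, dest)
```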
But that turns out to be the easy part. We haven’t even dug into data formats yet. Some of the results are just a single number. Some of them are two or more numbers that only make sense together. Almost all of these numbers need to be normalized based on a positive and/or negative control, which may or may not be comparable across batches. And this is even before we get into concentration-response curves, outliers and IC50/EC50s.
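Here's roughly what the normalization step looks like for the simplest case, a single-number readout expressed as percent inhibition against per-plate controls. The column names and the direction of the signal are assumptions, and this deliberately stops short of curve fitting and IC50 estimation:

```python
# A sketch of control-based normalization, using percent inhibition as
# one common convention. Assumes each plate carries its own positive and
# negative control wells; column names are hypothetical.
import pandas as pd

def normalize_plate(plate: pd.DataFrame) -> pd.DataFrame:
    # Per-plate control means, so the normalization stays within-batch.
    neg = plate.loc[plate["well_type"] == "negative_control", "signal"].mean()
    pos = plate.loc[plate["well_type"] == "positive_control", "signal"].mean()
    plate = plate.copy()
    # Convention here: negative control = 0% inhibition, positive = 100%.
    # Flip the formula if your assay signal runs in the other direction.
    plate["pct_inhibition"] = 100 * (neg - plate["signal"]) / (neg - pos)
    return plate

# raw = pd.read_csv("plate_042.csv")  # hypothetical input file
# normalized = normalize_plate(raw)
```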
To be clear, these are all issues that can be overcome and that are regularly overcome one way or another. The problem is that it’s just complicated enough to require careful thought each time you add a new column to the “just a table”. If you have a more high-tech solution then the data scientist or engineer who knows how to do it becomes an unwitting gatekeeper. And if you try to replace them with a self-serve option, you run the risk of re-inventing Spotfire/Tableau/Excel.
So the much more common solution is that the bench scientist who needs that information in a hurry ends up manually copy/pasting it into Excel. (At a few startups I’ve talked to, it’s the CEO who does this.)
This gets the job done, which is why they keep doing it. But even if you overlook the amount of time it takes, the number of opportunities for error is astounding. It can’t be automated, so it will often be out of date. And, of course, there’s no way to verify the work, let alone any sense of reproducibility.
So yeah, the manual approach works. And there doesn’t seem to be an easy alternative. But if we’re going to build data-driven biotechs, we need to find a better way.