It's OK to put your data in multiple places. (In fact, it's good.)
I missed a step! As you may already know, I’ve been going through the different data-related use cases that a biotech startup’s software needs to support. But I’m going through them backwards, starting from filing an IND, with the idea that we can better understand each step because we’ve already covered the step it feeds into. Well, it seems that’s also making it trickier to keep track of the steps. Last week I wrote about finding data, and before that I wrote about doing analysis with that data. But in between is the step I missed - accessing the data. So I’ll cover that this week.
We all know how important it is to have the right data architecture, including the right kinds of databases, the right schema, etc. But how do you tell if it’s “right”? Well, as far as I’m concerned, this is it - the whole point of carefully designing your data architecture is so that you can access the data quickly and efficiently for the particular analysis you want to do.
But of course, as we recently discussed, analysis isn’t just one use case, it’s 16(ish)! And even if 16 isn’t the right number, the point is that there are a lot of different parameters, each of which will have an impact on how the data should be accessed, and thus how the data architecture should be designed.
I’m not going to go through all the different possibilities, since that would make this much longer than you probably want to read or I have time to write. But I do want to make two points:
First, I think that the most important distinction when it comes to data access is batch vs. transactional access. If you want to look up all the assays that were run on a single compound, that’s transactional - you’re touching multiple tables (assays, compounds, maybe a few in between) but only a small number of entries in each one. If you want to train an ML model on compound structures, that’s batch - you need to look at every single entry in the table, all together.
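To make the transactional case concrete, here’s a minimal sketch using an in-memory SQLite database. The schema and all the table/column names (compounds, assays, results) are made up for illustration - a real registration system would be much richer - but the access pattern is the point: a few indexed lookups across several tables, returning a handful of rows.

```python
import sqlite3

# Hypothetical minimal schema -- real compound/assay tables would be richer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compounds (id INTEGER PRIMARY KEY, smiles TEXT);
CREATE TABLE assays (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE results (compound_id INTEGER REFERENCES compounds(id),
                      assay_id INTEGER REFERENCES assays(id),
                      value REAL);
""")
conn.executemany("INSERT INTO compounds VALUES (?, ?)",
                 [(1, "CCO"), (2, "c1ccccc1")])
conn.executemany("INSERT INTO assays VALUES (?, ?)",
                 [(10, "solubility"), (11, "permeability")])
conn.executemany("INSERT INTO results VALUES (?, ?, ?)",
                 [(1, 10, 0.4), (1, 11, 1.2), (2, 10, 0.9)])

# Transactional access: join a few tables, but touch only the rows
# for one compound -- a small, targeted read.
rows = conn.execute("""
    SELECT a.name, r.value
    FROM results r
    JOIN assays a ON a.id = r.assay_id
    WHERE r.compound_id = ?
""", (1,)).fetchall()
print(rows)
```

The batch case is the opposite shape: no WHERE clause, no joins per row - just a full scan of one table, which is exactly what databases tuned for point lookups are slowest at.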
Each of these two access types is best supported by storing the data in a completely different way. For transactional access, you’ll want a proper database that’s designed for these kinds of lookups. For batch access, you’re more likely to want flat files that can be read into memory locally or sent to remote workers. So that’s fine if you know you’re only going to do one or the other. But of course that’s never the case - you’re going to need to cover both.
And that’s OK because of the second point: it’s perfectly fine to store multiple copies of the same data in different formats. You’ll want to be clear about the source of truth, and if the data is changing you’ll want to be very restrictive about how those changes propagate between copies. But the alternative - trying to find a single architecture that supports all 16(ish) types of use cases - is effectively impossible.
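One way to keep that discipline is to make propagation strictly one-way: the derived copy is regenerated from the source of truth, never edited directly. A minimal sketch, again with invented names and an in-memory database standing in for the real one:

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical setup: the database is the source of truth; the CSV
# snapshot is a derived copy for batch consumers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (id INTEGER PRIMARY KEY, smiles TEXT)")
conn.executemany("INSERT INTO compounds VALUES (?, ?)",
                 [(1, "CCO"), (2, "c1ccccc1")])

def export_snapshot(conn, path):
    """One-way propagation: overwrite the copy from the source of truth."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "smiles"])
        writer.writerows(
            conn.execute("SELECT id, smiles FROM compounds ORDER BY id"))

snapshot = Path(tempfile.mkdtemp()) / "compounds.csv"
export_snapshot(conn, snapshot)

# A change lands in the source of truth...
conn.execute("INSERT INTO compounds VALUES (3, 'CC(=O)O')")
# ...and reaches the copy only by regenerating it, never by editing the CSV.
export_snapshot(conn, snapshot)
```

Regenerating wholesale is wasteful for big tables - incremental syncs exist - but it’s the simplest way to guarantee the copies can’t silently diverge.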
Don’t even try.