There's more to keep track of than just the answer.
*** A quick ad/reminder: I recently rolled out a System Evaluation program that uses a rubric based on the system template I’ve been writing about to create a report that tells you exactly what your team can do today to be ready for tomorrow. Apply today to see if this program is right for your biotech startup. ***
Last week, I broke down the process of running an analysis/data transformation job into eleven steps. This week, I’m going to start breaking these steps down into the technical tools and manual processes needed to support them. These fall into two basic categories: 1) the things that need to be recorded, and thus recalled later on, and 2) the tools that either support or automate the process of storing and later retrieving that information.
The types of tools that can be used and the specifics of how data is stored are very implementation-dependent - they evolve as you make each step more deliberate, consistent, and automated. But across these different levels of implementation, what stays the same is the content of the information being stored. So this week, since I’m still talking about abstract components, that’s what I want to explore. I’ll go a little into implementation options for the sake of illustration, but I’ll mostly save that for next week.
Step 0: Define the goals and approach
There are two parts to this step: first, identifying the criteria that trigger the need to run the job; second, translating those criteria into goals and an approach. So we need a component for keeping track of events across the system, along with a set of rules for how this component translates those events into analysis triggers. We’ll call this:
Component: Event Queue
In most organizations, this is mostly implicit - it’s institutional knowledge distributed across a nebulous collection of teams and employees, coordinated by emails, meetings, and chance encounters at the water cooler. That works well enough for most types of events most of the time. But once these events become common enough that they need to be handled consistently, it’s time to start thinking about a more deliberate implementation.
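To make that concrete, here’s a rough sketch of what a more deliberate implementation could look like in Python. It’s illustrative only - the event kinds, job names, and trigger rule are all made up, and a real implementation would likely sit on top of a message queue or workflow tool rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Event:
    """A single thing that happened somewhere in the system."""
    kind: str        # e.g. "sequencing_run_finished" (hypothetical)
    payload: dict
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EventQueue:
    """Minimal in-memory event queue with rules that turn events into analysis triggers."""

    def __init__(self):
        self.events: list[Event] = []
        self.rules = []  # list of (predicate, job_name) pairs

    def add_rule(self, predicate, job_name: str):
        self.rules.append((predicate, job_name))

    def publish(self, event: Event) -> list[str]:
        """Record the event and return the names of any analysis jobs it triggers."""
        self.events.append(event)
        return [job for predicate, job in self.rules if predicate(event)]


# Hypothetical usage: a finished sequencing run triggers a QC report job.
queue = EventQueue()
queue.add_rule(lambda e: e.kind == "sequencing_run_finished", "qc_report")
triggered = queue.publish(Event("sequencing_run_finished", {"run_id": "R123"}))
print(triggered)  # ["qc_report"]
```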
If the goal is to reproduce an earlier analysis job, you’ll also need to look up that information. But I’m going to leave the discussion of that component to Step 9.
Step 1: Load the appropriate compute environment
Which compute environment you need depends on what kind of analysis you’re doing and whether you’re trying to reproduce an earlier analysis, so you need a way to look up and load a consistent compute environment. We’ll call this:
Component: (Compute) Environment Registry
As with the event queue, this often starts with institutional knowledge, though a relatively easy next step is a shared document with instructions for setting up your laptop. This can evolve further into installation scripts, and even Docker images or a managed, cloud-based compute platform.
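At its core, though, the registry is just a lookup from an environment name to whatever defines that environment. Here’s a minimal sketch of that idea - the environment names and image tags are hypothetical:

```python
# A minimal environment registry: a version-controlled mapping from environment
# names to the Docker images (or setup scripts) that define them.
# All names and image tags below are hypothetical.
ENVIRONMENTS = {
    "rnaseq-v1": {"docker_image": "registry.example.com/rnaseq:1.0", "python": "3.10"},
    "rnaseq-v2": {"docker_image": "registry.example.com/rnaseq:2.1", "python": "3.11"},
}


def resolve_environment(name: str) -> dict:
    """Look up the exact environment definition for a given analysis."""
    try:
        return ENVIRONMENTS[name]
    except KeyError:
        raise KeyError(f"No registered environment named {name!r}") from None


print(resolve_environment("rnaseq-v2")["docker_image"])
```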
Step 2: Prepare the script/notebook/config for the analysis
If you’re starting from scratch, this is just a matter of creating a new script or notebook. Eventually, though, you’re going to want to start from reusable scripts or at least notebook templates. We’ll call the place where you look these up:
Component: Job Registry
This is probably going to be one or more git repos, and it doesn’t have to evolve much beyond that. But the level of organization and consistency that you impose on these repos can make a big difference, particularly if you organize them to be more compatible with the other components of the system.
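One way to impose that consistency is a small lookup table that says, for each job, where the code lives and which environment it expects - the kind of thing you might keep in a config file next to your repos. The repo URLs, entrypoints, and job names here are all hypothetical:

```python
# A sketch of a job registry: for each analysis job, where the code lives and
# which registered environment it expects. All values are hypothetical.
JOBS = {
    "qc_report": {
        "repo": "git@github.com:example-bio/analysis-jobs.git",
        "entrypoint": "jobs/qc_report/run.py",
        "environment": "rnaseq-v2",  # ties back to the Environment Registry
    },
    "variant_calling": {
        "repo": "git@github.com:example-bio/analysis-jobs.git",
        "entrypoint": "jobs/variant_calling/run.py",
        "environment": "rnaseq-v1",
    },
}


def lookup_job(name: str) -> dict:
    """Find the code and environment you need to run a given analysis."""
    return JOBS[name]
```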
Step 3: Identify and find source datasets
How you do this depends on how you defined the goals and approach in Step 0. But no matter what, you’re going to need:
Component: Data Registry
This often starts out as nebulous institutional knowledge - whoever wrote/created the data hopefully remembers where they put it. (They can definitely find it if you give them long enough, right?) More deliberate implementations can range from a shared spreadsheet to a proper data catalog or something like Quilt.
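Whatever the implementation, the content is the same: a searchable collection of metadata records, kept separate from the data itself. Here’s a toy sketch - the field names and the example record are made up:

```python
# A sketch of a data registry: dataset metadata you can search, kept separate
# from the data itself. Field names and the example record are hypothetical.
import json

registry = [
    {
        "dataset_id": "ds-0042",
        "name": "mouse_liver_rnaseq_counts",
        "created_by": "asmith",
        "created_at": "2023-05-12",
        "location": "shared-drive://datasets/ds-0042/",  # points into the Data Store
        "produced_by_run": "run-0117",                    # points into the Run Registry
    },
]


def find_datasets(**filters) -> list[dict]:
    """Return every registered dataset whose metadata matches all of the filters."""
    return [d for d in registry if all(d.get(k) == v for k, v in filters.items())]


print(json.dumps(find_datasets(created_by="asmith"), indent=2))
```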
Step 4: Copy data and metadata to the local environment
You’ve found the data, but we need a name for the place you’re copying it from. That’s:
Component: Data Store
This is a shared drive where the actual data of the datasets is stored (as opposed to the metadata in the Data Registry). Note that I’m also distinguishing this from a Database, which is where data from different datasets is aggregated into repeatedly updated tables. Because the database is mostly for aggregating datasets, I bucket it under decision making, which I’ll cover in a later post.
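For this step, all you really need is a way to copy a dataset from the store into your local working directory, using the location recorded in the Data Registry. A minimal sketch, assuming the store is just a mounted shared drive (the paths are hypothetical; an object store like S3 would need different calls):

```python
# Pull a dataset from the shared data store into a local working directory.
# Paths below are hypothetical.
import shutil
from pathlib import Path


def fetch_dataset(store_path: str, local_dir: str) -> Path:
    """Copy a dataset's files from the shared data store to the local environment."""
    destination = Path(local_dir) / Path(store_path).name
    shutil.copytree(store_path, destination, dirs_exist_ok=True)
    return destination


# Hypothetical usage:
# local_copy = fetch_dataset("/mnt/shared/datasets/ds-0042", "./work")
```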
Step 5: Create a local dataset
Since this doesn’t change any shared resources, it doesn’t require a new system component. The same is true for the next step:
Step 6: Run the transformation/analysis
Step 7: Push the new data to the data store
This uses the Data Store again, but we already defined that above.
Step 8: Register the dataset metadata so you can find it again
This uses the Data Registry again.
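Continuing the Data Registry sketch from Step 3, registering the new dataset is just adding another metadata record - all of the field values here are hypothetical:

```python
# Register the newly produced dataset so it can be found later.
# (Builds on the `registry` list from the Data Registry sketch above.)
registry.append({
    "dataset_id": "ds-0043",
    "name": "mouse_liver_rnaseq_qc_report",
    "created_by": "asmith",
    "created_at": "2023-05-13",
    "location": "shared-drive://datasets/ds-0043/",
    "produced_by_run": "run-0118",
})
```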
Step 9: Register the process metadata so that the process is reproducible
We need a way to record how the analysis was carried out and associate this with the dataset it produced. I alluded to this component in Step 0, but now it’s time to define it:
Component: Run Registry
Like the Data Registry, this usually starts as institutional knowledge. More deliberate implementations depend on how the analysis was run. If it was all done in notebooks, you can just make sure the notebooks are checked into version control in a place where you can find them. (And make sure your Environment Registry is robust enough that you can run them.) If you’re using scripts, it’s a matter of storing the parameters that were passed to the script - in version control, in a spreadsheet or database, or somewhere else.
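Whichever implementation you pick, the record itself looks roughly the same: what was run, with which parameters, on which inputs, producing which output, in which environment. Here’s one sketch of that record, written as an append-only log - the field values and file name are made up:

```python
# A sketch of a run registry entry: everything needed to reproduce an analysis,
# tied to the dataset it produced. All field values are hypothetical.
import json
from datetime import datetime, timezone


def record_run(job_name: str, parameters: dict, input_datasets: list[str],
               output_dataset: str, environment: str,
               registry_path: str = "runs.jsonl"):
    """Append a single run record to a newline-delimited JSON log."""
    record = {
        "job": job_name,
        "parameters": parameters,
        "inputs": input_datasets,
        "output": output_dataset,
        "environment": environment,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(registry_path, "a") as f:
        f.write(json.dumps(record) + "\n")


# Hypothetical usage:
# record_run("qc_report", {"min_quality": 30}, ["ds-0042"], "ds-0043", "rnaseq-v2")
```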
Step 10: Clean up the compute environment
Like Steps 5 and 6, this doesn’t use any shared resources, so there’s no new component to define.
But there is one more step that I realized I missed last week:
(Bonus) Step 11: Register process completion
Tell the Event Queue that you’re done. If the event queue is the usual nebulous pool of institutional knowledge and conventions, then this probably means sending someone an email. If it’s something more structured, even better.
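In terms of the Event Queue sketch from Step 0, registering completion is just publishing one more event (the event kind and payload here are, again, hypothetical), so any rules that watch for it can trigger downstream work:

```python
# Reuses the Event and EventQueue classes from the Step 0 sketch.
queue.publish(Event("analysis_completed", {"job": "qc_report", "output": "ds-0043"}))
```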
Conclusion
Hopefully this post gave you a good idea of the types of conceptual system components that you need to support the process of running an analysis. Because a lot of them tend to be implemented as informal institutional knowledge, it’s easy to overlook them. But even when these informal implementations are fine, particularly for early-stage startups, I think it’s important to be very deliberate about each of them. Next week I’ll begin to explore how to think about the level of structure for each component and how to evolve each one to the level that’s right for your team.