In my post last week, I claimed that you can break down a digital twin of the lab into two halves, the model and the implementation, both of which are hard (but in different ways). In this post, I want to start discussing the model, since that has to come before the implementation. And I’ll start by breaking it down even further into three kinds of models: the data model, the decision model, and the workflow model. This week, I’ll explore the data model, which is the static picture of how you organize your information. It’s sometimes called a schema or an ontology. Either way, it represents the lab and beyond as structured data.
Now, with all the recent progress on Large Language Models (LLMs), a number of people are arguing that soon we won’t need structured data anymore: we’ll just feed any scraps of minimally organized information into some future incarnation of GPT and ask it natural-language questions to extract whatever we need. And sure, maybe this is where we’re headed. I’ve made enough bad predictions in my life to know better than to say “never”. But we’re not there yet, so I’m going to write in terms of the technology we have today. One way or another, this post will eventually be obsolete, and sure, maybe that will be the reason. Why not?
There are decades of literature out there about how to organize information into structured, relational data. And while I like debating the relative merits of second vs. third normal form as much as I’m sure you do, I’ll leave that to others for now. Instead, I want to explore what happens when you don’t get the data model right - when there’s a mismatch between how your organization thinks about data and how the system stores it.
This can happen at any level of detail, down to the intricate details of samples and plate maps and reagents. But to make the discussion a bit more accessible, I want to look at a higher-level example: how you organize projects and experiments. Let’s say your organization has a handful of drug programs, each of which carries out a sequence of studies, with each study consisting of a collection of experiments. And let’s say your ELN has a concept of experiment, and lets you assign experiments to projects, but doesn’t have an intermediate concept of study. How do you keep track of which experiments are in which study?
1. You could track this outside the ELN, maybe in a shared spreadsheet, and somehow make sure everyone remembers where it is and keeps it up to date.
2. You could include the study name in the experiment names, and train everyone on the team to consistently use and interpret this convention.
3. You could just drop the notion of studies, since it’s not worth the effort.
None of these are good options. In fact, the only one that’s likely to “work” is the third. And even if you don’t have this particular problem, you probably have something similar, plus many more at every level of detail from the experiment down. This is why ELN/LIMS developers make their data models customizable. But there are always limits. And those long and tedious requirement lists that IT departments and consultants often use to select ELN/LIMS software don’t really check whether the data model that best fits your organization is compatible with the system’s data model.
They don’t check this because it’s hard to do - both to figure out and document the ideal data model, and to compare this to what an off-the-shelf system allows.
But it’s probably worth trying.
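One lightweight way to start: write down the entity types and parent/child links your organization actually thinks in, and diff that against what a candidate system supports. A toy sketch, assuming (hypothetically) that the ELN’s “project” plays the role of your “project” and that each model is just a mapping from entity type to its parent:

```python
# Each model maps an entity type to its parent type (None = top level).
# Both dictionaries are illustrative, not any real system's schema.
IDEAL_MODEL = {
    "project": None,
    "study": "project",
    "experiment": "study",
}

ELN_MODEL = {
    "project": None,
    "experiment": "project",
}

def missing_levels(ideal: dict, system: dict) -> set:
    """Entity types in your mental model that the system has no slot for."""
    return set(ideal) - set(system)

print(missing_levels(IDEAL_MODEL, ELN_MODEL))  # -> {'study'}
```

Even something this crude surfaces the gap before you sign the contract, rather than six months into an implementation.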
Scaling Biotech is brought to you by Merelogic. We’ll help you design and build your information architecture to support modern AI applications, ensuring you can turn your ML prototypes into tangible impact. To learn more, send me an email at jesse@merelogic.net
Or... don't use an ELN like that, where you end up hacking around a constrained data model and duplicating data everywhere. Use a proper platform and none of this is an issue at all. Better yet, combine a proper platform with proper search tools, and again you can get the data in a workable form with zero SQL, zero code, and any user can do it. Such a solution exists :)
I love this article. Coming from travel tech before starting Scispot.com, I refer to this as a data dictionary.
Every ELN/LIMS should eventually support a configurable data dictionary. I have seen many companies start implementation without knowing their data dictionary - obviously it evolves as you grow your workflows and results.
However, having a clear ontology map, and understanding how it impacts your downstream computational workflows, pays off.