Can you use data without having it? (Yes.)
Almost all of the pharma leaders I’ve talked to about how they identify and evaluate AI partners have, at some point in the conversation, lamented how technical partners always insist on owning the models, regardless of where the data comes from. So as I was writing last week’s post about Eli Lilly’s TuneLab, it got me thinking about all the different ways you can organize access and ownership for both biomedical data and the models trained on it. I don’t have a good framework for thinking about this yet, but I thought it might be interesting, this week, to look at three companies that have three different ways of doing this for otherwise very similar data.
The companies I’m going to look at are Owkin, nference and Tempus, all of which offer some combination of electronic health record (EHR) data with associated genomic and digital pathology data. This is data collected from real patients, in the course of diagnosing and treating illness. It’s heavily weighted towards cancer patients, since they’re the most likely to get genomic and histopathology screens. And it’s more of a “you get what you get” situation, compared to in vitro data, where you can actually plan what you’ll collect/generate.
There are substantial differences in the nature and quality of these companies’ datasets, but that’s not what this post is about. Instead, I want to explore how each company collects this data and how they provide it to their customers/partners.
I’ll start with the simplest one, Tempus, which I’ve written about before. Their model is straightforward: They make genomic assays that hospitals and clinics use for cancer diagnoses. That’s the main purpose of the assays, but Tempus also has agreements with the hospitals/clinics that allow Tempus to de-identify the results and aggregate them into a research dataset. They then provide portions of this (de-identified) data directly to pharma companies and other research groups.
This is similar to how UK Biobank and the All of Us program in the US share data, except that those programs collect data solely for the research dataset, so they have mostly healthy individuals and more consistent data for them.
Owkin and nference are different in that they don’t own the data. Instead, they act as intermediaries between the hospitals and clinics that collected the data and the researchers who want to use it.
In particular, because many of these hospitals and clinics don’t have the same level of sharing rights written into their patient agreements, or are just understandably cautious about sharing even de-identified data, Owkin and nference can’t just deliver the data directly to researchers. Instead, they have to create ways for researchers to use the data indirectly. And they’ve taken two different approaches.
nference is the more straightforward of the two: They still collect the de-identified data from all these clinics into a single centralized dataset. But instead of shipping this data directly to their customers, they let customers analyze it through a carefully controlled user interface that allows them to define analyses and see aggregate statistics, but not view or download raw data. nference also has internal analysts who can run the analysis for customers.
Now, there’s often a fine line between small data aggregates and raw data, but what nference promises clinics is that it’s figured out how to walk that line. And what they promise their customers is that this gives them access (if indirectly) to data that they wouldn’t be able to use otherwise.
In some ways, this is similar to how 23andMe shares (shared?) insights and statistics from their own internal dataset, but doesn’t share the data directly. (Though who knows what their new business model will be?)
Owkin takes this idea one step further, by offering indirect access to similar data without even collecting their own centralized dataset. Instead, they use federated learning, which is the same approach that Lilly’s TuneLab uses.
Here’s how federated learning works: A machine learning model is defined by a collection of numbers called weights. You train a model by checking how well it predicts values with a particular set of weights, then using an algorithm that computer scientists call back propagation and calculus teachers call the chain rule to determine how to adjust those weights to make the predictions closer to ground truth. If you do this over and over again, you get a trained model.
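That loop of predict, compare, adjust can be sketched in a few lines of Python. This is a toy one-weight model with squared-error loss, purely for illustration; the data and learning rate are made up:

```python
# Toy training loop: a one-weight model y_hat = w * x with squared-error loss.
w = 0.0
data = [(1.0, 3.0), (2.0, 6.0)]  # (x, y) pairs; the true relation is y = 3x

for _ in range(100):
    for x, y in data:
        y_hat = w * x
        # Chain rule (back propagation): dL/dw = dL/dy_hat * dy_hat/dw
        #                                      = 2 * (y_hat - y) * x
        grad = 2 * (y_hat - y) * x
        w -= 0.05 * grad  # adjust the weight to bring predictions closer to ground truth

print(w)  # converges to ~3.0
```

Real models have millions or billions of weights instead of one, but the mechanics are the same.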
In federated learning, rather than sending the data to where the model is to run back propagation, you send the model weights to where the data is, run back propagation there, then send the updated weights back to the central model.
This is what Owkin arranges between its clinical partners and its customers: Customers define models. Owkin ships the model weights around to the clinics, runs back propagation within the clinics’ data centers, then hands the final weights (or, rather, the trained model) back to the customer.
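As a rough sketch of the idea, here is generic federated averaging (FedAvg) in Python. This is not Owkin’s actual system; the "clinics," model, and hyperparameters are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

def make_clinic_data(n):
    # Each clinic holds its own private (X, y) data; it never leaves this scope.
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

clinics = [make_clinic_data(50) for _ in range(3)]

def local_update(w, X, y, lr=0.1, steps=20):
    # Back propagation for a linear model with squared-error loss,
    # run where the data lives.
    for _ in range(steps):
        grad = 2 / len(y) * X.T @ (X @ w - y)
        w = w - lr * grad
    return w

# Federated rounds: ship the current weights out to each clinic,
# train locally, then average the returned weights.
w = np.zeros(2)
for _ in range(10):
    local_weights = [local_update(w.copy(), X, y) for X, y in clinics]
    w = np.mean(local_weights, axis=0)

print(w)  # approaches true_w; only weights ever crossed clinic boundaries
```

The key property is that the raw patient-level arrays never move: the only things transmitted are weight vectors, going out and coming back.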
Now, as with small data aggregates, there’s a fine line between a very detailed model and raw data. And that’s kind of a theme of a lot of my recent posts: Having a good model is as valuable as, or potentially more valuable than, having the dataset it was trained on. And it seems to be getting easier every day to turn the data you have (or someone else has) into a good model.
It’s nice that we’re seeing more viable solutions to these technical problems. The legal problems and contract negotiations, on the other hand - ownership, privacy, trust - are only going to get harder.

