How far should we be trusting LLMs?
This would probably have been a great post to write about all the announcements that came out of JPM, but I’ve spent the last couple of weeks deep in the weeds of figuring out what LLMs can and can’t do to help me build the Biopharma AI Landscape. So I wanted to quickly jot down some thoughts about that, and I’ll hopefully have something interesting to say about JPM news in the next few posts.
I’ve started thinking about the limitations of LLMs in three categories:
Limitations of available data: There are some things we just can’t know, and any approach that depends on that missing information will either be limited in what it can do or will have to make a best guess from what is available (something something conditional probability, something something information theory…)
Limitations of instructions: LLMs try to do what you tell them to do, but if you don’t know what you want or don’t know how to put it into words, then they’re going to essentially be guessing on the details. LLMs give us the unique ability to communicate with them in natural language, but using human language to communicate with something that isn’t human disguises more limitations than it removes.
Limitations of the technology: Given the same data and the same precision of instructions, LLMs are going to do some things better and some things worse than other technology, or than a human could do. This is where it really gets interesting.
Given those limitations, the next question is what you’re trying to accomplish. For the rest of the post, I want to explore how this plays out for three specific kinds of tasks. This is nowhere close to an exhaustive list, but I’ve found it a useful way to think about it for the work I’ve been doing.
Recall
We’ve been conditioned to think of ChatGPT as a drop-in replacement for Google search, i.e. a way to look up information. The benefit is that you can describe what you’re looking for much more precisely than you could in a web search bar. The trade-off is that you’re handing over control to a stochastic black box.
This is not to say we shouldn’t use LLMs for information recall, only that we should be thoughtful about when we use them vs other available tools.
There are essentially two ways that LLMs do recall: 1) There are certain things the model “knows” and can tell you. 2) It can look up information in another source - a database, a web search, an embedding database, etc. - and then translate (see next section) the results into what you were looking for. This second way is called Retrieval Augmented Generation (RAG).
It’s interesting to think about the things that a model knows because it demonstrates that there isn’t a clear line between understanding the meaning of words (arguably what these models were originally trained for) and knowing more fundamental knowledge.
But on a more practical note, LLMs are definitely biased. No shame in that - humans have things like recency bias and confirmation bias. LLMs have their own set of biases that manifest in things like hallucinations. But even when they’re not hallucinating, they gravitate toward recalling certain information more readily than you’d probably like.
For example, I’ve been testing how good LLMs are at telling me which companies offer similar tools and products. If you ask ChatGPT “who does [biopharma AI startup XYZ] compete with” it will almost always say Benchling, or maybe Medidata. And like, yeah, I get that - on some level Benchling competes with everyone in this space. But that’s often not what I’m looking for.
That’s part of why I think the Biopharma AI Landscape is necessary, despite the fact that you could just ask ChatGPT: Having the data in a deliberately designed form that you can feed to the LLM in a controlled way turns recall into a translation problem (next section), something that LLMs are more consistent and generally better at.
Using word/concept embeddings as an intermediate step for recall is a big improvement because it gives you more control over the process. There’s still going to be some bias based on how the dimensions in the embedding space are weighted, but at least it’s a more controlled and deliberate process.
Translation
This is what I think LLMs are really good at. In fact it’s arguably the only thing they do: When they read your instructions and any input data, they map it into a concept space. Then they follow some path from that starting point. This path is a mix of recalling information that they already “know” and translating what you gave them back into words and symbols. As noted above, the former is a bit unreliable. But they’re generally pretty good at the latter.
I mean, they’re still biased and still stochastic. But I’ve found them to be more reliable at translating what you give them from concept space back into words/symbols than at going off completely on their own.
So I’ve basically gotten to a point where I try to turn any task I use an LLM for into a translation problem. That doesn’t mean I always can - there are times when you need to use an LLM for pure recall, particularly if you’re comfortable with the potential bias. But if I can avoid it, I do.
Categorization
Where I really want to use LLMs, but have so far been struggling, is for putting things into categories. The way I categorized all the companies in the Landscape was to create careful descriptions of each category, have the LLM generate a short description of each company, have the LLM pick one or more categories for each company, then throw away the results and do it myself manually.
The results from the LLM were just absolute (am I allowed to swear on here?). I asked it to use the short descriptions of each company, the long descriptions, and the raw contents of their websites. I tried giving it more detailed instructions about how to pick the categories, and more detailed descriptions of the categories. But no matter what I tried, I could never get it past maybe 60-70% accuracy.
So I ended up going through each company and just assigning the categories myself - all 368 of them. I’m sure I wasn’t 100% accurate either, but I definitely did better than the model. (Granted it was an unfair competition since I was the one who defined the categories in my own biased way…)
The last couple of weeks, as I’ve been digging into the data more, I’ve realized just how important it is to be able to categorize the companies, their features, etc. in different ways. So I’ve been returning to this problem of how do you get an LLM to consistently categorize things. (This is the reason I’m writing about this today instead of about JPM.)
I think a large part of the difficulty is limitations of instructions: You need to be able to define the categories clearly and objectively - not just abstractly, as what you’re looking for, but also in terms of how others are likely to describe it. Many categories end up relying on “you know it when you see it.” And that second part - figuring out how others may describe what you’re looking for and then organizing that information - is a lot of work. Or at least I haven’t yet found a way to get the LLM to do more of it so I can do less of it.
What I need to do is figure out how to turn these categorization problems into translation problems that I can then augment with classical algorithms and heuristics. I haven’t gotten there yet, but when I do I’ll let you know.
And yes, I’ve tried asking ChatGPT…

