The first "AI Scientist" won't be a chat bot
I’ve been wanting to write about the interesting work that FutureHouse has been doing on reasoning models for biology, but I’m torn about how to frame it: On the one hand, I’m worried that the problems it tackles may not be very urgent or expensive for its prospective users. On the other hand, I know that I’m often overly skeptical of AI solutions and I want to give them a fair assessment. It’s not like anyone was urgently asking for ChatGPT, and yet many of us have now integrated it into our lives to the point where it’s probably unhealthy. So after pondering it for a while, I decided to write about the obstacles that I think an approach like this would need to overcome to become a viable “AI scientist.” That’s what this post is.
In between writing these posts, I'm building a catalog of scientific AI x Bio R&D use cases, to help pharma/biotech leaders decide where to invest. I’m looking for feedback on an early prototype I just published. Check it out and send me any suggestions/corrections/additions at jesse@merelogic.net.
From talking to a number of teams that have built chat bots for science, the consistent story I hear is that a lot of people are using them but very few are clamoring for more. This matches how many people seem to use general-purpose LLM chat bots like ChatGPT: We use them for lots of little things, and the more we use them, the more reasons we think of to use them for other things. But for most people, there isn’t any one thing that only ChatGPT can do and that we can’t live without.
The one major exception, of course, may be writing code. LLMs have proven very effective at augmenting humans in software development, whether users take it to the extreme of vibe coding or just use it as a substitute for Stack Overflow. But this is happening less and less in chat interfaces, and more and more directly in IDEs and code editors.
A number of people have pointed out that it usually takes a while for the form factor of a new technology to catch up with the technology itself. Early cars looked like stagecoaches without horses. Early mobile apps looked like embedded web pages. So it’s natural to expect that as more specific use cases emerge for LLMs, they will evolve into new form factors, likely embedded in the kinds of tools that scientists are already using.
That’s why, like the title says, I think the first successful “AI Scientist” won’t be a chat bot. This addresses the first layer of my skepticism, but there are still two big hurdles that FutureHouse, or anyone else building an AI scientist, will have to overcome before they find a killer use case, regardless of the UI.
The first hurdle is that LLMs aren’t actually very good at reasoning, while scientists are. In fact, most scientists like doing the reasoning and would be reluctant to give it up even if the models were better at it.
On the other hand, LLMs are better than scientists at things like information retrieval and objective pattern matching. And while scientists may still like doing some or all of these things, this at least opens the door to tasks that the LLM is objectively better at. It’s just a question of finding implementations and form factors that allow the LLM to complement the scientist’s reasoning. I don’t know what those will look like, but it’s certainly possible.
That was the easy one.
The second and bigger hurdle is more systemic: Most of the applications that need the kinds of information retrieval and pattern matching that LLMs are good at happen in the shortest, least organized, and least appreciated stages of the drug development pipeline.
Once a program gets into screening, then optimization, then preclinical work, and so on, most teams are just following a process, using internally generated, fit-for-purpose data. They may occasionally need to go back to hypothesis generation when they hit an anomaly or a dead end, but they’re not planning for that.
That means that the place where scientists/program teams might reach for something like what FutureHouse has built is at the very beginning - in the early exploration stage of a new program. That’s when they don’t really know what they’re doing yet and don’t have any data, so they want to use as much public data as possible to understand what others have already found.
This stage is very open-ended, so there aren’t standard tools that an AI solution can slot into the way it can with a code editor. There’s also a general pattern that earlier stages of the pipeline are valued less than later stages, and the earliest stage is typically valued the least. That’s going to make it harder to find the one killer use case that decision makers recognize as an urgent, expensive problem.
That’s not to say it isn’t possible - I’ve been wrong about predictions like this often enough to stop making them (mostly). I’m just trying to explain why it’s hard.
I also think there’s potential, in the long term, for AI to start changing the calculus of how the stages of the pipeline are valued. I touched on this in my post about Axiom, but it deserves a longer post of its own in the future. Maybe next week, if there isn’t any more pressing biotech news between now and then. In the meantime, I’m definitely keeping my eye on this space.
Thanks for reading Scaling Biotech!