AI systems rival doctors in new Nature studies, but one result suggests the tech won't age well

wpnews.pro

Two new studies published in Nature show that specialized AI systems diagnose diseases and make treatment decisions as well as physicians in simulated patient cases, sometimes even better. Both systems run on base models that are already outdated.

AI programs built specifically for medicine are getting closer to real clinical value. That's the takeaway from two papers published simultaneously in Nature. The German system MIRA outperformed doctors in diagnosing conditions like pancreatic cancer and pneumonia. Google's system AMIE produced more accurate treatment and testing plans.

MIRA works like a doctor inside a simulated hospital

MIRA stands for Medical Intelligence for Reasoning and Action. It was developed at TUD Dresden and Heidelberg University, among other institutions. Unlike standard chat tools, the system operates as an autonomous agent inside a sealed, virtual electronic health record. According to the study, MIRA can choose from more than 85,000 options across eleven tools. It takes patient histories, orders lab work, microbiology tests, and imaging, interprets results, generates differential diagnoses, and writes treatment plans including prescriptions, surgical planning, and hospital admissions.

The team tested MIRA on more than 500 real emergency department cases from the public MIMIC-IV dataset. A second AI agent played the patient, sharing only information from the actual medical record.

Across eight disease categories, MIRA hit the right diagnosis 88.9 percent of the time, measured against the diagnoses documented in the dataset. For a direct head-to-head comparison, MIRA and human physicians worked through a subset of 311 cases under identical conditions. MIRA scored 87.8 percent. Four experienced specialists reached 78.1 percent. A mixed team of residents and specialists managed 71.1 percent. MIRA did best on appendicitis (98.6 percent) and pancreatitis (92.3 percent). Both AI and doctors struggled more with pneumonia (72.4 percent) and urinary tract infections (77.6 percent).
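To put the head-to-head figures side by side, the gaps can be worked out in percentage points. The short Python sketch below uses only the three accuracy values reported for the 311-case subset; the dictionary keys and the helper name `gap_in_points` are illustrative labels, not terms from the study.

```python
# Accuracy (percent) on the 311-case subset, as reported in the article.
accuracy = {
    "MIRA": 87.8,
    "specialists": 78.1,
    "mixed team": 71.1,
}


def gap_in_points(a, b):
    """Difference between two reported accuracies, in percentage points."""
    return round(accuracy[a] - accuracy[b], 1)


# MIRA's lead over each physician group.
gaps = {name: gap_in_points("MIRA", name) for name in accuracy if name != "MIRA"}
print(gaps)
```

These are differences in percentage points, not relative improvements, and they say nothing about statistical significance.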

The researchers also checked how safe the recommendations were. Blinded specialist reviewers who didn't know whether a recommendation came from MIRA or a human found no dangerous drug interactions, no incorrect dosing for patients with impaired kidney function, and no risky painkiller prescriptions. MIRA was nearly perfect at capturing a patient's current medications. It also nailed the question of whether a patient needed to be admitted: it didn't miss a single case that required hospitalization. Performance held steady even when test patients spoke only German or French, or acted particularly anxious. The source code is available on GitHub.

AMIE pairs two agents with clinical guidelines

Google's AMIE takes a different approach: managing patients across multiple visits. The system has two parts. A conversational agent handles the fast, friendly dialogue with the patient. A second agent works in the background, thinking more carefully and cross-referencing the case against medical guidelines.
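The split between a fast conversational agent and a slower background agent can be pictured with a toy Python sketch. Everything in it is an assumption made for illustration: the names `Case`, `dialogue_agent`, `reasoning_agent`, and `GUIDELINES` are hypothetical, the guideline entry is a placeholder, and no language model is called. It is not Google's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical guideline store: condition -> recommended step.
# The entry is a made-up placeholder, not text from any real guideline.
GUIDELINES = {"example-condition": "example first-line step"}


@dataclass
class Case:
    condition: str
    transcript: list = field(default_factory=list)
    plan: list = field(default_factory=list)


def dialogue_agent(case, patient_message):
    """Fast agent: logs the patient's message and replies immediately."""
    case.transcript.append(("patient", patient_message))
    reply = "Thanks, I have noted that."
    case.transcript.append(("agent", reply))
    return reply


def reasoning_agent(case):
    """Slow agent: cross-references the plan against the guideline store."""
    recommended = GUIDELINES.get(case.condition)
    if recommended and recommended not in case.plan:
        case.plan.append(recommended)
    return case.plan


case = Case(condition="example-condition")
dialogue_agent(case, "I have had symptoms for three days.")
reasoning_agent(case)
print(case.plan)
```

The point of the split is that the patient-facing reply does not have to wait for the guideline check, which can run and revise the plan in the background.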

In a tightly controlled study, Google compared AMIE with 21 primary care physicians across 100 cases spanning multiple visits. The benchmark was the UK's NICE Guidance and BMJ Best Practice guidelines. Actors portrayed patients via text chat. According to the study, AMIE matched the physicians on treatment decisions and beat them on plan accuracy and guideline adherence. At the first visit, AMIE's overall plan was rated appropriate in 95 percent of cases. For the physicians, that number was 72 percent. Both specialist reviewers and the patient actors preferred AMIE more often than the human doctors.

To test drug knowledge, the team built a dedicated benchmark called RxQA, based on two national drug formularies and verified by licensed pharmacists. AMIE outscored the primary care physicians on the harder questions. The test was tough for both sides, though. Even on the easier questions, the best score stayed below 75 percent.

Both teams warn against jumping to conclusions

The authors are clear about the limits of their findings. MIRA recommended "care that deviated from best practices" for a "small but non-zero" share of patients. The simulated patient's answers may also have been "more structured than real speech of patients in emergency departments." And it can't be ruled out entirely that the freely available MIMIC-IV dataset was already part of the training data for the models used. If so, the measured performance would be more of a ceiling than a realistic estimate. The comparison physicians also worked in the German emergency department system, which differs from those of other countries.

The AMIE developers call their study a "milestone" but stress that neither the case selection nor the text-only conversations reflect a real clinic. The system shows "promising capabilities" but is "not ready for real-world translation." More work is needed to address "latent reasoning errors" that can creep into the system's hidden reasoning steps.

Jakob Kather, whose research group co-developed MIRA, told the Financial Times: "We are getting a preview of how AI could transform medicine." He compared AI agents like these to an airplane's autopilot: "These systems can support and relieve medical professionals by taking over routine tasks, but ultimate responsibility will always remain with the physicians."

Independent experts temper the excitement

Researchers not involved in either study praised the careful methodology but pointed out that these are simulations. Catherine Pope, a professor of medical sociology at the University of Oxford, told the FT that this is "some remove from the messy, complex, human world of everyday healthcare."

Julie Jacko, a professor of health informatics at the University of Edinburgh, said many of the reported advantages came down to "precision and completeness of plans" rather than "clear differences in clinical correctness." The study "demonstrates performance against a structured standard rather than fully capturing the complexity of real clinical decision-making."

Scaffolding helps weak models most, stronger ones don't need it

One of the most revealing findings is buried in AMIE's supplementary experiments. As often happens with peer-reviewed studies, both systems rely on older AI models. AMIE still runs on Google's older Gemini 1.5 Flash. MIRA uses OpenAI's GPT-4o and o1-preview. All of these have since been surpassed by newer generations.

Google's researchers swapped out individual components to figure out what actually drives performance: the elaborate scaffolding of two-agent architecture, guideline matching, and specialized training, or simply the underlying language model.

With the older Gemini 1.5 Flash, the specialized setup delivered the big performance boost the study describes. But when the researchers dropped the same setup onto the newer Gemini 2.5 Flash, the advantage almost vanished.
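A component-swapping experiment of this kind can be thought of as a grid of configurations. The Python sketch below only enumerates such a grid and contains no results. The two model names come from the article; the component labels and the full on/off factorial design are assumptions for illustration and may not match the researchers' actual protocol.

```python
from itertools import product

BASE_MODELS = ["Gemini 1.5 Flash", "Gemini 2.5 Flash"]  # named in the article
# Component labels are illustrative assumptions, not identifiers from the paper.
COMPONENTS = ["two_agent", "guideline_matching", "specialized_training"]


def ablation_grid():
    """Yield each base model paired with every on/off combination of components."""
    for model in BASE_MODELS:
        for switches in product([True, False], repeat=len(COMPONENTS)):
            yield {"model": model, **dict(zip(COMPONENTS, switches))}


configs = list(ablation_grid())
print(len(configs))  # 2 models x 2**3 switch settings
```

Scoring each configuration on the same cases would then show how much of the gain comes from the scaffolding and how much from the base model.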

The specialized system, in other words, compensates for the older model's weaknesses by forcing structured reasoning, making it cite guidelines, and suppressing hallucinations. A stronger model can do all of that on its own. The paper acknowledges that AMIE's value shrinks as the base model improves. In fact, newer general-purpose models like Gemini 2.5 Pro, o3, and GPT-5 already score "largely comparable" to the full AMIE system on the RxQA drug test.

In practice, AMIE appears to have been overtaken by the pace of AI development. It's a pattern that keeps repeating: scaffolding around language models becomes redundant as stronger models arrive, sometimes because the scaffolding itself feeds into training data for the next generation. That doesn't make the ideas behind it worthless. In coding and increasingly in other areas, scaffolding tools like Claude Code, OpenAI Codex, and Claude Cowork give models access to tools, context, and memory. Even stronger models still perform better on complex tasks with that kind of support. But the scaffolding has to keep up with model performance, or it eventually becomes dead weight.

MIRA lacks this kind of analysis. Part of its architecture, though, is less about patching model weaknesses and more about connecting the AI to a hospital's clinical systems. That part wouldn't become obsolete with stronger models.

source & further reading

the-decoder.com — original article