Artificial intelligence has already proven it can perform specific medical tasks, such as interpreting X-rays or flagging risks in patient data. But caring for patients is not a series of isolated decisions. It is a dynamic process that unfolds over time, requiring clinicians to interpret signals from multiple sources and intervene as a patient’s condition changes. Stabilizing a patient may require a physician to synthesize lab values and medical images, listen to lung or heart sounds, observe physical responses, and decide when to escalate care — often under severe time pressure.
Given that complexity, how far can modern AI systems really go? More specifically, can a large language model manage an entire clinical decision-making workflow, rather than just individual tasks within it?
That question is the focus of a new white paper by Mack Institute co-director Christian Terwiesch, Mack Institute predoctoral fellow Lennart Meincke, and Arnd Huchzermeier of WHU’s Otto Beisheim School of Management. The paper is the latest in a series of generative AI experiments by Terwiesch and colleagues, supported by the Mack Institute.
To explore this question, the researchers placed a multimodal large language model inside a realistic medical training simulation — the same type of system used to evaluate medical students and practicing clinicians. On screen, a virtual patient’s condition evolves in real time: vital signs change, test results arrive with delays, and inaction has consequences.
Rather than responding to a written prompt (such as “a 50-year-old male presents with chest pain”), the AI must decide what to do next at every step. It can question the patient, turn on monitors, order lab tests or imaging, administer treatments, and escalate care — all while the clock is running and the patient’s condition may be improving or deteriorating. In effect, the system is evaluated not on a single answer, but on whether it can manage an entire clinical encounter from start to finish.
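To make that closed-loop setup concrete, the sketch below shows one way such an evaluation could be wired together: a toy simulator whose patient state keeps evolving, and a rule-based stand-in for the model call. The class and function names (ToySimulator, choose_next_action) and the single hypoglycemia action are illustrative assumptions, not the interface or scenarios used in the study.

from dataclasses import dataclass, field


@dataclass
class ToySimulator:
    """Stand-in for a dynamic patient simulator: state evolves every step."""
    minute: int = 0
    glucose: int = 45            # toy hypoglycemia scenario, invented numbers
    stabilized: bool = False
    log: list = field(default_factory=list)

    def observe(self) -> dict:
        return {"minute": self.minute, "glucose": self.glucose}

    def apply(self, action: str) -> None:
        self.log.append((self.minute, action))
        if action == "give_dextrose":
            self.glucose += 40
        self.minute += 5         # time passes whether or not the agent acts
        self.glucose -= 2        # untreated, the patient keeps deteriorating
        self.stabilized = self.glucose >= 70


def choose_next_action(observation: dict) -> str:
    """Placeholder for the multimodal LLM call. A real run would send the
    observation (vitals, images, pending results) to the model and parse its
    chosen action; here a trivial rule stands in for that call."""
    if observation["glucose"] < 70:
        return "give_dextrose"
    return "check_glucose"


sim = ToySimulator()
while not sim.stabilized and sim.minute < 60:
    sim.apply(choose_next_action(sim.observe()))

print("Actions taken:", sim.log)
print("Stabilized at minute", sim.minute, "with glucose", sim.glucose)

The point of the structure, rather than the toy numbers, is that the model is judged on a sequence of decisions made against a clock, not on a single static answer.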
How the AI performed
The researchers placed an off-the-shelf multimodal large language model (Gemini 2.5 Pro) into BodyInteract, a medical training simulation widely used in education and certification. They evaluated the AI across four acute care scenarios, ranging from a simple at-home hypoglycemia case to complex emergency room situations involving pneumonia, stroke, and congestive heart failure.
The AI’s performance was benchmarked against more than 14,000 simulation runs by medical students, as well as against an experienced emergency physician who completed the same cases.
Across scenarios, the AI consistently stabilized patients and completed cases at rates comparable to — and in some cases higher than — medical students. It also completed cases substantially faster. Overall diagnostic accuracy was similar, and in many instances the AI’s sequence of actions closely resembled expert clinical practice.
Notably, the system was not trained to solve these specific cases or to imitate expert clinicians. Instead, the researchers evaluated a general-purpose model — not a custom-built medical system — and observed how it navigated diagnostic and treatment decisions when placed in a realistic, time-pressured clinical environment.
Understanding AI reasoning and confidence
Beyond whether the AI reached the correct outcome, the researchers examined how it reasoned along the way. As each case unfolded, they tracked how the system’s confidence in different possible diagnoses changed in response to new information, much like a clinician updating their thinking as test results arrive.
A clear pattern emerged. Early in a case, the AI tended to order tests that provided large amounts of new information, quickly narrowing the range of plausible diagnoses. As the encounter progressed, additional tests produced smaller gains, and uncertainty declined. In effect, the system behaved as if it were prioritizing the most informative actions first, rather than ordering tests indiscriminately.
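The sketch below illustrates that underlying idea with a toy Bayesian update: each candidate test is scored by how much it is expected to shrink the uncertainty (entropy) over a handful of diagnoses, so the most informative test surfaces first. The diagnoses, tests, and probabilities are invented for the example; the paper reports this behavior empirically rather than computing it this way.

import math

# Prior belief over three candidate diagnoses (made-up numbers).
prior = {"pneumonia": 0.4, "heart_failure": 0.4, "stroke": 0.2}

# P(test is positive | diagnosis), also invented for illustration.
tests = {
    "chest_xray_infiltrate": {"pneumonia": 0.90, "heart_failure": 0.30, "stroke": 0.05},
    "bnp_elevated":          {"pneumonia": 0.20, "heart_failure": 0.90, "stroke": 0.10},
    "head_ct_lesion":        {"pneumonia": 0.01, "heart_failure": 0.01, "stroke": 0.80},
}


def entropy(dist):
    """Uncertainty of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior, likelihood, positive):
    """Bayesian update of the diagnosis distribution after one test result."""
    unnorm = {d: prior[d] * (likelihood[d] if positive else 1 - likelihood[d])
              for d in prior}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}


def expected_information_gain(prior, likelihood):
    """Expected entropy reduction from running the test, before seeing its result."""
    p_pos = sum(prior[d] * likelihood[d] for d in prior)
    h_after = (p_pos * entropy(posterior(prior, likelihood, True))
               + (1 - p_pos) * entropy(posterior(prior, likelihood, False)))
    return entropy(prior) - h_after


for name, likelihood in tests.items():
    gain = expected_information_gain(prior, likelihood)
    print(f"{name}: expected information gain = {gain:.2f} bits")

Early tests with large expected gains collapse the diagnosis distribution quickly; later tests add less, which is the diminishing-returns pattern the researchers observed.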
Just as important, the AI’s confidence proved meaningful. When the system expressed high confidence in a diagnosis, it was very likely to be correct; when it remained uncertain, errors were more likely. This alignment between confidence and accuracy suggests that the AI was not simply overconfident, but able to distinguish between cases it had effectively resolved and those that remained ambiguous.
This finding stands out in light of growing concerns that large language models often express confidence that exceeds their actual reliability. In this dynamic, multimodal setting, the AI’s confidence tracked performance surprisingly well.
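A simple way to picture that check is a reliability table: group simulation runs by the model's stated confidence and compare it with observed accuracy within each group. The sketch below shows the mechanics only; the (confidence, correct) pairs are placeholders, not data from the study.

# Each entry: (stated confidence in the final diagnosis, whether it was correct).
# Placeholder values for illustration; real entries would come from the runs.
runs = [
    (0.95, True), (0.92, True), (0.88, True), (0.90, False),
    (0.70, True), (0.65, False), (0.60, True),
    (0.45, False), (0.40, False), (0.35, True),
]

# Bucket runs by confidence and report observed accuracy per bucket.
for lo, hi in [(0.0, 0.5), (0.5, 0.8), (0.8, 1.01)]:
    in_bin = [correct for conf, correct in runs if lo <= conf < hi]
    if in_bin:
        accuracy = sum(in_bin) / len(in_bin)
        print(f"confidence {lo:.1f}-{min(hi, 1.0):.1f}: "
              f"n={len(in_bin)}, observed accuracy={accuracy:.2f}")

A well-calibrated system shows accuracy rising with stated confidence, which is the pattern the researchers report for the model in this setting.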
Where humans still matter
The study also highlights clear limits. While the AI was fast and effective at stabilizing patients, it consistently engaged less in patient communication than human clinicians. It also tended to order more diagnostic tests than an experienced physician, suggesting that expert judgment remains superior when it comes to cost-aware diagnostic decision-making.
For these reasons, the authors emphasize that their results should not be interpreted as support for unsupervised AI in healthcare. Instead, the findings point toward a more targeted role for AI as a workflow-level support system, or as a “second set of eyes” alongside a physician. In time-critical or resource-constrained environments, such as emergency departments, AI could act as a rapid stabilizer or triage assistant — managing information, monitoring patient status, and flagging high-risk cases — while clinicians focus on judgment, communication, and oversight.
From an operations and management perspective, the broader lesson is that evaluating AI solely on static benchmarks understates its potential impact. What matters is not just whether an AI reaches the right answer, but how it manages uncertainty, time pressure, and trade-offs across an entire process. As AI systems continue to improve, the central challenge may no longer be whether they can reason, but how they should be integrated into human-centered workflows.
***
Read the full white paper, Evaluating LLMs for Dynamic, Multimodal Clinical Decision-Making, here.

