Why generic LLMs underperform in clinical workflows

I have spent the last two years deploying LLM-backed systems into clinical and clinical-adjacent workflows. The pattern that has surprised me most is how poorly the standard demonstrations of LLM capability translate into the clinical setting. The models are doing fine on the benchmarks. The systems are struggling in the workflow. The reason is worth unpacking.

Clinical inputs are not benchmark inputs

A typical clinical input is a free-text nursing note, a partial discharge summary, a progress note that references a medication the prescriber spelled three different ways, and a problem list that has not been reconciled in eighteen months. Benchmark inputs, by contrast, are clean prose, written by a single author, with a coherent structure. The model that scores well on the benchmark is being asked, the moment it touches the real workflow, to do work it has never done.

This is not a fine-tuning problem. It is a domain coverage problem. The model has seen clinical text in its pre-training, but the proportion is small relative to the variety of forms that text takes. Once you ask the model to extract structured information from a real progress note, you discover that the note refers to medications by abbreviation, mentions allergies in passing, and contains negation patterns that the model sometimes resolves correctly and sometimes does not.

The negation problem

Negation in clinical text is the single failure mode I see most often. A nursing note that says "the patient denies chest pain" contains the words "chest" and "pain", and a model that is doing keyword-shaped extraction will sometimes record chest pain as a present symptom. A more sophisticated model will get this right most of the time, but the failure rate is non-zero, and in clinical workflows even a small failure rate on negation is unacceptable.

The fix is not to argue with the model. The fix is to recognise that negation is a structural property of clinical language and to address it at the application layer, with rule-based post-processing or with a clinical NLP library that has been trained explicitly on clinical negation. The model handles the unstructured cases. The rule layer handles the cases where language structure dominates.
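
To make the rule layer concrete, here is a minimal sketch of a trigger-and-window negation check, loosely in the spirit of NegEx-style rules. The trigger list, the window size, and the function names are my own illustrative assumptions, not a validated clinical lexicon, and a real deployment would more likely lean on a maintained clinical NLP library.

```python
import re

# Hypothetical, deliberately tiny trigger list for illustration only.
NEGATION_TRIGGERS = {"denies", "denied", "no", "not", "without", "negative"}
# Words that end the scope of an earlier negation ("denies fever but reports ...").
SCOPE_BREAKERS = {"but", "however", "except"}


def _find_term(text: str, term: str):
    """Locate `term` as a whole phrase in already-lowercased `text`."""
    return re.search(r"\b" + re.escape(term.lower()) + r"\b", text)


def is_negated(sentence: str, term: str, window: int = 5) -> bool:
    """True if a negation trigger occurs within `window` tokens before `term`."""
    text = sentence.lower()
    match = _find_term(text, term)
    if match is None:
        return False
    preceding = re.findall(r"[a-z']+", text[: match.start()])
    # Walk backwards from the term; a scope breaker cancels earlier triggers.
    for token in reversed(preceding[-window:]):
        if token in SCOPE_BREAKERS:
            return False
        if token in NEGATION_TRIGGERS:
            return True
    return False


def present_symptoms(sentence: str, candidates: list[str]) -> list[str]:
    """Keep only candidates that are mentioned and not negated."""
    text = sentence.lower()
    mentioned = [c for c in candidates if _find_term(text, c)]
    return [c for c in mentioned if not is_negated(sentence, c)]
```

Run over the model's extracted symptoms as a post-processing pass, a check like this drops "chest pain" from "the patient denies chest pain" while leaving an affirmative mention alone. It only looks backwards within one sentence, so it is a floor, not a ceiling.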

The temporal problem

Clinical text is also full of references to time that the model does not always handle correctly. Take a note that reads: "Patient was on metformin two years ago, currently off." The model has to decide whether metformin belongs on the current medication list. A naive extraction will include it. A correct extraction will not.

This is the same kind of problem as negation. It is structural, not statistical. The model can be coaxed into getting it right much of the time with careful prompting. It cannot be made to get it right reliably enough for a clinical workflow without support from the application layer.
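
As a rough illustration of what application-layer support can look like here, the sketch below tags each medication mention as current or historical before it is allowed onto a medication list. The cue phrases and function names are assumptions of mine for the example, and the check is sentence-level only, so a sentence that mixes a current and a past medication would need finer scoping.

```python
import re

# Hypothetical cue phrases suggesting the medication is not currently taken.
HISTORICAL_CUES = (
    r"\bwas on\b",
    r"\bago\b",
    r"\bcurrently off\b",
    r"\bdiscontinued\b",
    r"\bstopped\b",
    r"\bpreviously\b",
    r"\bno longer\b",
)


def medication_status(sentence: str, drug: str) -> str:
    """Classify a drug mention as 'current', 'historical', or 'absent'."""
    text = sentence.lower()
    if re.search(r"\b" + re.escape(drug.lower()) + r"\b", text) is None:
        return "absent"
    if any(re.search(cue, text) for cue in HISTORICAL_CUES):
        return "historical"
    return "current"


def current_medications(mentions: list[tuple[str, str]]) -> list[str]:
    """Filter (sentence, drug) pairs down to drugs classified as current."""
    return [
        drug
        for sentence, drug in mentions
        if medication_status(sentence, drug) == "current"
    ]
```

Given the metformin sentence above, medication_status returns "historical" and the drug is kept off the current list. In practice I would route anything the rules and the model disagree on to human review instead of trusting either one.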

The reconciliation problem

Clinical workflows are not single-document workflows. The information that matters lives across the EHR, the lab system, the imaging report, the medication administration record, and notes from previous encounters. A general-purpose LLM, even one with a long context window, is not built to reconcile these sources. It will summarise what is in front of it and call that the answer.

The architectures that work in clinical settings tend to do explicit reconciliation work outside the model. They retrieve specific facts from specific sources, validate them against each other, and only then give the model a clean composite to reason about. This is more engineering than the demos suggest. It is also why production clinical AI takes longer to build than the executive team usually budgets for.
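
The retrieve, validate, then compose pattern can be sketched in a few lines. Everything here is hypothetical: the Fact type, the source names, and the rule that a field only counts as agreed when every source reporting it gives the same value. Real systems need far richer matching (codes, units, timestamps), but the shape is the same.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    source: str  # e.g. "ehr", "mar", "lab" (illustrative labels)
    field: str   # e.g. "allergy", "metformin_status"
    value: str


def reconcile(facts: list[Fact]) -> tuple[dict, dict]:
    """Split facts into fields all sources agree on and fields in conflict."""
    by_field: dict[str, dict[str, str]] = defaultdict(dict)
    for fact in facts:
        # Light normalisation so trivial formatting differences do not conflict.
        by_field[fact.field][fact.source] = fact.value.strip().lower()

    agreed: dict[str, str] = {}
    conflicts: dict[str, dict[str, str]] = {}
    for field, by_source in by_field.items():
        distinct = set(by_source.values())
        if len(distinct) == 1:
            agreed[field] = distinct.pop()
        else:
            # Keep per-source values so a reviewer can see who said what.
            conflicts[field] = dict(by_source)
    return agreed, conflicts
```

Only the agreed dictionary would be handed to the model as its clean composite. The conflicts dictionary goes to a person or to a more specific rule, which is where much of the unbudgeted engineering time ends up.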

What this means for build versus buy

The question of whether to build clinical AI in-house or to buy it is currently being decided badly inside many health systems. The buy decision is being made on the strength of demos that do not test the failure modes that matter. The build decision is being underestimated because the demos make it look easier than it is.

The right framework, in my experience, is to ask whether the clinical workflow you are targeting has been done well by any vendor at any health system that resembles yours. If yes, buy and integrate. If no, you are building something new, whether you call it that or not, and you should resource it accordingly. Pretending otherwise is the path to the eighteen-month pilot that everybody recognises and nobody wants to admit they are running.