# Why Healthcare AI Fails in the Real World

> Source: <https://dev.to/sciforce/why-healthcare-ai-fails-in-the-real-world-5865>
> Published: 2026-05-27 14:02:04+00:00

In 2018, a clinical informaticist launched a tool to handle intake forms and clinical notes so doctors could spend less time typing and more time doctoring. A small [study](https://arxiv.org/abs/2306.13680) with 18 medical students suggested that the Cydoc smart intake form could substantially reduce note-writing time while maintaining note quality, although broader validation in practicing clinicians was still needed. By August 2025, the company was gone.

The [postmortem](https://glassboxmedicine.com/2026/02/21/why-i-shut-down-my-bootstrapped-health-ai-startup-after-7-years-a-founders-postmortem/) names the main reason: Cydoc lived outside the EHR. Doctors had to copy the notes from the Cydoc interface and paste them into the EHR, which meant working in two windows and adding an extra workflow step for routine clinical documentation. The founder later described the lack of EHR integration as a fatal adoption mistake.

Cydoc isn’t an exception. Even with a strong model, healthcare AI projects can fail when they add friction to already complex clinical workflows. A [Gartner survey](https://councils.forbes.com/blog/from-ai-hype-to-roi-how-leaders-can-realize-value-from-genai) of infrastructure and operations leaders conducted in late 2025 found that only 28% of AI use cases fully succeeded and met ROI expectations, while 20% failed outright; poor data quality, limited data availability, and weak workflow integration were among the reported barriers. The same mistakes recur at every stage, from pre-build through pilot to scale. The good news is that they are not inevitable.

Pre-build failures are the easiest to miss because there is nothing to debug yet and nothing live to roll back. By the time the consequences show up, fixing them can be significantly more expensive than preventing them during product design, data access planning, and workflow discovery.

Cydoc knew from the beginning that EHR integration mattered: the founder had lived through broken EHR workflows in her own clinical training. But the company couldn't afford to build it, so they shipped without it and postponed the problem. The EHR integration never arrived, and Cydoc spent years trying to sell a tool that required clinicians to change their workflow instead of fitting into it.

The most common pre-build failure starts when someone finds something the model can do well and only then starts looking for a clinical problem to attach it to.

The tool gets built, the scores look good, and nobody uses it. An alert that confirms what a physician already suspects, or points at a risk they can't act on in that moment, gets ignored regardless of how accurate it is.

Before building anything, find one clinician who deals with the problem you are targeting and ask when exactly it happens in their shift, what they do now, and whether a tool like yours would genuinely make the job easier or just add more friction. For healthcare AI, “user discovery” is not a marketing exercise; it is a clinical safety, adoption, and implementation requirement. Sometimes the answer points away from AI entirely, and accepting that at the very beginning saves months of work and thousands of dollars.

The common mistake is thinking that the data will look something like a labeled research dataset. Real EHR data is chaos: a large share of clinically meaningful information exists in unstructured notes, reports, and narratives, and much of it is not mirrored in structured fields. Any project counting on clean, analysis-ready data will hit this wall.

A [2025 study](https://www.jmir.org/2025/1/e66910/) across 1.8 million patient records found that only 13% of clinically relevant concepts in free text had any equivalent in structured fields. At the visit level, where a clinician documents a specific encounter, that dropped to 7%.

On top of that, the same diagnosis gets coded differently across departments, and missing values follow patterns that reflect documentation culture rather than patient reality. A model trained on this may treat these artifacts as clinical signals.
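One way to spot documentation-driven missingness before training is to compare missing-value rates for the same field across departments. The following is a minimal sketch, not a complete data-quality audit: the record layout, the `department` and `smoking_status` field names, and the toy rows are all assumptions made up for illustration.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_key, field):
    """Return the share of records with a missing `field`, per group."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Toy rows, invented for illustration only.
records = [
    {"department": "ED", "smoking_status": None},
    {"department": "ED", "smoking_status": None},
    {"department": "ED", "smoking_status": "never"},
    {"department": "ED", "smoking_status": None},
    {"department": "cardiology", "smoking_status": "former"},
    {"department": "cardiology", "smoking_status": "current"},
    {"department": "cardiology", "smoking_status": None},
    {"department": "cardiology", "smoking_status": "never"},
]

rates = missing_rate_by_group(records, "department", "smoking_status")
for department, rate in sorted(rates.items()):
    print(f"{department}: {rate:.0%} missing")
```

A large gap between departments for the same field suggests the missingness says more about how each team documents than about the patients, which is worth knowing before a model is allowed to learn from it.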

SciForce ran into this semantic standardization problem while building internal healthcare AI tools: terms from source systems that wouldn't map to standard vocabularies, clinical details lost in conversion, specialists pulled into weeks of manual work without consistent results. That's how [Jackalope](https://sciforce.solutions/case-studies/transforming-complex-medical-data-into-clinical-insights-with-jackalope-kompaepxdx7bx1hw7kwmtp74) was born – an ML-powered tool for automating medical data standardization across OMOP CDM and SNOMED CT. For teams building healthcare AI, this is not a peripheral data-cleaning task; it is the layer that determines whether a model can be trained, validated, explained, and reused across sites.

Paperwork and patient data access are a common point of collapse: you need to get ethics board approval, permission to use de-identified data, pass IT security checks, and often data use agreements. In many institutions, these processes are sequential or only partially parallel, which turns data access into a project-critical dependency rather than an administrative detail.

A [study across 277 protocols](https://www.hsrd.research.va.gov/research/citations/abstract.cfm?Identifier=85476) found that ethics review takes 112 days on average across 10 VA Institutional Review Boards – now imagine the time needed for a small startup. A [2025 multi-site study](https://www.researchgate.net/publication/395537071_Multisite_research_using_electronic_health_record_data_Lessons_learned_from_a_case_study) documented that data use agreements take 26 months to execute, with actual data extraction taking another 14-22 months. At this scale, two months of training a model easily become years of waiting for approval.

The practical response is to start the paperwork from day one, before the model architecture is even sketched. In the meantime, use publicly available datasets like [MIMIC-III/IV from PhysioNet](https://physionet.org/content/mimiciv/) or the [eICU Collaborative Research Database](https://physionet.org/content/eicu-crd/) to train your model. [Synthetic data](https://sciforce.solutions/blog/synthetic-data-a-passing-trend-or-the-future-of-ai-favo134k5h5mhlk7bhtr1f5m) can be useful for testing pipelines, interfaces, privacy-preserving workflows, and some model-development assumptions, but it should not be treated as a substitute for validation on representative real-world clinical data.

Every pilot starts the same way: the demo goes well, someone says "this could really change things", and two months later, no one is using the product.

Cydoc had paying customers who weren't using the product because it meant changing a workflow that already worked well enough. A tool can be technically sound, clinically relevant, and still end up unused for reasons that have nothing to do with the model.

Getting good scores during internal validation is a success, but it’s not a sufficient reason to deploy the model.

A [2025 JAMA Network Open study](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2843179) reviewed same-admission AI models in the literature and found that 40.2% of them used ICD codes as input data to predict mortality. However, ICD codes are assigned by billing staff after the patient is discharged and describe the final diagnosis, not what was known at the beginning of the treatment. In the authors’ mortality prediction experiment, models using ICD codes achieved very high AUROC values, illustrating label leakage rather than clinically usable prospective prediction. To avoid a similar situation, check that every input is actually available at the moment the clinician needs to use the model. Even a small second-institution validation cohort can catch what internal testing misses.
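The availability audit can be sketched as a simple timestamp check: keep only the inputs that were recorded before the moment the model is supposed to produce its prediction. This is a minimal illustration under stated assumptions. The `Feature` class, the `recorded_at` field, the six-hour prediction window, and the example timestamps are hypothetical and would need to be mapped onto whatever your source systems actually log.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Feature:
    name: str
    # When the value first exists in the source system (hypothetical field).
    recorded_at: datetime


def split_by_availability(features, prediction_time):
    """Separate inputs usable at prediction time from inputs recorded later."""
    usable = [f.name for f in features if f.recorded_at <= prediction_time]
    leaky = [f.name for f in features if f.recorded_at > prediction_time]
    return usable, leaky


admission = datetime(2025, 3, 1, 8, 0)
# Assumed use case: the model is consulted six hours after admission.
prediction_time = admission + timedelta(hours=6)

features = [
    Feature("heart_rate", admission + timedelta(minutes=15)),
    Feature("lactate", admission + timedelta(hours=2)),
    # Discharge codes are assigned after the stay, so they postdate the prediction.
    Feature("icd10_discharge_codes", admission + timedelta(days=5)),
]

usable, leaky = split_by_availability(features, prediction_time)
print("usable at prediction time:", usable)
print("recorded too late (leakage risk):", leaky)
```

Running a check like this per feature family, rather than per patient, is usually enough to flag the obvious offenders such as discharge codes and final diagnoses.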

After enough false alerts that don't give clinicians anything specific to act on, they learn that the interruption isn’t worth it.

[External validations](https://academic.oup.com/jamiaopen/article/7/4/ooae133/7900014) of Epic sepsis prediction models have repeatedly shown that performance can vary by site, threshold, patient population, and implementation context. And even when an alert fired correctly, it often arrived after sepsis had already been identified by other means. Alerts should not only be accurate; they should also arrive in time and give clinicians enough information to act differently because of them.

Another question is whether an alert system is the right interface at all. For a healthcare technology provider, SciForce built an [LLM-powered semantic search](https://sciforce.solutions/case-studies/deploying-medical-semantic-search-with-lightweight-mlops-pipelines-e9st91v2supk8nmsfpext1gi) that lets a doctor ask a question about a specific patient – in plain language, at the moment they're ready to act, and get a relevant answer pulled from the patient's records. This is a different design philosophy: instead of pushing another alert into an overloaded workflow, the system supports clinician-initiated retrieval at the point of decision.

A reliable predictor of pilot failure is a tool that requires clinicians to leave the system they already work in. Cydoc lived outside the EHR, which meant the clinical staff had to manage a second interface: one extra step for each patient on every shift.

[Duke University](https://www.jmir.org/2020/11/e22421/) hit a related workflow-integration challenge with Sepsis Watch. The sepsis prediction tool was deployed on a separate iPad, which meant nurses had to monitor the iPad, cross-reference the patient chart, and manually pass the alert to the treating physician. The hospital had to create an entirely new nursing role to connect AI and the clinical workflow. This doesn’t mean the system failed clinically. Duke later reported expansion of Sepsis Watch. But it does show that successful AI deployment may require new labor, new roles, and active workflow repair, not just a model and an interface.

Johns Hopkins solved the same problem differently. They embedded a similar sepsis model directly as a clickable icon in the existing EHR interface, with no separate system or login required. Across five hospitals, [89% of alerts](https://www.nature.com/articles/s41591-022-01895-z) were evaluated, and patients whose alerts were confirmed within three hours showed an 18.7% reduction in mortality. The lesson is not that one interface pattern always wins; it is that adoption depends on whether the tool fits the clinical decision pathway, accountability structure, and timing of care.

A successful pilot means the model worked for one institution. To turn it into a widely adopted and commercially successful product requires consistent performance at new sites, regulatory clearance, and architecture that scales without the need to rebuild it from scratch.

A [2026 multicenter study](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2845595) tested the Epic Sepsis model across numerous hospitals. The model assigns each patient a sepsis risk score based on their clinical data, but the same cutoff doesn’t work well for all hospitals. To catch 60% of sepsis cases, one hospital would need a risk score cutoff of 14, while another would need 37. An analysis across a network of nine hospitals showed that performance ranged from poor to acceptable, with no single benchmark that worked well across all sites.

Take two hospitals: a large urban teaching hospital treating post-surgical complications and ICU patients, and a smaller regional hospital receiving lower-acuity cases.

Naturally, the average patient from an urban hospital has a higher baseline sepsis risk than one from a regional site. That alone shifts the scoring baseline. The first hospital is also likely to have stronger lab infrastructure, more advanced equipment, and more detailed documentation. That means that the model trained on its data would rely on a richer data picture. A single configuration wouldn't work equally well for both sites: set the cutoff too high and the model would miss sepsis in regional hospitals; set it too low, and the model would flood the urban hospital with false alerts.
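The two-hospital scenario can be made concrete with a small calculation: for each site, find the highest cutoff that still catches a target share of sepsis cases, then see what happens when one site's cutoff is reused at the other. This is a sketch on toy scores invented for illustration. The numbers are not taken from the cited study, and the function names are hypothetical.

```python
import math


def cutoff_for_sensitivity(scores, labels, target):
    """Highest cutoff at which at least `target` of positive cases score >= cutoff."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive cases at this site")
    needed = math.ceil(target * len(positives))
    return positives[needed - 1]


def sensitivity_at(scores, labels, cutoff):
    """Share of positive cases flagged at the given cutoff."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s >= cutoff for s in positives) / len(positives)


def false_alerts_at(scores, labels, cutoff):
    """Number of non-sepsis patients flagged at the given cutoff."""
    return sum(s >= cutoff for s, y in zip(scores, labels) if y == 0)


labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# Toy risk scores: the urban site scores higher across the board.
urban_scores = [55, 48, 40, 22, 10, 35, 28, 18, 12, 8]
regional_scores = [33, 21, 15, 9, 4, 12, 10, 7, 5, 3]

urban_cutoff = cutoff_for_sensitivity(urban_scores, labels, 0.6)
regional_cutoff = cutoff_for_sensitivity(regional_scores, labels, 0.6)
print("urban cutoff:", urban_cutoff)
print("regional cutoff:", regional_cutoff)

# Reusing one site's cutoff at the other site.
missed = sensitivity_at(regional_scores, labels, urban_cutoff)
flood = false_alerts_at(urban_scores, labels, regional_cutoff)
print("regional sensitivity at urban cutoff:", missed)
print("urban false alerts at regional cutoff:", flood)
```

On this toy data, the urban cutoff catches none of the regional site's cases, while the regional cutoff flags more than half of the urban site's non-sepsis patients.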

You need to deal with this problem before deployment: avoid institution-specific dependencies, and run second-site validation during development, rather than after signing the contract. Even without such dramatic site differences, patient populations still change over time, clinical practices evolve, and documentation quirks shift. Together, those changes can quietly degrade model performance in production before anyone notices. To avoid this, continuous monitoring and retraining need to be planned during development.

For a public healthcare organization monitoring [region-wide infection spread](https://sciforce.solutions/case-studies/mlops-in-action-with-scalable-selfupdating-infection-spreading-prediction-pipeline-eseborfnf81gg4j12iyd4fbu), SciForce built a pipeline with automated retraining triggered when a drift score exceeded a defined threshold. The same practice can be applied to multi-site deployments, where each new site introduces the model to a different data environment. For clients, this changes the procurement question from “Can you build a model?” to “Can you operate and monitor this model safely after deployment?”
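A drift-triggered retraining check can be sketched with the Population Stability Index (PSI), one common drift score. The source does not say which drift metric the pipeline used, so PSI is a swapped-in example here. The bin edges, the 0.2 threshold, and the toy values are likewise assumptions for illustration.

```python
import bisect
import math


def bin_shares(values, edges, eps=1e-6):
    """Share of values per bin; `edges` are sorted interior bin boundaries."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect.bisect_right(edges, value)] += 1
    # Floor at eps so an empty bin does not produce log(0).
    return [max(count / len(values), eps) for count in counts]


def population_stability_index(reference, current, edges):
    """PSI = sum over bins of (current - reference) * ln(current / reference)."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


DRIFT_THRESHOLD = 0.2  # assumed value; tune per feature and per deployment

# Toy feature values: training-time reference vs. a recent production window.
reference = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
current = [2, 5, 6, 7, 7, 8, 8, 9, 9, 10]
edges = [3.5, 6.5]

drift = population_stability_index(reference, current, edges)
needs_retraining = drift > DRIFT_THRESHOLD
print(f"PSI = {drift:.3f}, retrain: {needs_retraining}")
```

In a multi-site setting, the same check can be run per site against the training distribution, so a new site that looks very different is flagged before its predictions are trusted.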

The line between a clinical decision support tool and a regulated medical device is not obvious.

For non-device clinical decision support, the [FDA](https://www.fda.gov/media/109821/download) focuses on statutory criteria including whether the software analyzes medical information rather than images or device signals, whether it supports rather than replaces professional judgment, and whether the clinician can independently review the basis for the recommendation.

The most consequential factors are intended use, transparency, and whether the clinician can independently review the basis for the recommendation. A tool that says "this patient has sepsis" is making a diagnostic claim and is likely regulated. A tool that says "three of the seven sepsis criteria are present in this record, here are the values" is surfacing information and leaving the judgment to the clinician, making it more likely to fall outside the regulated category. This distinction is not a loophole; it must be reflected consistently in product design, labeling, user interface, validation strategy, and sales language.

[Kintsugi](https://www.kintsugihealth.com/blog/open-source) hit the regulatory wall hard. They built a machine learning tool for anxiety and depression screening based on short free-speech voice samples. A [peer-reviewed study](https://pmc.ncbi.nlm.nih.gov/articles/PMC11772039/) across about 15,000 participants found sensitivity of 71.3% and specificity of 73.5% in detecting moderate or severe depression – a result comparable to other mental health screening tools.

To scale as a diagnostic AI product, the company needed [FDA De Novo](https://www.mdpi.com/2227-9059/13/12/3005) authorization. De Novo is the regulatory pathway for products novel enough that no FDA-cleared equivalent exists to point to, and it is the longer, more expensive route compared to the standard 510(k). For FY2026, FDA [user fees](https://www.fda.gov/industry/fda-user-fee-programs/medical-device-user-fee-amendments-mdufa-fees) are $26,067 for a 510(k) and $173,782 for a De Novo request. Review timelines vary: the FDA De Novo goal is 150 FDA review days excluding time on hold, while studies of AI/ML-enabled devices have reported longer median review times for De Novo than 510(k).

The venture-backed product was ultimately unable to survive that timeline, combined with the cost of the clearance process. In February 2026, Kintsugi shut down commercial operations and open-sourced its work.

Map your intended use case against the FDA's four-factor test before committing to a product architecture. If there is any uncertainty, engage a regulatory consultant: the cost of early advice is a fraction of what a late discovery costs.

Most early healthcare AI products are built around one institution's specific setup. That works for a pilot. The problem starts when you scale to a second site with a different EHR vendor, unfamiliar data structures and new ways of recording clinical information.

One architectural fix is to build the integration layer around standards such as HL7 FHIR where appropriate, while recognizing that [FHIR](https://www.healthit.gov/topic/standards-technology/standards/fhir-fact-sheets) alone does not solve terminology mapping, local workflow variation, historical data extraction, or analytics-ready cohort construction. Certified EHRs are now required to support FHIR-based APIs under the 21st Century Cures Act, which means a standardized data layer is achievable without custom extraction work at each new site. This creates a more realistic path to standard integration, but not a guarantee of plug-and-play deployment.
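To show what FHIR does and does not standardize, here is a minimal sketch that reads Observation resources from a FHIR Bundle and keeps only those coded in LOINC. The bundle is a hand-written sample rather than output from any real EHR, and the `urn:example:local-lab` code system and `LAC-ART` code are hypothetical stand-ins for a site-specific vocabulary.

```python
import json

LOINC = "http://loinc.org"

# Hand-written sample payload for illustration.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {"resourceType": "Observation", "id": "obs-1",
      "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4"}]},
      "valueQuantity": {"value": 88, "unit": "beats/minute"}}},
    {"resource": {"resourceType": "Observation", "id": "obs-2",
      "code": {"coding": [{"system": "http://loinc.org", "code": "8310-5"}]},
      "valueQuantity": {"value": 38.4, "unit": "Cel"}}},
    {"resource": {"resourceType": "Observation", "id": "obs-3",
      "code": {"coding": [{"system": "urn:example:local-lab", "code": "LAC-ART"}]},
      "valueQuantity": {"value": 3.1, "unit": "mmol/L"}}}
  ]
}
"""


def loinc_observations(bundle):
    """Return (code, value, unit) rows for LOINC-coded Observations,
    plus the ids of Observations that carry only non-LOINC codes."""
    rows, unmapped = [], []
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue
        codings = resource.get("code", {}).get("coding", [])
        loinc = next((c for c in codings if c.get("system") == LOINC), None)
        if loinc is None:
            unmapped.append(resource.get("id"))
            continue
        quantity = resource.get("valueQuantity", {})
        rows.append((loinc["code"], quantity.get("value"), quantity.get("unit")))
    return rows, unmapped


rows, unmapped = loinc_observations(json.loads(bundle_json))
print("LOINC-coded rows:", rows)
print("needs terminology mapping:", unmapped)
```

The resource structure is the same for all three observations, but the third still needs a terminology mapping step before a model can use it. That is the gap between a standard API and analytics-ready data.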

When a German university hospital needed to connect observational research data to operational clinical workflows, SciForce built an [OMOP CDM to HL7 FHIR conversion pipeline](https://sciforce.solutions/case-studies/automating-researchtocare-data-integration-via-omop-and-fhir-ps1niuf9hicee2orkdi1neym) that made real-time data exchange between the two systems possible.

For a US health insurer working across multiple hospital systems with inconsistent data formats, SciForce built a [cloud-native pipeline on Snowflake](https://sciforce.solutions/case-studies/from-raw-claims-and-clinical-data-to-pcornet-cdm-endtoend-etl-on-snowflake-q2jtbw0ykhto7c31071wcvo6) conforming to the PCORnet CDM standard, turning what would have been a custom integration project at each new site into a repeatable process. This is the implementation layer many healthcare AI products underestimate: not model development, but repeatable, governed data movement across heterogeneous clinical environments.

Across all three stages, most of the factors that determine whether a healthcare AI project fails or survives are not about model performance. By the time the model is ready to deploy, they are already locked in by decisions made months or years earlier.

Clinical AI is hard, the regulatory environment is still maturing, and some projects fail for genuinely unpredictable reasons. But many of the most damaging failure modes are predictable: weak workflow fit, inaccessible data, label leakage, alert fatigue, site-specific model behavior, unclear regulatory strategy, and architecture that cannot travel. While successful deployment isn’t guaranteed, removing the nine most predictable reasons for failure is a much better starting point.

At SciForce, we treat healthcare AI deployment as an infrastructure problem before we treat it as a modeling problem. That means building the data layer, terminology mapping, interoperability strategy, monitoring logic, and clinical workflow fit early enough to prevent predictable failure. If your AI product is moving from prototype to pilot, or from pilot to scale, this is the moment to examine whether the architecture is ready for real clinical environments.

