Why FHIR plus RAG is the real unlock for clinical AI, and what to get right before you build it
Generic retrieval-augmented generation works well enough on policy documents and internal wikis. It fails predictably in healthcare. The reason is not the model, the embeddings, or the vector store. It is that clinical data does not look like a policy document, and treating it as if it does produces results that look plausible and are subtly wrong.
The fix is to build retrieval on top of FHIR (Fast Healthcare Interoperability Resources), the international standard for structured clinical data. Done properly, FHIR plus RAG gives clinical AI a clean retrieval boundary, scoped per-patient context, and a foundation model that only sees what it should see. Done poorly, you get a system that confidently fabricates medications, mixes up patients, and produces summaries that read well but cannot be trusted.
This post is for solution architects and engineering leads building clinical AI systems. It explains why FHIR matters for RAG specifically, what the standard pipeline looks like, where the failure modes are, and what to get right before going to production.
The problem with generic RAG in healthcare
The standard RAG pipeline is well known: take a corpus of documents, chunk them into passages, embed each passage into a vector, store the vectors in a database, then at query time embed the question, find the most similar passages, and pass them to a foundation model as context.
This works for company handbooks and product documentation. It breaks in healthcare for four reasons.
First, the corpus is not static. A patient's record changes daily as new observations, medications, and notes are added. Re-embedding the whole corpus every time is expensive and error-prone. Embedding just the changes requires a more sophisticated pipeline than most generic RAG implementations provide.
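One way to embed just the changes is to compare each resource's version against what was last indexed. The sketch below is a minimal illustration, assuming resources carry FHIR's meta.versionId; the index_state mapping and the resources_to_embed helper are hypothetical names, not part of any library.

```python
def resources_to_embed(resources, index_state):
    """Return resources whose version differs from the one last embedded.

    index_state is a hypothetical mapping of "ResourceType/id" -> versionId.
    """
    changed = []
    for res in resources:
        key = f"{res['resourceType']}/{res['id']}"
        version = res.get("meta", {}).get("versionId")
        # Re-embed when the version is unknown or has moved on since indexing.
        if version is None or index_state.get(key) != version:
            changed.append(res)
    return changed


resources = [
    {"resourceType": "Observation", "id": "bp-1", "meta": {"versionId": "2"}},
    {"resourceType": "Condition", "id": "c-1", "meta": {"versionId": "1"}},
]
index_state = {"Observation/bp-1": "1", "Condition/c-1": "1"}

stale = resources_to_embed(resources, index_state)
print([r["id"] for r in stale])  # ['bp-1']
```

Only the updated observation is queued for re-embedding; the unchanged condition is skipped.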
Second, the retrieval scope is per-patient, not corpus-wide. Asking a question about Patient A should never retrieve passages from Patient B's record, no matter how semantically similar. Generic vector stores do not enforce this boundary by default; it has to be designed in.
Third, clinical data is heavily structured. Medications, allergies, observations, and conditions are not free text. They are coded data with specific semantic meaning. Treating them as text to be embedded loses the structure that makes them useful.
Fourth, the stakes are different. A wrong answer in a handbook RAG is a minor inconvenience. A wrong answer in a clinical RAG can be a clinical incident. The accuracy bar is higher, and the architectural choices have to reflect that.
Why FHIR changes the equation
FHIR is the international standard for healthcare data interoperability, published by HL7. It defines structured resources (Patient, Observation, MedicationRequest, Condition, Procedure, AllergyIntolerance, and many more) with well-defined schemas, identifiers, and relationships.
For clinical AI, FHIR matters because it provides three things that generic RAG does not have.
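A clean retrieval boundary per patient. Clinical FHIR resources such as Observation, MedicationRequest, Condition, and AllergyIntolerance reference the Patient they belong to, typically through a subject or patient element. Retrieval can therefore be scoped to a specific patient at the query level, before embedding similarity is even considered. Enforced consistently, this removes the main source of cross-patient contamination.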
A structured representation of clinical data. A medication is not a sentence in a note. It is a MedicationRequest resource with a code, a dosage, a frequency, a prescriber, and a date. The foundation model can be given the structured data directly when the question calls for it, and the unstructured notes only when they add something the structured data does not.
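To make the contrast with free text concrete, here is a small sketch of pulling the key facts out of a MedicationRequest. It assumes FHIR R4 field names (R5 moves the medication under a different element); the identifiers, the coding placeholder, and the extract_medication_facts helper are illustrative only.

```python
medication_request = {
    "resourceType": "MedicationRequest",
    "id": "mr-001",
    "status": "active",
    "intent": "order",
    "subject": {"reference": "Patient/example-123"},
    "authoredOn": "2025-03-14",
    "requester": {"reference": "Practitioner/example-456"},
    "medicationCodeableConcept": {
        # Placeholder coding; a real record would carry a terminology code here.
        "coding": [{"system": "http://snomed.info/sct", "code": "EXAMPLE-CODE"}],
        "text": "Metformin 500 mg tablet",
    },
    "dosageInstruction": [{"text": "One tablet twice daily with food"}],
}


def extract_medication_facts(mr):
    """Flatten the fields a summary usually needs, keeping the source id for citation."""
    concept = mr.get("medicationCodeableConcept", {})
    dosage = mr.get("dosageInstruction", [])
    return {
        "source": f"MedicationRequest/{mr['id']}",
        "patient": mr["subject"]["reference"],
        "medication": concept.get("text"),
        "dosage": dosage[0].get("text") if dosage else None,
        "status": mr.get("status"),
        "prescriber": mr.get("requester", {}).get("reference"),
        "authored_on": mr.get("authoredOn"),
    }


facts = extract_medication_facts(medication_request)
print(facts["medication"], "|", facts["dosage"], "|", facts["authored_on"])
```

Nothing here needs an embedding: the medication, dose, prescriber, and date are read straight from the resource, which is the point of keeping the structure.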
A standard data model that survives vendor changes. FHIR is supported by all major hyperscale cloud providers (AWS HealthLake, Azure Health Data Services, Google Cloud Healthcare API), most major EHR vendors, and a wide ecosystem of open-source servers. Building on FHIR means the data layer is portable.
Recent academic work confirms what production teams have been finding. A 2025 study published as FHIR-RAG-MEDS demonstrated that integrating HL7 FHIR with RAG-based foundation models for clinical decision support produced consistently higher semantic accuracy, improved faithfulness to guideline content, and stronger clinical relevance than state-of-the-art medical foundation models alone, evaluated across 70 physician-generated clinical questions. The pattern is not theoretical; it is increasingly the default.
The FHIR plus RAG pipeline, end to end
A working clinical RAG pipeline has five stages. Each one has design choices that matter.
Stage 1: FHIR-native data store
The patient record lives in a FHIR-compliant store. This can be a managed service (AWS HealthLake, Azure Health Data Services, Google Cloud Healthcare API), a commercial FHIR server, or an open-source server like HAPI FHIR. The choice depends on existing infrastructure, data sovereignty requirements (covered in detail in our post on data sovereignty for ANZ healthcare AI), and integration with source systems.
The key property of this stage is that data is structured, queryable, and identifier-linked. Clinical resources reference their Patient, observations carry a timestamp, and medications carry a code.
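"Queryable" in practice means a FHIR search scoped by patient. The sketch below builds such a search URL using standard FHIR search parameters (patient, category, _sort, _count); the base URL and the build_patient_search helper are assumptions for illustration, and authentication is left out.

```python
from urllib.parse import urlencode


def build_patient_search(base_url, resource_type, patient_id, **params):
    """Build a FHIR search URL that is always scoped to one patient."""
    if "patient" in params:
        # Refuse to let a caller override the patient scope through extra params.
        raise ValueError("patient scope is set from patient_id only")
    query = {"patient": f"Patient/{patient_id}", **params}
    return f"{base_url.rstrip('/')}/{resource_type}?{urlencode(query)}"


url = build_patient_search(
    "https://fhir.example.org/r4",  # hypothetical server
    "Observation",
    "example-123",
    category="vital-signs",
    _sort="-date",
    _count=20,
)
print(url)
```

Putting the patient parameter in one place, and rejecting attempts to pass it any other way, keeps the scope out of the hands of individual call sites.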
Stage 2: Embedding generation
Clinical content is converted to vectors using an embedding model. The choice of embedding model matters more here than in generic RAG, because clinical language is dense with domain-specific terms, abbreviations, and codes. A model trained on general web text will produce embeddings that group "myocardial infarction" with "heart attack" reasonably well, but may struggle with less common clinical terms.
Two practical decisions:
What to embed. Free-text clinical notes are the obvious candidate. But structured resources can also be embedded after being rendered to natural language, allowing semantic search across both. The trade-off is complexity versus completeness.
Per-patient scoping. Embeddings should be tagged with the patient identifier at index time. This allows retrieval to filter by patient before similarity search, enforcing the privacy boundary architecturally rather than relying on application logic.
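Both decisions can be sketched together: render a structured resource to a sentence, then attach the patient identifier and source reference before anything is embedded. The field names follow FHIR R4 Observation; render_observation and to_index_record are hypothetical helpers, and the embedding call itself is deliberately left out.

```python
def render_observation(obs):
    """Turn a coded Observation into a short sentence suitable for embedding."""
    name = obs["code"].get("text", "Unknown observation")
    qty = obs.get("valueQuantity")
    value = f"{qty['value']} {qty.get('unit', '')}".strip() if qty else "no recorded value"
    when = obs.get("effectiveDateTime", "unknown date")
    return f"{name}: {value} (recorded {when})"


def to_index_record(obs):
    """Build the record to embed, tagged with patient id and source at index time."""
    patient_ref = obs["subject"]["reference"]
    return {
        # Works for relative ("Patient/x") and absolute references alike.
        "patient_id": patient_ref.rsplit("/", 1)[-1],
        "source": f"Observation/{obs['id']}",
        "text": render_observation(obs),
    }


observation = {
    "resourceType": "Observation",
    "id": "bp-1",
    "subject": {"reference": "Patient/example-123"},
    "code": {"text": "Systolic blood pressure"},
    "valueQuantity": {"value": 142, "unit": "mmHg"},
    "effectiveDateTime": "2025-03-14T08:30:00+11:00",
}

record = to_index_record(observation)
print(record["text"])
```

The patient_id travels with the vector as metadata, so the retrieval layer can filter on it without re-reading the source record.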
Stage 3: Vector storage
Vectors are indexed in a database that supports similarity search. The architectural choices here are well documented elsewhere (pgvector, OpenSearch, Pinecone, and many others all work). The healthcare-specific consideration is data residency: the vector store should sit in the same jurisdiction as the source data, because vectors are derived from protected health information and should be treated with the same care.
Stage 4: Retrieval
This is where most clinical RAG implementations either succeed or fail.
At query time, the patient context is established first. This is non-negotiable: the user is asking a question about a specific patient, and retrieval must be scoped to that patient before any similarity search runs. Filter first, then search.
Within the scoped context, similarity search returns the most relevant passages or resources for the question. This is where good embedding choices pay off.
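"Filter first, then search" can be shown in a few lines. This is an in-memory sketch with hand-made three-dimensional vectors standing in for real embeddings; in production the patient filter would be part of the vector store query itself. The retrieve function and the index layout are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index, patient_id, query_vec, k=3):
    # Step 1: hard filter on patient. Other patients never reach the ranking step.
    scoped = [r for r in index if r["patient_id"] == patient_id]
    # Step 2: similarity ranking inside the scoped set only.
    ranked = sorted(scoped, key=lambda r: cosine(r["vector"], query_vec), reverse=True)
    return ranked[:k]


index = [
    {"patient_id": "A", "source": "Observation/a1", "vector": [0.9, 0.1, 0.0]},
    {"patient_id": "A", "source": "MedicationRequest/a2", "vector": [0.1, 0.9, 0.0]},
    {"patient_id": "B", "source": "Observation/b1", "vector": [1.0, 0.0, 0.0]},
]

hits = retrieve(index, "A", [1.0, 0.0, 0.0], k=2)
print([h["source"] for h in hits])  # ['Observation/a1', 'MedicationRequest/a2']
```

Patient B's record is the closest match to the query vector, and it is still never returned, because it was removed before similarity was computed.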
For complex queries, retrieval may happen in multiple stages: a first retrieval to identify relevant episodes or encounters, then a second retrieval within those to find specific data points. The orchestration libraries that support this kind of multi-step retrieval (LangChain, LlamaIndex, and others) make it tractable.
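The two-stage idea does not depend on any particular library. A minimal sketch, using word overlap as a crude stand-in for embedding similarity (an assumption made only to keep the example self-contained), with hypothetical two_stage_retrieve and overlap_scorer helpers:

```python
def overlap_scorer(question):
    """Toy relevance score: count of words shared with the question."""
    q = set(question.lower().split())
    return lambda item: len(q & set(item["text"].lower().split()))


def two_stage_retrieve(encounters, records, score, top_encounters=1, top_records=2):
    # Stage 1: pick the most relevant encounters.
    best = sorted(encounters, key=score, reverse=True)[:top_encounters]
    keep = {e["id"] for e in best}
    # Stage 2: rank only the records that belong to those encounters.
    in_scope = [r for r in records if r["encounter_id"] in keep]
    return sorted(in_scope, key=score, reverse=True)[:top_records]


encounters = [
    {"id": "e1", "text": "emergency admission chest pain"},
    {"id": "e2", "text": "routine diabetes review"},
]
records = [
    {"id": "r1", "encounter_id": "e1", "text": "troponin elevated chest pain workup"},
    {"id": "r2", "encounter_id": "e1", "text": "discharge summary"},
    {"id": "r3", "encounter_id": "e2", "text": "chest x-ray clear"},
]

score = overlap_scorer("chest pain troponin")
result = two_stage_retrieve(encounters, records, score)
print([r["id"] for r in result])  # ['r1', 'r2']
```

The chest x-ray from the unrelated encounter shares a word with the question but is excluded, because the first stage already narrowed the search to the relevant episode.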
Stage 5: Generation
The foundation model receives the question and the retrieved context. It generates an answer, a summary, or a draft document.
Three things matter here. First, the prompt should constrain the model to only use the retrieved context, not its training data. Clinical hallucination from training data is a real risk. Second, the output should be structured where possible (JSON, FHIR-aligned formats) so it can be validated programmatically before being shown to a clinician. Third, citations back to the source resources should be preserved so the clinician can verify any claim.
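The first and third points can both be built into the prompt itself. The sketch below is one possible prompt layout, not a recommended wording; build_prompt and the passage format are assumptions. A prompt constraint reduces reliance on training data but does not guarantee it, which is why the output still needs programmatic validation.

```python
def build_prompt(question, passages):
    """Assemble a context-only prompt where every passage carries its source id."""
    context = "\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n"
        "Cite the bracketed source id for every clinical claim.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


passages = [
    {"source": "Observation/a1", "text": "Systolic blood pressure: 142 mmHg (recorded 2025-03-14)"},
    {"source": "MedicationRequest/a2", "text": "Metformin 500 mg tablet, twice daily (prescribed 2025-03-14)"},
]

prompt = build_prompt("What is the most recent blood pressure reading?", passages)
print(prompt)
```

Because each passage is labelled with its FHIR reference, a citation in the model's answer can be traced back to a specific resource.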
The failure modes that catch teams off guard
Even with the right pipeline, several failure modes are common enough to call out explicitly.
Stale embeddings. A patient's record changes constantly. If the embedding index is only refreshed nightly, an AI summary requested in the afternoon may miss the morning's blood pressure reading. The fix is either to update embeddings in near real time as resources change, or to track when the index was last refreshed and warn the user whenever the record has changed since then.
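A freshness check can be as simple as comparing two timestamps. This sketch assumes the index records its last refresh time and that the newest meta.lastUpdated for the patient is available; freshness_warning is a hypothetical helper, and timestamps use explicit UTC offsets.

```python
from datetime import datetime


def freshness_warning(index_refreshed_at, latest_record_update):
    """Return a warning string if the record changed after the index was built."""
    refreshed = datetime.fromisoformat(index_refreshed_at)
    updated = datetime.fromisoformat(latest_record_update)
    if updated > refreshed:
        lag = updated - refreshed
        return f"Record changed {lag} after the last index refresh; summary may be incomplete."
    return None


warning = freshness_warning(
    "2025-03-14T02:00:00+11:00",  # nightly refresh
    "2025-03-14T08:30:00+11:00",  # morning observation
)
print(warning)
```

Here the morning reading arrived six and a half hours after the nightly refresh, so the summary would be shown with a warning rather than silently omitting it.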
Cross-patient contamination. If retrieval scoping happens in application logic rather than at the database query level, a bug or misconfiguration can leak context from another patient. The fix is to enforce scoping at the database level, not in code that can be bypassed.
Hallucinated medications and allergies. Foundation models will sometimes invent plausible-sounding medications or allergies that are not in the retrieved context. The fix is to require the model to cite the specific FHIR resource for any clinical claim, and to validate citations programmatically before display.
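Citation validation is straightforward once the model's output is structured. The sketch below assumes the model returns JSON with a claims list, each claim carrying the FHIR references it relies on; that output shape and the validate_claims helper are assumptions, not a standard.

```python
import json


def validate_claims(model_output, retrieved_sources):
    """Return the text of claims that are uncited or cite something not retrieved."""
    claims = json.loads(model_output)["claims"]
    rejected = []
    for claim in claims:
        cited = claim.get("sources", [])
        if not cited or any(src not in retrieved_sources for src in cited):
            rejected.append(claim["text"])
    return rejected


retrieved_sources = {"MedicationRequest/a2", "Observation/a1"}
model_output = json.dumps({
    "claims": [
        {"text": "Currently prescribed metformin", "sources": ["MedicationRequest/a2"]},
        {"text": "Allergic to penicillin", "sources": ["AllergyIntolerance/zz"]},
    ]
})

rejected = validate_claims(model_output, retrieved_sources)
print(rejected)  # ['Allergic to penicillin']
```

The allergy claim cites a resource that was never retrieved, so it is held back before display. This checks that a citation exists and is in scope; it does not check that the cited resource actually supports the claim, which still needs a faithfulness check or clinician review.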
Mismatched temporal context. A summary might combine data from across years without making the dates clear, leading a clinician to believe a past medication is current. The fix is to require date-aware output formatting and to validate that medications, observations, and conditions are presented with their dates.
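Date-aware formatting and its validation can be sketched as two small functions. This assumes R4 MedicationRequest fields and ISO 8601 dates in a consistent format (so string sorting matches date order); format_medications and lines_missing_dates are illustrative names.

```python
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def format_medications(med_requests):
    """One line per medication, newest first, always showing status and date."""
    ordered = sorted(med_requests, key=lambda m: m["authoredOn"], reverse=True)
    return [
        f"{m['medicationCodeableConcept']['text']} | status: {m['status']} | prescribed: {m['authoredOn']}"
        for m in ordered
    ]


def lines_missing_dates(lines):
    """Flag any output line that does not contain an ISO date."""
    return [line for line in lines if not ISO_DATE.search(line)]


meds = [
    {"medicationCodeableConcept": {"text": "Warfarin 5 mg tablet"}, "status": "stopped", "authoredOn": "2019-06-02"},
    {"medicationCodeableConcept": {"text": "Metformin 500 mg tablet"}, "status": "active", "authoredOn": "2025-03-14"},
]

lines = format_medications(meds)
for line in lines:
    print(line)
```

With status and date on every line, a stopped medication from 2019 cannot be mistaken for a current prescription, and the same check can be run over model-generated text before it reaches a clinician.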
Out-of-context generalisation. Foundation models trained on US clinical data may make recommendations that do not fit Australian or New Zealand clinical practice (different drug names, different guideline frameworks, different funding constraints). The fix is to ground recommendations in locally-relevant guidelines retrieved at query time, not in the model's training data.
Validation: how good is good enough?
A 2024 Australian study applying RAG to electronic health records in aged care settings found that zero-shot generative AI achieved 93.25% accuracy in summarising nutritional status from EHRs, and that adding RAG improved this to 99.25%. That is a meaningful gain, and it is consistent with the broader literature showing RAG substantially reduces hallucinations compared to standard foundation model outputs.
But even 99.25% accuracy leaves about 0.75% of outputs wrong, and in a clinical setting those errors need to be caught by something. That something is the human-in-the-loop design, which is a topic in its own right and worth its own treatment.
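A quick back-of-envelope calculation shows what those percentages mean at volume. The 1,000-summary volume is an illustrative assumption, and so is the idea that the study's accuracy figures would transfer to a different setting.

```python
def expected_errors(n_outputs, accuracy_hundredths_of_percent):
    """Expected wrong outputs; accuracy is given as an integer (99.25% -> 9925)."""
    return n_outputs * (10000 - accuracy_hundredths_of_percent) / 10000


with_rag = expected_errors(1000, 9925)
zero_shot = expected_errors(1000, 9325)
print(with_rag)   # 7.5
print(zero_shot)  # 67.5
```

At that volume, RAG takes the expected number of wrong summaries from roughly 68 to roughly 8. That is a large reduction, and it still leaves several errors per thousand for a reviewer to catch.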
The point for this post is that validation is not optional. Every clinical RAG system needs a documented evaluation methodology, ideally with both automated metrics (faithfulness, groundedness, citation accuracy) and clinician review. Without that, "accurate enough for production" is a claim, not a fact.
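One way to combine automated metrics with clinician review is to record both verdicts per evaluated answer and look at where they disagree. The result fields and the summarise_evaluation helper below are hypothetical; the data is made up purely to show the calculation.

```python
def summarise_evaluation(results):
    """Aggregate automated citation checks and clinician verdicts over an eval set."""
    total = len(results)
    cited_ok = sum(r["citation_valid"] for r in results)
    approved = sum(r["clinician_approved"] for r in results)
    # Disagreements are the interesting cases: the metric and the clinician differ.
    disagreements = [r["id"] for r in results if r["citation_valid"] != r["clinician_approved"]]
    return {
        "citation_accuracy": cited_ok / total,
        "clinician_approval_rate": approved / total,
        "needs_review": disagreements,
    }


results = [
    {"id": "q1", "citation_valid": True, "clinician_approved": True},
    {"id": "q2", "citation_valid": True, "clinician_approved": True},
    {"id": "q3", "citation_valid": True, "clinician_approved": False},
    {"id": "q4", "citation_valid": False, "clinician_approved": False},
]

summary = summarise_evaluation(results)
print(summary)
```

The q3 case passes the automated check but fails clinician review, which is exactly the kind of gap an automated metric alone would hide.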
What to get right before you build
If you are about to build a clinical RAG system, four things will determine whether it succeeds.
Use FHIR-native storage from the start. Retrofitting FHIR onto a clinical AI system that was built on unstructured data is painful. Start with structured.
Enforce per-patient retrieval scoping at the database level. Application-level scoping is a security incident waiting to happen.
Choose embedding and foundation models that handle clinical language. General-purpose models work; domain-aware models often work better. Benchmark on your actual data, not on someone else's.
Design the human-in-the-loop before you design the AI. The clinician is part of the system. Their workflow, their review burden, and their ability to override the AI all need to be designed alongside the technical pipeline, not bolted on at the end.
FHIR plus RAG is genuinely the right architecture for most clinical AI applications. It is not a silver bullet, and the failure modes are real, but the pattern is well understood and the tooling is mature enough to use in production. The teams that get clinical AI right are not the ones with the best models. They are the ones with the cleanest retrieval boundaries.
Easycoder is an AWS Advanced Partner working with healthcare providers, payers, and health technology companies across Australia and New Zealand on cloud, AI, and regulatory technology. If you are designing a clinical RAG system and want a second pair of eyes on the architecture, get in touch.



