USE CASE · PROVIDER DATA FOR AI & RAG

Ground your AI agent in provider data it can cite back to the source.

A chunk-ready federal provider corpus with stable IDs and source metadata where available. Provenance fields are nullable and vary by endpoint, so governance reviews must inspect the fields actually returned.

Request access → Compliance & risk teams
See the RAG chunks API → Developers & AI teams

providers within the active production registry (registry status is not completeness, load status, or freshness) · Source-specific provenance · REST + MCP

✓ No PHI  ✓ Citation-stable chunk IDs  ✓ 14-field provenance  ✓ Deterministic re-pull

The AI-governance exposure

An ungrounded provider fact is a liability.

When a model answers with a provider's enrollment status, a sanction, or an NPI and cannot point to a source, that answer cannot be defended. In healthcare, an invented or stale provider fact is not a typo — it is a claim no one can stand behind. The fix is not a bigger model; it is retrieval where each chunk already carries its source and date.

Fonteum exposes source, snapshot, and methodology fields where the chunk builder has them. A grounded answer should cite only those returned fields; historical re-derivation also depends on whether the named source retains the relevant version.

The developer pain

Reference-heavy FHIR and index churn.

Raw FHIR JSON is reference-heavy — a Practitioner points to a PractitionerRole that points to an Organization that points to a Location — so building one chunk of context takes several round trips, and the coding-system URIs and extension blocks bloat every token budget. Worse, the raw files carry no chunk identity, so each re-pull re-embeds everything and churns your vector store.

One paginated GET returns pre-resolved, pre-chunked text with a stable chunk_id and the provenance block inline. The flattening, the chunking, and the citation are the response — not something you reconstruct downstream.

How it works

The corpus shape, and how a chunk stays citable.

Flat

Pre-resolved

Nested FHIR references are resolved server-side into a flat, fully-populated chunk — no client-side round trips to assemble context.

Stable

Chunk IDs

A deterministic chunk_id per record. Use it as the vector primary key; a re-pull upserts in place instead of duplicating vectors.

Nullable

Provenance fields

Source, source URL, snapshot date, last-checked date, methodology, and confidence appear where available; callers must preserve nulls and gaps.

MCP

Live agent path

The same graph is exposed as Model Context Protocol tools, so an agent can query live with the same provenance the static index carries.

Integration & workflow

One corpus. Two ways in.

Developers page the chunks endpoint into a vector store. Governance owners stand up a corpus where each retrieved fact carries the citation metadata the endpoint returned for it. Same data, same provenance, same source snapshot.

GET /api/v1/rag/chunks

curl "https://fonteum.com/api/v1/rag/chunks?limit=50&cursor=0" \
  -H "Accept: application/json"

Response

{
  "total": 124817,
  "next_cursor": 50,
  "chunks": [
    {
      "chunk_id": "source:nppes#overview",
      "text": "The CMS NPPES registry enumerates US healthcare providers ...",
      "cite": "CMS NPPES, snapshot 2026-05-01",
      "source_url": "https://npiregistry.cms.hhs.gov/",
      "dataset_id": "nppes/v1",
      "provenance": {
        "_source": "CMS NPPES",
        "_source_url": "https://npiregistry.cms.hhs.gov/",
        "_snapshot": "2026-05-01",
        "_last_checked": "2026-06-17",
        "_methodology": "rag-chunks/v1",
        "_confidence": 1.0
      }
    }
  ]
}

Public endpoint, rate-limited per source IP. Map each chunk to a LangChain Document or LlamaIndex TextNode: chunk_id as the id, text as the embed body, the provenance block as metadata. Full LangChain / LlamaIndex / MCP walkthroughs live in /docs/integrations.
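As a rough illustration of that mapping, the sketch below turns one chunk shaped like the sample response into a plain node object. `ChunkNode` and `chunk_to_node` are hypothetical names, not part of any Fonteum SDK, LangChain, or LlamaIndex; the same three pieces would be handed to whichever document class your framework provides. Null or missing provenance values are carried through as-is rather than filled in.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ChunkNode:
    """Framework-neutral stand-in for a LangChain Document or LlamaIndex TextNode."""
    id: str
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def chunk_to_node(chunk: Dict[str, Any]) -> ChunkNode:
    """Map one chunk dict (shaped like the sample response) to a node."""
    # Copy the provenance block so edits to the node never mutate the payload.
    provenance: Dict[str, Any] = dict(chunk.get("provenance") or {})
    metadata = {
        "cite": chunk.get("cite"),              # may be None; never invented here
        "source_url": chunk.get("source_url"),
        "dataset_id": chunk.get("dataset_id"),
        **provenance,                            # _source, _snapshot, ... as returned
    }
    return ChunkNode(id=chunk["chunk_id"], text=chunk["text"], metadata=metadata)


# Sample values copied from the response above; the null _confidence is
# illustrative, to show that gaps survive the mapping.
sample = {
    "chunk_id": "source:nppes#overview",
    "text": "The CMS NPPES registry enumerates US healthcare providers ...",
    "cite": "CMS NPPES, snapshot 2026-05-01",
    "source_url": "https://npiregistry.cms.hhs.gov/",
    "dataset_id": "nppes/v1",
    "provenance": {"_source": "CMS NPPES", "_snapshot": "2026-05-01", "_confidence": None},
}
node = chunk_to_node(sample)
```

Keeping the node type framework-neutral means the id, body, and metadata can be passed to a vector store unchanged, with `chunk_id` serving as the primary key.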

01 Page the corpus with ?limit= and the returned next_cursor: one deterministic GET per page, no language model in the pipeline.
02 Map each chunk to a vector node: chunk_id as the primary key, text as the embed body, the provenance block as metadata.
03 Embed and index. Because chunk IDs are stable, a later re-pull upserts in place rather than duplicating vectors.
04 At answer time, carry the chunk's cite string and source_url into the response so every grounded fact footnotes its federal source.
05 Keep the provenance record with the answer: the audit trail that shows a model-surfaced provider fact is real, dated, and re-derivable.
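The paging and upsert steps above can be sketched roughly as follows. This is a minimal illustration, not Fonteum's client: `fetch_page`, `iter_chunks`, `upsert_all`, and the dict standing in for a vector store are hypothetical, and the loop assumes that `next_cursor` is null or absent on the last page, which the sample response does not confirm. The HTTP call is injected so the loop can be exercised without network access.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

BASE_URL = "https://fonteum.com/api/v1/rag/chunks"  # endpoint from the curl example

Page = Dict[str, Any]


def fetch_page(cursor: int, limit: int = 50) -> Page:
    """GET one page of chunks. Needs network, so the demo below does not call it."""
    query = urllib.parse.urlencode({"limit": limit, "cursor": cursor})
    req = urllib.request.Request(f"{BASE_URL}?{query}", headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_chunks(fetch: Callable[[int], Page], start: int = 0) -> Iterator[Dict[str, Any]]:
    """Yield chunks page by page, following next_cursor.

    Assumption: the last page has a null or missing next_cursor, or no chunks.
    """
    cursor: Optional[int] = start
    while cursor is not None:
        page = fetch(cursor)
        chunks = page.get("chunks") or []
        if not chunks:
            break
        yield from chunks
        next_cursor = page.get("next_cursor")
        # Stop if the cursor does not advance, to avoid looping forever.
        cursor = next_cursor if next_cursor != cursor else None


def upsert_all(store: Dict[str, Dict[str, Any]], chunks: Iterable[Dict[str, Any]]) -> int:
    """Key each record by chunk_id so a re-pull overwrites instead of duplicating."""
    count = 0
    for chunk in chunks:
        store[chunk["chunk_id"]] = {
            "text": chunk["text"],
            "provenance": chunk.get("provenance"),  # kept as returned, nulls included
        }
        count += 1
    return count


# Offline demo: two fake pages stand in for the API.
fake_pages = {
    0: {"next_cursor": 2, "chunks": [{"chunk_id": "a", "text": "A"}, {"chunk_id": "b", "text": "B"}]},
    2: {"next_cursor": None, "chunks": [{"chunk_id": "c", "text": "C"}]},
}
store: Dict[str, Dict[str, Any]] = {}
first = upsert_all(store, iter_chunks(lambda c: fake_pages[c]))
second = upsert_all(store, iter_chunks(lambda c: fake_pages[c]))  # simulated re-pull
```

Against the live endpoint the same loop would be driven with `iter_chunks(fetch_page)`. The second pass processes the same three chunks, yet the store still holds three records, which is the upsert-in-place behavior step 03 describes.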

Sample audit-evidence artifact

RAG CITATION — PROVENANCE RECORD
Answer fact ......... NPI 1003894328 is enrolled in Medicare
Grounded on chunk ... source:pecos#1003894328
Source .............. CMS PECOS
Source URL .......... https://data.cms.gov/provider-enrollment
Source snapshot ..... 2026-05-01
Last checked (UTC) .. 2026-06-17T14:02:11Z
Methodology ......... rag-chunks/v1
Re-derivable ........ GET /api/v1/rag/chunks (deterministic chunk_id)

The provenance record is the governance artifact: it shows a model-surfaced provider fact, the chunk it was grounded on, and the federal source and snapshot it re-derives from.
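One way such a record could be assembled from a retrieved chunk is sketched below. `render_provenance_record` is a hypothetical helper, the field names follow the sample response, and any field the endpoint did not return is printed as `not returned` rather than guessed. The chunk values are copied from the sample artifact for illustration only.

```python
from typing import Any, Dict


def render_provenance_record(answer_fact: str, chunk: Dict[str, Any]) -> str:
    """Format a plain-text audit record from one retrieved chunk."""
    prov = chunk.get("provenance") or {}
    rows = [
        ("Answer fact", answer_fact),
        ("Grounded on chunk", chunk.get("chunk_id")),
        ("Source", prov.get("_source")),
        ("Source URL", prov.get("_source_url")),
        ("Source snapshot", prov.get("_snapshot")),
        ("Last checked (UTC)", prov.get("_last_checked")),
        ("Methodology", prov.get("_methodology")),
    ]
    lines = ["RAG CITATION - PROVENANCE RECORD"]
    for label, value in rows:
        # Pad with dots so values line up in a fixed column, as in the sample.
        dots = "." * max(2, 20 - len(label))
        shown = "not returned" if value is None else str(value)
        lines.append(f"{label} {dots} {shown}")
    return "\n".join(lines)


# Illustrative chunk: the source URL is deliberately null to show gap handling.
chunk = {
    "chunk_id": "source:pecos#1003894328",
    "provenance": {"_source": "CMS PECOS", "_snapshot": "2026-05-01", "_source_url": None},
}
record = render_provenance_record("NPI 1003894328 is enrolled in Medicare", chunk)
print(record)
```

Writing gaps out explicitly keeps the record honest for a governance review: a reader can see which provenance fields were present at retrieval time and which were not.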

Proof — not logos

Chunks expose the public-source metadata available to the endpoint.

Providers

Unique NPIs enumerated from the CMS NPI Registry — the identity backbone the corpus is chunked from.

Federal-source documentation

A documented catalog within the active production registry; registry status is not completeness, load status, or freshness. A retrieved chunk identifies its originating source and available snapshot metadata; coverage is bounded by the datasets exposed through this endpoint.

rag-chunks/v1

Methodology

Chunks can identify the chunking methodology version; reproducing a corpus also requires the same retained inputs and source release.

Source-cited

Evidence

Chunks carry their source and snapshot metadata when available. The 26.2M fact ledger currently has no deterministic claim-to-signature link.

“A retrieved fact is only as good as the source it carries. The chunk ID, the source, and the snapshot are the product.”

PROVIDER DATA FOR AI & RAG

Index a citation-stable provider corpus over REST or MCP.

Request access → See a live sample response

Questions

Before the security questionnaire.

Why not just embed the raw NPPES and CMS files myself?

You can. The raw FHIR and CSV files are reference-heavy and carry no chunk identity, so a re-pull can churn your vector index. Fonteum returns pre-resolved, pre-chunked text with a stable chunk_id and the provenance fields available for that chunk; callers must handle null or absent source fields.

How does a chunk keep its citation?

A chunk can include source name, source URL, snapshot date, last-checked date, methodology version, and a cite string. Availability varies by source and endpoint, so carry only the metadata actually returned and do not infer missing fields.
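A small sketch of that rule: build the footnote from whichever fields came back and leave the rest out. `format_citation` is a hypothetical helper and the field names are taken from the sample response. It prefers the server-supplied `cite` string and never synthesizes a snapshot date or URL.

```python
from typing import Any, Dict, Optional


def format_citation(chunk: Dict[str, Any]) -> Optional[str]:
    """Return a footnote built only from fields present in the chunk, or None."""
    prov = chunk.get("provenance") or {}
    cite = chunk.get("cite")
    source_url = chunk.get("source_url") or prov.get("_source_url")

    if not cite:
        source = prov.get("_source")
        if not source:
            return None  # nothing citable was returned; do not guess
        snapshot = prov.get("_snapshot")
        cite = f"{source}, snapshot {snapshot}" if snapshot else source

    return f"{cite} <{source_url}>" if source_url else cite


full = format_citation({
    "cite": "CMS NPPES, snapshot 2026-05-01",
    "source_url": "https://npiregistry.cms.hhs.gov/",
})
partial = format_citation({"provenance": {"_source": "CMS PECOS", "_snapshot": None}})
missing = format_citation({"text": "no source metadata"})
```

Returning `None` when no source fields exist lets the caller decide whether to withhold the answer or flag it as uncited, instead of attaching a citation the data cannot support.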

Will the chunk IDs stay stable across pulls?

Yes. Chunk IDs are deterministic — the same record produces the same chunk_id on every pull, so an incremental re-pull upserts cleanly instead of duplicating vectors. Re-embedding only touches chunks whose underlying federal record actually changed.
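The page does not say how a client should detect which chunks changed. One common approach, shown here purely as an assumption, is to store a hash of each chunk's text next to its vector and re-embed only when the hash differs. `text_hash`, `plan_reembed`, and the `index` dict are hypothetical.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def text_hash(text: str) -> str:
    """Stable fingerprint of the text that was embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reembed(index: Dict[str, str], pulled: Iterable[Dict[str, Any]]) -> List[str]:
    """Return chunk_ids whose text is new or changed, and update the stored hashes.

    index maps chunk_id -> hash of the text that was last embedded.
    """
    to_embed: List[str] = []
    for chunk in pulled:
        cid, digest = chunk["chunk_id"], text_hash(chunk["text"])
        if index.get(cid) != digest:
            to_embed.append(cid)
            index[cid] = digest
    return to_embed


# Two simulated pulls with made-up chunk text; only chunk "b" changes.
index: Dict[str, str] = {}
first_pull = [{"chunk_id": "a", "text": "enrolled"}, {"chunk_id": "b", "text": "active"}]
second_pull = [{"chunk_id": "a", "text": "enrolled"}, {"chunk_id": "b", "text": "deactivated"}]
initial = plan_reembed(index, first_pull)
changed = plan_reembed(index, second_pull)
```

Because the key is the stable `chunk_id`, the second pull flags only the chunk whose text differs and leaves the index at two entries.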

Can my agent call this over MCP instead of REST?

Yes. The same provider graph is exposed as Model Context Protocol tools — search, resolve-by-NPI, exclusion check, dataset info, source list. MCP and REST responses expose route-specific provenance fields rather than a guaranteed tuple on every fact.

Is any patient data in the corpus?

No. The corpus is built only from public federal and state provider records keyed by NPI, CCN, and PECOS-ID. There is no PHI in the pipeline and none is required to index, retrieve, or cite a provider fact.

FONTEUM · PROVIDER DATA FOR AI

Ground your agent on public data only. No PHI.

Request access → Read the API docs
