The provider graph behind Fonteum.
Every displayed field needs a source, a date, and a display rule. This page is the public map of how Fonteum turns federal and state public records into source-cited provider data — with the limitations CMS itself documents.
Today the graph carries registered sources, per-business source matches, and field-level provenance rows.
- Provenance register: v1.1
- Last reviewed: 2026-05-08
- Snapshot: 2026-05-09
What is in the graph today
Snapshot 2026-05-09
- 113,537 providers tracked: Active rows in `businesses` across all 40 verticals
- 40 active verticals: Each carries its own archetype + display contract
- 12 registered sources: Federal + state public-record datasets · 5 Tier-2 profile-enrichment
- 574 provider-source matches: Per-business source-record links recorded with confidence
- 2,758 provenance field rows: Each row carries source + last-checked-date + value
- 9
Seven stages from public source to displayed field.
Every fact Fonteum shows traces back through these seven stages. If a stage is missing — no source, no manifest, no ingestion run, no match, no provenance row, no display permission — the fact does not render.
- STAGE 01
Public source
An official, public-record dataset published by a federal or state agency: NPPES, CMS PECOS, FL DBPR, CMS Care Compare, BLS, HRSA, US Census.
data.cms.gov · myfloridalicense.com · npiregistry.cms.hhs.gov
- STAGE 02
Source pack
A typed manifest for each public source. Names the slug, tier, allowed verticals + states, fetch method, ToS notes, refresh cadence, and field-level display rules.
src/lib/data-ingestion/source-packs/*.ts
- STAGE 03
Ingestion run
A dated execution of a pilot script that downloads or queries the source. Records fetch method (rest-api / bulk-file / manual-csv), record count, and run notes.
ingestion_runs
- STAGE 04
Entity match
Per-business link from a source record to a platform business via name + city + state + phone + ZIP Jaccard scoring. Confidence ≥ 0.75 to write; ambiguous cases logged but not displayed.
provider_source_matches · entity_match_log
- STAGE 05
Field provenance
One row per (business, source, field). Carries the source value, normalized value, last-checked date, ingestion-run pointer, and the display_allowed flag from the manifest.
provider_field_provenance
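The "missing stage means no render" rule above can be sketched as a small gate. This is a minimal illustration, not Fonteum's actual code: the `ProvenanceChain` shape and the names `missingStages` and `canRender` are hypothetical, and only the 0.75 threshold and the list of required links come from this page.

```typescript
// Hypothetical shape: one optional pointer per link named on this page.
// Field names are illustrative and are not Fonteum's real schema.
interface ProvenanceChain {
  sourceSlug?: string;      // registered public source
  manifestSlug?: string;    // source-pack manifest
  ingestionRunId?: string;  // dated ingestion run
  matchConfidence?: number; // entity-match score between 0 and 1
  provenanceRowId?: string; // row in provider_field_provenance
  displayAllowed?: boolean; // display permission copied from the manifest
}

// Threshold stated on this page for writing a match.
const MATCH_THRESHOLD = 0.75;

// Returns the names of the missing links; an empty list means the fact may render.
function missingStages(chain: ProvenanceChain): string[] {
  const missing: string[] = [];
  if (!chain.sourceSlug) missing.push("source");
  if (!chain.manifestSlug) missing.push("manifest");
  if (!chain.ingestionRunId) missing.push("ingestion run");
  if (chain.matchConfidence === undefined || chain.matchConfidence < MATCH_THRESHOLD) {
    missing.push("match");
  }
  if (!chain.provenanceRowId) missing.push("provenance row");
  if (chain.displayAllowed !== true) missing.push("display permission");
  return missing;
}

function canRender(chain: ProvenanceChain): boolean {
  return missingStages(chain).length === 0;
}
```

Returning the list of missing links, rather than a bare boolean, keeps the reason for a suppressed field available for logging.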
Four source families. One provenance contract.
We organize Fonteum by the public datasets that back it — not by a flat list of 40 vertical domains. Every cluster ships with a registered source, an ingestion runner, and a published research asset.
- Cluster
Healthcare graph
NPPES + CMS PECOS link providers to the federal NPI registry and Medicare-billing-active records. The taxonomy + enrollment surface that source-backs every active hero vertical.
Source: CMS NPPES · Checked May 2026
Source: CMS PECOS · Checked May 2026
Four real examples — one per source family.
Each card shows the field as it would render on the relevant Fonteum surface, sourced verbatim from the public dataset, with the “what it means” and “what it does not mean” framing CMS itself documents.
When a field is in the database but not on the page.
The same data that flows through stage 5 (field provenance) does not automatically flow through to stage 7 (display). Four rules govern which fields render, and where.
- Rule 1
Why don't BLS / BEA / HRSA / Census numbers appear on individual provider profiles?
Aggregate datasets (BLS Occupational Employment & Wage Statistics, BEA Regional Income, HRSA Health Professional Shortage Areas, US Census state population) describe markets, not individuals. Attaching a state-mean wage figure to one HVAC contractor's listing would imply a per-business signal the data doesn't carry. They render only on /research aggregate pages.
How it's enforced: Source-pack manifest tier='tier1-research-only' + the data:bls-bea-not-in-profile-components launch gate.
- Rule 2
Why don't low-confidence matches display?
Each match between a public-record row and a Fonteum business is scored by name + city + state + phone + ZIP Jaccard similarity. Confidence below 0.75 is logged in entity_match_log but never written to provider_field_provenance. Ambiguous matches (the top two candidates within 0.05 of each other and both ≥ 0.75) are suppressed too: when CMS lists two providers under similar names in the same city and we can't disambiguate, no row is written for either.
How it's enforced: Match-confidence threshold encoded in every source-pack manifest (confidence.threshold_high). Mirrored to the entity_match_log audit trail (currently 2,001 logged attempts).
- Rule 3
Why are some fields write-locked even when CMS publishes them?
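The scoring and suppression logic described in Rule 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the page does not say how the five fields are tokenized or combined, so the single token-set approach here is a guess, and `MatchRecord`, `tokens`, `jaccard`, and `decide` are hypothetical names. Only the 0.75 threshold and the 0.05 ambiguity margin come from the text, and treating the margin as inclusive is also an assumption.

```typescript
// Hypothetical record shape covering the five fields named in Rule 2.
interface MatchRecord {
  name: string;
  city: string;
  state: string;
  phone: string;
  zip: string;
}

// Assumption: all five fields are pooled into one lowercase token set,
// with the phone number reduced to its digits first.
function tokens(r: MatchRecord): Set<string> {
  const text = [r.name, r.city, r.state, r.phone.replace(/\D/g, ""), r.zip].join(" ");
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0));
}

// Jaccard similarity: size of the intersection divided by size of the union.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach((t) => {
    if (b.has(t)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

type Decision = "write" | "ambiguous" | "below-threshold";

// Applies the threshold, then suppresses the match when the top two
// candidates both clear the threshold and sit within the margin of each other.
function decide(scores: number[], threshold = 0.75, margin = 0.05): Decision {
  const sorted = scores.slice().sort((x, y) => y - x);
  if (sorted.length === 0 || sorted[0] < threshold) return "below-threshold";
  if (sorted.length > 1 && sorted[1] >= threshold && sorted[0] - sorted[1] <= margin) {
    return "ambiguous";
  }
  return "write";
}
```

Under this sketch, only a `"write"` decision would produce a provenance row; the other two outcomes would be logged and nothing displayed.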
Five classes the dataset can fall into.
Every field on every page maps to one of five source classes. The class determines storage rights, refresh cadence, and what attribution must appear next to the value.
- PUBLIC_RECORD
Public record (federal / state)
Sourced from federal or state public-record datasets. No copyright restriction. Safe to store and display indefinitely with attribution. Backs every Tier-2 source-pack.
Examples: CMS NPPES NPI registry, CMS PECOS Medicare enrollment, CMS Care Compare, FL DBPR construction-license file, CSLB CPRA responses, AZ ROC public-records data.
- RESEARCH_AGGREGATE
Research aggregate (Tier-1)
Sourced from federal aggregate datasets that describe markets, not individuals. Renders only on /research surfaces — never attached to per-business profiles.
Examples: BLS Occupational Employment & Wage Statistics (OEWS), BLS QCEW, BEA Regional, HRSA HPSA, US Census state population.
- OWNED
Owned by Fonteum
Generated or assigned by Fonteum's own systems. Full rights to store and display indefinitely.
Examples: Listing identifiers, slugs, vertical IDs, claim flow timestamps, tier flags.
- OWNER_SUBMITTED
Owner submitted
Provided by a logged-in business owner via the claim or owner-portal flow. Storage rights granted by the owner via the submission terms.
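One way to picture how a source class gates display is a lookup from class to permitted surfaces. This is a hedged sketch covering only the four classes named above; `Surface`, `CLASS_SURFACES`, and `mayRender` are hypothetical names. The text confirms only that RESEARCH_AGGREGATE never renders on per-business profiles, so every other entry in the table is an assumption made for illustration.

```typescript
// The four class identifiers named on this page.
type SourceClass = "PUBLIC_RECORD" | "RESEARCH_AGGREGATE" | "OWNED" | "OWNER_SUBMITTED";

// Hypothetical surface names: per-business profile pages and /research pages.
type Surface = "profile" | "research";

// Assumption: apart from the RESEARCH_AGGREGATE entry, which the page states
// explicitly, these surface lists are illustrative only.
const CLASS_SURFACES: Record<SourceClass, Surface[]> = {
  PUBLIC_RECORD: ["profile", "research"],
  RESEARCH_AGGREGATE: ["research"], // never attached to per-business profiles
  OWNED: ["profile"],
  OWNER_SUBMITTED: ["profile"],
};

function mayRender(cls: SourceClass, surface: Surface): boolean {
  return CLASS_SURFACES[cls].indexOf(surface) !== -1;
}
```

Because the table is typed as `Record<SourceClass, Surface[]>`, adding a class to the union without a matching entry fails to compile, which makes an unmapped class hard to ship by accident.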
Name, address, phone, website. Independently public-record where the business is operating publicly, but the canonical source on unclaimed listings is Google Business Profiles.
| Field | Source (unclaimed) | Source (claimed) | Retention | Refresh |
|---|---|---|---|---|
| Business name | Google (cached) | Owner submitted | Indefinite, with refresh | On scheduled snapshot |
| Address | Google (cached) | Owner submitted | Indefinite, with refresh | On scheduled snapshot |
| Phone | Google (cached) | Owner submitted | Indefinite, with refresh | On scheduled snapshot |
| Website | Google (cached) | Owner submitted | Indefinite, with refresh | On scheduled snapshot |
| | Not displayed | Owner submitted | Indefinite | Owner-driven |
Per-business fields written via the §94/§104 provenance framework. Each row in provider_field_provenance carries source + last-checked date + display permission.
| Field | Source (unclaimed) | Source (claimed) | Retention | Refresh |
|---|---|---|---|---|
| NPI (NPPES) | CMS NPPES (matched) | CMS NPPES (matched) | Indefinite, refreshed quarterly | Manifest refresh_cadence_days |
| Provider taxonomy code (NPPES) | CMS NPPES (matched) | CMS NPPES (matched) | Indefinite | Quarterly |
| Medicare PECOS enrollment | CMS PECOS (matched) | CMS PECOS (matched) | Indefinite | Monthly per CMS publish cadence |
| State contractor license # | FL DBPR / AZ ROC / CSLB (matched) | Same | Indefinite | Monthly to quarterly per source |
| CMS overall star rating (where applicable) | CMS Care Compare (matched) | Same | Indefinite | Quarterly |
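The refresh column above implies a staleness check against each row's last-checked date. A minimal sketch, assuming dates are stored as ISO strings and that "quarterly" corresponds to roughly 90 days; `isDueForRefresh` is a hypothetical helper, and only the `refresh_cadence_days` concept comes from the table.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns true when at least refreshCadenceDays have elapsed since lastChecked.
// An unparseable date is treated as due, so a bad row is re-checked, not trusted.
function isDueForRefresh(lastChecked: string, refreshCadenceDays: number, now: Date): boolean {
  const last = Date.parse(lastChecked); // "YYYY-MM-DD" parses as UTC midnight
  if (isNaN(last)) return true;
  return now.getTime() - last >= refreshCadenceDays * MS_PER_DAY;
}

// Usage: a row last checked on the snapshot date, with an assumed 90-day cadence.
const dueInAugust = isDueForRefresh("2026-05-09", 90, new Date("2026-08-10T00:00:00Z"));
const dueInJune = isDueForRefresh("2026-05-09", 90, new Date("2026-06-01T00:00:00Z"));
```

Passing `now` as a parameter instead of reading the clock inside the function keeps the check deterministic and easy to test.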
Rating + review count. Sourced from Google. Aggregated only at the per-business level — never aggregated to the directory level.
| Field | Source (unclaimed) | Source (claimed) | Retention | Refresh |
|---|---|---|---|---|
| Rating | Google (cached) | Google (cached) | TTL-cached, refreshed on snapshot | Weekly for highly-rated segment; on scheduled snapshot otherwise |
| Review count | Google (cached) | Google (cached) | TTL-cached, refreshed on snapshot | Weekly for highly-rated segment; on scheduled snapshot otherwise |
| Individual review text or author | Not stored | Not stored | n/a | n/a |
Internal flags, identifiers, claim history, billing tier. Generated by Fonteum; never shared with third parties without consent.
| Field | Source (unclaimed) | Source (claimed) | Retention | Refresh |
|---|---|---|---|---|
| Listing identifier (UUID, slug) | Owned by Fonteum | Owned by Fonteum | Indefinite | On rename only |
| Claim status, tier, billing | n/a | Owned by Fonteum | Indefinite while claimed | On owner action |
| Created / updated timestamps | Owned by Fonteum | Owned by Fonteum | Indefinite | Auto |
We cite source-backed facts. We do not pay-to-rank, issue trust badges, invent ratings, award providers, or claim license credentials we cannot trace.
Every disclaimer in the homepage “What Fonteum does NOT claim” block applies on every /data-provenance, /research, and per-business surface. Fonteum does not independently rate, inspect, verify, endorse, or guarantee any provider — we cite the CMS, NPPES, FL DBPR, and other public-record sources that already measure them.
- Sources → The full source library — every dataset Fonteum cites, with tier, refresh cadence, fields used, and limitations.
- Methodology → Network-wide sourcing, refresh cadence, and corrections policy.
- Home Health methodology → The §118-§119 staged-vertical doctrine for CMS Care Compare data.
- Research → Tier-1 dated aggregates published from the source graph.
- Editorial policy → Independence, sourcing, conflicts, corrections, retractions.
- Corrections log → Every accepted correction, dated, with the cause named.
- Data & press kit → Cite our data, request a custom export, or reach press.
Compliance posture
We don’t sell ranking and don’t accept payment to move a provider up the list. For final hire decisions, verify licensing, insurance, and references directly with the applicable licensing or credentialing body.
No bulk-licensing source family is currently ingested for this vertical. Hire-time checking still routes through the body named above.