Anonymous, auditable, gz-compressed. One URL per source per snapshot.
Fonteum mirrors every federal-source snapshot to S3 with 90-day rolling retention. The bulk surface gives buyer dev teams, academic researchers, and AI agents one canonical URL per source family + per snapshot date — full 14-tuple provenance in response headers, sidecar manifest at manifest.json, and a cross-link to /verify/[snapshot-id] for SHA-256 hash-match against Fonteum’s integrity attestation.
Per-source + top-level. All anonymous.
Three endpoints per source family, plus one top-level discovery index. All anonymous (no Authorization header required), all returning the canonical X-Fonteum-* response headers.
- GET /api/v1/bulk/<source>/latest.csv.gz: 302 redirect to the most recent S3-cached snapshot. Cache-Control: max-age=300 (rolls daily as new snapshots ingest).
- GET /api/v1/bulk/<source>/<YYYY-MM-DD>.csv.gz: 302 redirect to a specific dated snapshot. Immutable per snapshot, with Cache-Control: max-age=86400, immutable. Pin a date when you need reproducibility (cite a specific snapshot in a paper, replay an analysis).
- GET /api/v1/bulk/<source>/manifest.json: sidecar JSON listing every cached snapshot for the source, with sha256 + size + verify_url + cached_at + retention_expires per row.
- GET /api/v1/bulk/manifest.json: top-level index across all 8 source families. One URL for DataCite harvesters, dbt source registration, evaluation scripts.
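The endpoint patterns above can be assembled programmatically. The following is a minimal Python sketch rather than an official client: `bulk_urls` and `BASE` are hypothetical names, and the host is simply the one used in the examples on this page.

```python
from datetime import date
from typing import Dict, Optional

# Host and path prefix as they appear in the examples on this page.
BASE = "https://fonteum.com/api/v1/bulk"


def bulk_urls(source_id: str, snapshot: Optional[date] = None) -> Dict[str, str]:
    """Build the bulk URLs for one source family (hypothetical helper)."""
    urls = {
        "latest": f"{BASE}/{source_id}/latest.csv.gz",
        "manifest": f"{BASE}/{source_id}/manifest.json",
        "index": f"{BASE}/manifest.json",
    }
    if snapshot is not None:
        # date.isoformat() yields YYYY-MM-DD, matching the dated endpoint.
        urls["dated"] = f"{BASE}/{source_id}/{snapshot.isoformat()}.csv.gz"
    return urls


if __name__ == "__main__":
    for name, url in bulk_urls("cms-pecos", date(2025, 1, 5)).items():
        print(f"{name}: {url}")
```

Passing a `date` object instead of a string keeps malformed dates out of the URL, since `isoformat()` always emits the YYYY-MM-DD shape the dated endpoint uses.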
Phase 1 — 8 federal sources.
One row per source family registered in the cron-sources registry. Each maps to a source_id value usable in the URL paths above.
- cms-pecos: CMS PECOS PPEF (Provider Enrollment, Chain & Ownership) · Weekly (Sunday) · license: US-Government-Works
- oig-leie: OIG LEIE (List of Excluded Individuals/Entities) · Monthly (1st of month) · license: US-Government-Works
- hrsa-hpsa: HRSA HPSA (Health Professional Shortage Areas) · Quarterly (1st of Jan / Apr / Jul / Oct) · license: US-Government-Works
- bls-oews: BLS OEWS (Occupational Employment & Wage Statistics) · Annual (mid-May) · license: US-Government-Works
- bea-regional: BEA Regional Economic Accounts (state GDP) · Annual (mid-October) · license: US-Government-Works
- cms-nppes: CMS NPPES NPI Registry · Quarterly (per specialty, operator-triggered) · license: US-Government-Works
- cms-care-compare: CMS Care Compare (per facility type) · Quarterly (per facility type, operator-triggered) · license: US-Government-Works
- cms-open-payments: CMS Open Payments (Sunshine Act, General Payments) · Annual (mid-June / late June) · license: US-Government-Works
- cms-hcris-hospital-2552-10: CMS HCRIS Hospital Cost Reports (form CMS-2552-10) · Annual (operator-triggered ~November) · license: US-Government-Works
- cms-qpp-mips: CMS QPP MIPS Individual + Group Scores · Annual (operator-triggered ~July, post-performance-year scoring) · license: US-Government-Works
- cms-provider-utilization: CMS Medicare Provider Utilization & Payment Data (Physician & Other Practitioners by Provider and Service) · Annual (mid-June release of prior data year) · license: US-Government-Works
- cms-inpatient-utilization: CMS Medicare Inpatient Hospitals by Provider and Service · Annual (mid-June release of prior data year) · license: US-Government-Works
- cms-outpatient-utilization: CMS Medicare Outpatient Hospitals by Provider and Service · Annual (mid-June release of prior data year) · license: US-Government-Works
- hrsa-uds: HRSA Uniform Data System (UDS) · Annual (May, post-grant-year reporting) · license: US-Government-Works
- cms-pos: CMS Provider of Services (POS), iQIES Facility Registry · Quarterly (operator-triggered; CMS publishes Q1–Q4) · license: US-Government-Works
Phase 1 ships gzipped CSV. Parquet + JSON Lines queued.
Each archive is the upstream source CSV exactly as captured at ingestion time, gzip-compressed (application/gzip). Header rows + column ordering match the upstream source — Fonteum does not normalize, dedupe, or transform.
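Because the archive is the untouched upstream CSV, the header row can be read straight from the decompressed stream. The sketch below illustrates this on an in-memory stand-in; `read_archive` is a hypothetical helper, the column names are invented for the demo, and UTF-8 is an assumption that may not hold for every upstream source.

```python
import csv
import gzip
import io

# Tiny stand-in for a downloaded archive; the column names are made up.
sample_csv = "NPI,STATE\n1234567890,TX\n1098765432,CA\n"
archive_bytes = gzip.compress(sample_csv.encode("utf-8"))


def read_archive(gz_bytes, encoding="utf-8"):
    """Return (header, rows) from gzipped CSV bytes without altering them."""
    # Text mode with newline="" lets the csv module handle line endings itself.
    with gzip.open(io.BytesIO(gz_bytes), mode="rt",
                   encoding=encoding, newline="") as fh:
        reader = csv.DictReader(fh)
        rows = list(reader)
        return reader.fieldnames, rows


if __name__ == "__main__":
    header, rows = read_archive(archive_bytes)
    print(header, len(rows))
```

For multi-gigabyte archives, iterating over the reader instead of calling `list()` keeps memory use flat.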
Phase 2 format alternatives queued separately: Parquet (§sprint3-bulk-export-parquet), JSON Lines (§sprint3-bulk-export-jsonl), FHIR Bundle (§sprint3-bulk-export-fhir-bundle), partitioned per-state / per-vertical (§sprint3-bulk-export-partitioned).
14-tuple provenance + hash-match cross-link.
Every 302 redirect carries:
- X-Fonteum-Source: source_id of the dataset
- X-Fonteum-Snapshot-Date: ISO-8601 date of the snapshot
- X-Fonteum-SHA256: 64-char lowercase hex; matches snapshot_attestations.content_hash
- X-Fonteum-License: SPDX identifier (e.g. US-Government-Works, CC-BY-4.0)
- X-Fonteum-Cite: citation format URL (/cite)
- X-Fonteum-Verify: hash-match endpoint (/verify/<snapshot-id>)
- Link: sidecar manifest URL with rel="describedby"
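A consumer may want to collect these headers into one record and reject a malformed hash before trusting it. The following is a sketch under stated assumptions: `Provenance` and `parse_provenance` are hypothetical names, and the sample header values (including the snapshot id) are invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Mapping

# The header is documented as 64-char lowercase hex.
SHA256_RE = re.compile(r"[0-9a-f]{64}")


@dataclass(frozen=True)
class Provenance:
    source: str
    snapshot_date: str
    sha256: str
    license: str
    verify: str


def parse_provenance(headers: Mapping[str, str]) -> Provenance:
    """Collect the X-Fonteum-* headers into one record (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    sha = h["x-fonteum-sha256"]
    if not SHA256_RE.fullmatch(sha):
        raise ValueError(f"not a 64-char lowercase hex digest: {sha!r}")
    return Provenance(
        source=h["x-fonteum-source"],
        snapshot_date=h["x-fonteum-snapshot-date"],
        sha256=sha,
        license=h["x-fonteum-license"],
        verify=h["x-fonteum-verify"],
    )


# Invented example values; real responses will differ.
sample_headers = {
    "X-Fonteum-Source": "cms-pecos",
    "X-Fonteum-Snapshot-Date": "2025-01-05",
    "X-Fonteum-SHA256": "ab" * 32,
    "X-Fonteum-License": "US-Government-Works",
    "X-Fonteum-Verify": "/verify/example-snapshot-id",
}

if __name__ == "__main__":
    print(parse_provenance(sample_headers))
```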
Hash-match against the snapshot attestation.
The bulk surface is content-addressable: every snapshot has one SHA-256, that hash is attested in snapshot_attestations at ingestion time, and both the response header and the /verify/[snapshot-id] endpoint return the same value. Defense in depth — one consumer, three independent hash-match paths.
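The comparison described above can be sketched in a few lines of Python. This is an illustration, not Fonteum's implementation: `sha256_stream` and `hashes_agree` are hypothetical helpers, and the sketch assumes the header value and the /verify response have already been fetched as strings.

```python
import hashlib
import io


def sha256_stream(fileobj, chunk_size=1 << 20):
    """Hash a binary file object in chunks so large archives stay out of RAM."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hashes_agree(local_sha, header_sha, verify_sha):
    """True when the locally computed, header, and /verify hashes all match."""
    # Strip whitespace (e.g. a trailing CRLF) and fold case before comparing.
    values = {s.strip().lower() for s in (local_sha, header_sha, verify_sha)}
    return len(values) == 1


if __name__ == "__main__":
    payload = b"stand-in for the gzipped bytes"
    local = sha256_stream(io.BytesIO(payload))
    print(hashes_agree(local, local, local))
```

Hashing is done over the gzipped bytes as downloaded, not the decompressed CSV, which matches how the examples on this page compute the digest.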
# 1. Read the manifest to find the latest snapshot date + SHA-256
curl -s https://fonteum.com/api/v1/bulk/cms-pecos/manifest.json \
| jq '.snapshots[0] | {snapshot_date, sha256, cache_url}'
# 2. Download the gzipped CSV (302-redirect resolves to S3)
curl -L -o pecos-latest.csv.gz \
https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz
# 3. Recompute the SHA-256 locally and compare to the header value
EXPECTED=$(curl -sI https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz \
| awk -F': ' 'tolower($1)=="x-fonteum-sha256" {print tolower($2)}' \
| tr -d '\r')
ACTUAL=$(shasum -a 256 pecos-latest.csv.gz | awk '{print $1}')
[ "$EXPECTED" = "$ACTUAL" ] && echo "ok" || echo "MISMATCH"
# 4. Cross-check against the /verify endpoint (defense in depth)
SNAPSHOT_ID=$(curl -sI https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz \
| awk -F'/' 'tolower($1)~/x-fonteum-verify/ {print $NF}' | tr -d '\r')
curl -s -H 'Accept: text/plain' https://fonteum.com/verify/$SNAPSHOT_ID
# returns the 64-char hex hash; should equal both $EXPECTED and $ACTUAL

urllib + gzip + hashlib.
# Python 3.10+ (stdlib only: urllib + gzip + hashlib)
import csv
import gzip
import hashlib
import io
import urllib.request

URL = "https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz"

# urlopen follows the 302 automatically, so resp.headers belong to the final
# response; this assumes the X-Fonteum-* headers are still visible there.
req = urllib.request.Request(URL)
with urllib.request.urlopen(req) as resp:
    raw = resp.read()
    expected_sha = resp.headers.get("X-Fonteum-SHA256", "").lower()
    snapshot_date = resp.headers.get("X-Fonteum-Snapshot-Date", "")

# Verify the hash of the gzipped bytes matches what Fonteum attested
actual_sha = hashlib.sha256(raw).hexdigest()
if actual_sha != expected_sha:
    raise ValueError(f"SHA mismatch: {actual_sha} != {expected_sha}")

# Decompress + iterate
with gzip.GzipFile(fileobj=io.BytesIO(raw), mode="rb") as gz:
    reader = csv.DictReader(io.TextIOWrapper(gz, encoding="utf-8", newline=""))
    for row in reader:
        # ... your analysis here ...
        pass

print(f"Hash-matched snapshot {snapshot_date} ({len(raw):,} bytes gzipped)")

readr + digest + httr.
# R 4.0+ (readr + digest + httr)
library(readr)
library(digest)
library(httr)
url <- "https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz"
# httr::GET follows 302 by default + exposes headers
res <- GET(url)
stop_for_status(res)
expected_sha <- tolower(headers(res)[["x-fonteum-sha256"]])
raw <- content(res, "raw")
# Recompute SHA-256 over the gzipped bytes (same as the header)
actual_sha <- digest(raw, algo = "sha256", serialize = FALSE)
stopifnot(expected_sha == actual_sha)
# The bytes are still gzipped, so wrap the connection in gzcon() before parsing
df <- read_csv(gzcon(rawConnection(raw)))
message(sprintf("Hash-matched snapshot: %d rows, %d cols", nrow(df), ncol(df)))

stdlib fetch + crypto + zlib.
// Node 18+ (built-in fetch + crypto + zlib)
import { createHash } from "node:crypto";
import { gunzipSync } from "node:zlib";
const URL = "https://fonteum.com/api/v1/bulk/cms-pecos/latest.csv.gz";
const res = await fetch(URL, { redirect: "follow" });
if (!res.ok) throw new Error(`HTTP ${res.status}`);
const expected = res.headers.get("x-fonteum-sha256")?.toLowerCase();
const buf = Buffer.from(await res.arrayBuffer());
const actual = createHash("sha256").update(buf).digest("hex");
if (actual !== expected) throw new Error(`sha mismatch ${actual} != ${expected}`);
const csv = gunzipSync(buf).toString("utf-8");
console.log(`Hash-matched ${csv.split("\n").length} rows`);

Per-source SPDX. Cite Fonteum when used in publications.
Federal sources (CMS, OIG, HRSA, BLS, BEA) are US-Government-Works — public domain, redistribution allowed. Fonteum-derived datasets carry CC-BY-4.0 requiring attribution. The X-Fonteum-License header surfaces the SPDX value on every response.
For papers, theses, dashboards, or commercial products: cite Fonteum per /cite (APA + AMA + BibTeX). Pin the snapshot_date when reproducibility matters; a dated URL is immutable.
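Before citing a pinned date, it can be worth confirming that the snapshot is still inside the retention window. The sketch below does this against an in-memory manifest; the field names follow the manifest description on this page, but `find_snapshot`, the sample values, and the assumption that retention_expires is an ISO-8601 timestamp with a UTC offset are all illustrative.

```python
from datetime import datetime, timezone

# Invented manifest rows; a real manifest.json would be fetched and parsed
# with json.loads(). Hashes and URLs here are placeholders.
manifest = {
    "snapshots": [
        {
            "snapshot_date": "2025-01-12",
            "sha256": "cd" * 32,
            "cache_url": "https://example.invalid/cms-pecos/2025-01-12.csv.gz",
            "retention_expires": "2025-04-12T00:00:00+00:00",
        },
        {
            "snapshot_date": "2025-01-05",
            "sha256": "ab" * 32,
            "cache_url": "https://example.invalid/cms-pecos/2025-01-05.csv.gz",
            "retention_expires": "2025-04-05T00:00:00+00:00",
        },
    ]
}


def find_snapshot(manifest, snapshot_date, now):
    """Return the manifest row for a pinned date, or None if absent or expired."""
    for row in manifest["snapshots"]:
        if row["snapshot_date"] == snapshot_date:
            # Assumes an ISO-8601 timestamp carrying an explicit UTC offset.
            expires = datetime.fromisoformat(row["retention_expires"])
            return row if expires > now else None
    return None


if __name__ == "__main__":
    row = find_snapshot(manifest, "2025-01-05", datetime.now(timezone.utc))
    print(row["sha256"] if row else "snapshot not available")
```

Recording the returned sha256 alongside the pinned date gives a citation that can still be checked after the cached copy has rolled out of retention.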
Phase 1 ships gzipped CSV. Phase 2 — formats + partitions.
- Phase 1 (this wave): 8 source families × (latest + dated + manifest) endpoints + top-level manifest + /data-catalog distribution surfacing.
- §sprint3-bulk-export-parquet (queued): Parquet format alternative, same URLs with a .parquet suffix.
- §sprint3-bulk-export-jsonl (queued): JSON Lines format, .jsonl.gz.
- §sprint3-bulk-export-partitioned (queued): per-state + per-vertical splits.
- §sprint3-bulk-export-fhir-bundle (queued): FHIR Bundle format for FHIR-aligned consumers.
- §sprint3-datacite-bulk-listing (queued): DataCite metadata harvester registration so federal catalogs auto-discover the bulk surface.