CartoChrome computes Healthcare Access Scores for every populated ZIP code in the United States -- approximately 33,000 ZCTAs (ZIP Code Tabulation Areas) as defined by the Census Bureau. Each ZIP code receives 11 scores: one overall and 10 condition-specific. That is 363,000 scores in total, each derived from a spatial analysis algorithm that weighs four dimensions: provider locations, facility quality, population demand, and social determinants of health. And the entire process runs on autopilot.
The Data Ingestion Layer
Our pipeline begins with 21 free, public data sources published by federal agencies. These sources update on different schedules -- weekly, monthly, quarterly, and annually -- and our ingestion layer handles each cadence automatically.
The core data sources include:
- CMS NPPES (National Plan and Provider Enumeration System) -- Updated monthly with a full file of ~7.5 million provider records, plus weekly delta files. We filter this to ~4 million active, patient-facing providers.
- CMS Hospital Compare -- Monthly quality metrics including star ratings, mortality rates, readmission rates, and patient experience scores for every Medicare-certified hospital.
- Census ACS 5-Year Estimates -- Annual demographic data at the ZCTA level: population, age distribution, income, insurance coverage, vehicle access, disability rates, and more.
- CDC PLACES -- Census tract-level health outcome measures used as our primary calibration target.
Each source has a dedicated Celery task that checks for new data, downloads it, validates the schema, and loads it into PostgreSQL with PostGIS spatial extensions. If a source is unavailable or returns malformed data, the pipeline logs an alert and retries -- it never silently proceeds with stale data.
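In outline, every ingestion task follows the same pattern. Here is a minimal sketch of the NPPES weekly task; `download_nppes_delta`, `validate_nppes_schema`, and `load_into_postgis` are hypothetical stand-ins for the real helpers, and the retry settings are illustrative:

```python
import logging

from celery import shared_task

logger = logging.getLogger(__name__)

# Hypothetical helpers standing in for the real download/validate/load code.
def download_nppes_delta() -> str: ...
def validate_nppes_schema(path: str) -> None: ...
def load_into_postgis(path: str) -> None: ...

@shared_task(bind=True, max_retries=5, default_retry_delay=3600)
def ingest_nppes_weekly(self):
    """Fetch the weekly NPPES delta file, validate it, and load it."""
    try:
        path = download_nppes_delta()    # pull the latest delta file
        validate_nppes_schema(path)      # reject malformed data loudly
        load_into_postgis(path)          # upsert into PostgreSQL + PostGIS
    except Exception as exc:
        # Alert and retry -- never proceed silently with stale data.
        logger.exception("NPPES ingestion failed; scheduling retry")
        raise self.retry(exc=exc)
```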
The Scoring Engine
Once fresh data is ingested, the scoring engine runs the Enhanced Two-Step Floating Catchment Area (E2SFCA) computation. This is the computationally intensive core of the pipeline. The two namesake steps produce the raw access score; two further steps adjust it for social determinants and normalize the scale.
**Step 1: Supply-to-Demand Ratio.** For each provider or facility, the algorithm draws a catchment area (the maximum distance patients would reasonably travel) and computes the ratio of provider supply to population demand within that catchment. Catchment radii are calibrated per healthcare type and per urbanicity tier using RUCA codes -- a rural primary care physician serves patients within a 40-mile radius, while an urban specialist might draw from just 10 miles.
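In code, Step 1 reduces to a decay-weighted ratio. A minimal sketch, assuming ZCTA-to-provider distances are precomputed; the catchment radii and the suburban tier here are illustrative, not our calibrated values:

```python
from dataclasses import dataclass

# Illustrative catchment radii in miles per urbanicity tier; production
# values are calibrated per healthcare type using RUCA codes.
CATCHMENT_MILES = {"urban": 10.0, "suburban": 25.0, "rural": 40.0}

@dataclass
class Provider:
    npi: str
    supply: float   # e.g. FTE-weighted provider count at this location
    tier: str       # "urban" | "suburban" | "rural"

def supply_demand_ratio(provider, zcta_demand, zcta_distance, decay):
    """Step 1: supply divided by decay-weighted demand in the catchment.

    zcta_demand:   {zcta: population demand}
    zcta_distance: {zcta: miles from ZCTA centroid to this provider}
    decay:         f(distance, radius) -> weight in [0, 1]
    """
    radius = CATCHMENT_MILES[provider.tier]
    demand = sum(
        pop * decay(zcta_distance[z], radius)
        for z, pop in zcta_demand.items()
        if zcta_distance[z] <= radius   # only ZCTAs inside the catchment
    )
    return provider.supply / demand if demand > 0 else 0.0
```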
**Step 2: Access Score Summation.** For each ZIP code, the algorithm sums the supply-to-demand ratios of all providers whose catchment areas overlap that ZIP, applying distance-decay functions that weight nearby providers more heavily. Different healthcare types use different decay functions -- Gaussian for primary care, sigmoid for emergency services (reflecting the sharp survival drop-off beyond critical travel times), and logistic for specialists.
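The decay kernels themselves are small functions. A sketch of two of the three families named above, plus the Step 2 summation; the steepness and midpoint parameters are placeholders, not our calibrated ones:

```python
import math

def gaussian_decay(d, radius):
    """Smooth fall-off, used for primary care."""
    return math.exp(-3.0 * (d / radius) ** 2)

def sigmoid_decay(d, radius, steepness=10.0, midpoint=0.6):
    """Sharp drop past a critical travel fraction (emergency services)."""
    return 1.0 / (1.0 + math.exp(steepness * (d / radius - midpoint)))

def zcta_access_score(reachable, decay, radius):
    """Step 2: sum decay-weighted Step-1 ratios of every provider whose
    catchment overlaps this ZCTA.

    reachable: [(ratio_j, distance_to_provider_j), ...]
    """
    return sum(r * decay(d, radius) for r, d in reachable)
```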
**Step 3: SDOH Penalty.** The raw access score is multiplied by a Social Determinants of Health penalty factor that ranges from 0.35 to 1.05. This penalty uses an exponential attenuation model across six sub-indices: insurance coverage, economic status, transportation access, health literacy, disability prevalence, and age vulnerability. The multiplicative structure means that compounding disadvantages -- being uninsured AND lacking transportation AND having high disability rates -- result in dramatically lower scores.
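One plausible reading of that model, sketched below: anchor the ceiling at 1.05 and attenuate exponentially in the weighted sub-index deficits, so that the exponential of a sum factors into a product of per-index penalties. The equal weights and exact functional form are assumptions, not our published calibration:

```python
import math

SDOH_KEYS = (
    "insurance", "economic", "transportation",
    "health_literacy", "disability", "age_vulnerability",
)

def sdoh_penalty(deficits, weights=None, floor=0.35, ceiling=1.05):
    """Step 3 sketch: multiplicative exponential attenuation.

    deficits: {key: value in [0, 1]}, where 0 means no disadvantage on
    that sub-index. Equal weights below are placeholders.
    """
    weights = weights or {k: 1.0 / len(SDOH_KEYS) for k in SDOH_KEYS}
    # exp(-sum w_i * x_i) == prod exp(-w_i * x_i): each additional
    # deficit multiplies the penalty down, so disadvantages compound.
    raw = ceiling * math.exp(-sum(weights[k] * deficits[k] for k in SDOH_KEYS))
    return max(floor, raw)
```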
**Step 4: Normalization.** Raw scores are normalized to a 0-100 scale using a log-softcap at the 95th percentile. This preserves the full rank ordering while compressing the top 5% of ZIP codes into the 95-100 range, preventing a few exceptional ZIP codes from distorting the entire scale.
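A sketch of one way to implement that softcap; the linear/log split below is our reading of the description, not a spec:

```python
import numpy as np

def normalize_scores(raw_scores, cap_pct=95.0):
    """Step 4 sketch: rank-preserving 0-100 scale with a log softcap.

    Scores up to the 95th percentile map linearly onto [0, 95]; the top
    5% are log-compressed into (95, 100].
    """
    raw = np.asarray(raw_scores, dtype=float)
    p95 = np.percentile(raw, cap_pct)
    out = 95.0 * np.minimum(raw, p95) / p95   # linear region
    tail = raw.max() - p95
    if tail > 0:
        over = raw > p95
        # Log-compress the tail so a few exceptional ZIP codes cannot
        # stretch the whole scale; rank order in the tail is preserved.
        out[over] = 95.0 + 5.0 * np.log1p(raw[over] - p95) / np.log1p(tail)
    return np.clip(out, 0.0, 100.0)
```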
Automation Architecture
The pipeline is orchestrated by Celery with a Redis broker running on AWS. The task graph looks like this (a sketch in Celery canvas notation follows the list):
- Data ingestion tasks run on their respective schedules (weekly/monthly/quarterly/annually)
- Validation tasks confirm data freshness and schema compliance
- Geocoding tasks resolve any new provider addresses to lat/lng coordinates
- E2SFCA computation runs after any upstream data change, processing all 33,000 ZIP codes
- Score publication pushes updated scores to the API layer and triggers map tile regeneration
- Cache warming pre-populates Redis and CDN caches for high-traffic ZIP codes
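Expressed with Celery's canvas primitives, the graph is roughly the following; the task names and empty bodies are placeholders for the real registry:

```python
from celery import Celery, chain, group

app = Celery("cartochrome", broker="redis://localhost:6379/0")

# Placeholder stages; each stands in for a real pipeline task. The *_
# parameters simply absorb results passed along the chain.
@app.task
def ingest_nppes(): ...

@app.task
def ingest_hospital_compare(): ...

@app.task
def validate_freshness(*_): ...

@app.task
def geocode_new_providers(*_): ...

@app.task
def compute_e2sfca_all(*_): ...

@app.task
def publish_scores(*_): ...

@app.task
def warm_caches(*_): ...

# Chaining a group into a task makes Celery build a chord: the ingests
# run in parallel, and each later stage waits on its predecessor.
monthly_refresh = chain(
    group(ingest_nppes.s(), ingest_hospital_compare.s()),
    validate_freshness.s(),
    geocode_new_providers.s(),
    compute_e2sfca_all.s(),
    publish_scores.s(),
    warm_caches.s(),
)
```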
The entire compute cycle -- from data ingestion to published scores -- takes approximately 4 hours on a single ECS Fargate task. Django management commands (`compute_e2sfca`, `refresh_health_scores`, `warmcache`) handle each stage.
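Each of those commands is a thin wrapper around the engine. A minimal sketch of a `compute_e2sfca`-style command, with `run_e2sfca` as a hypothetical entry point into the scoring code:

```python
from django.core.management.base import BaseCommand

def run_e2sfca(zcta=None) -> int:
    """Hypothetical entry point into the scoring engine."""
    ...

class Command(BaseCommand):
    help = "Recompute E2SFCA access scores for all (or one) ZCTAs"

    def add_arguments(self, parser):
        parser.add_argument("--zcta", help="recompute a single ZCTA only")

    def handle(self, *args, **options):
        count = run_e2sfca(zcta=options.get("zcta"))
        self.stdout.write(self.style.SUCCESS(f"Scored {count} ZCTAs"))
```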
Why Zero Manual Work Matters
Automation is not just a convenience -- it is a correctness guarantee. Manual data processing introduces human error at every step: downloading the wrong file, misaligning columns, forgetting to re-run downstream computations. By eliminating human intervention, we ensure that scores are always computed from the freshest available data using the exact same algorithm, every time.
It also means we can scale. Computing 363,000 scores monthly with a team of analysts would require dozens of person-hours. Our pipeline does it in 4 hours of compute time at a cost of approximately $2 per run.
