Methodology
ETL pipeline, source snapshots, and computed signals
Transparent end-to-end description of how Housing BuildDesignHub turns public datasets into source-attributed market profiles.
Housing BuildDesignHub currently uses source-attributed static fixtures to validate the data model and UI. Metrics are not live and should not be treated as investment, legal, financial, or property-level advice.
The pipeline
Four steps, in order, every run.
Eight independent federal data agencies feed this project's data layer. The diagram is illustrative; each agency publishes its own data on its own cadence under its own terms.
Step 1
Extract
Each registered source adapter loads a canonical SourceSnapshot for every supported geography. Snapshots carry an id, source id, dataset name, period, retrieved/release dates, source URL, update frequency, confidence, and the raw metric values themselves.
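As a rough illustration of the fields listed above, a snapshot could be modelled as a small serializable record. This is a sketch only: the field names, the geography_id field, and every value in the example instance are hypothetical placeholders, not the project's actual schema or data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SourceSnapshot:
    """One source's output for one geography at one point in time (field names are assumed)."""
    id: str
    source_id: str
    dataset_name: str
    geography_id: str
    period: str
    retrieved_date: str   # ISO date string
    release_date: str     # ISO date string
    source_url: str
    update_frequency: str
    confidence: str       # "high", "medium", or "experimental"
    metrics: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only; not real data.
example = SourceSnapshot(
    id="snap-001",
    source_id="acs5",
    dataset_name="ACS 5-year",
    geography_id="geo-1",
    period="placeholder-period",
    retrieved_date="2000-01-01",
    release_date="2000-01-01",
    source_url="https://example.invalid/dataset",
    update_frequency="annual",
    confidence="high",
    metrics={"medianGrossRent": 1500.0},
)
```

Because the record is plain data, it can be cached, written to disk, and replayed without touching the adapter that produced it.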
Step 2
Transform
The transform step assembles a NormalizedMarketProfile per geography: metrics keyed by stable ids (population, medianGrossRent, homePriceIndex, …), computed signal scores, an aggregated source attribution roster, freshness info, and a coverage report.
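A minimal sketch of that assembly step, assuming snapshots arrive as plain dictionaries. The function name, the dictionary keys, and the REQUIRED_METRICS list are hypothetical; signal scores and freshness are left out to keep the example short.

```python
from typing import Dict, List

# Hypothetical list of metrics a report requires.
REQUIRED_METRICS = ["population", "medianGrossRent", "homePriceIndex"]


def build_profile(geography_id: str, snapshots: List[dict]) -> dict:
    """Merge per-source snapshots into one per-geography profile."""
    metrics: Dict[str, dict] = {}
    sources: List[str] = []
    # Sorting by snapshot id makes the output independent of input order.
    for snap in sorted(snapshots, key=lambda s: s["id"]):
        if snap["sourceId"] not in sources:
            sources.append(snap["sourceId"])
        for metric_id, value in snap["metrics"].items():
            # Every metric keeps a pointer back to the source that supplied it.
            metrics[metric_id] = {"value": value, "sourceId": snap["sourceId"]}
    missing = [m for m in REQUIRED_METRICS if m not in metrics]
    return {
        "geographyId": geography_id,
        "metrics": metrics,
        "sources": sources,
        "coverage": {"required": len(REQUIRED_METRICS), "missing": missing},
    }
```

Keeping the source id next to each value is what lets the UI attribute every number without a second lookup.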
Step 3
Validate
Four validators run on the pipeline output: freshness (age vs. cadence), completeness (required metrics per report), geography (required identity fields), and metrics (units present, no impossible values). Findings are warnings — not build failures — and surface on /data-status.
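The freshness check (age vs. cadence) could look roughly like the sketch below. The MAX_AGE_DAYS thresholds and the function name are assumptions made for illustration; the project's actual limits are not stated here. Findings come back as warning strings rather than exceptions, matching the warnings-not-failures behaviour described above.

```python
from datetime import date
from typing import List

# Hypothetical maximum ages per update cadence, in days.
MAX_AGE_DAYS = {"monthly": 62, "quarterly": 185, "annual": 730}


def freshness_warnings(retrieved: date, update_frequency: str, today: date) -> List[str]:
    """Return warnings (never raise) when a snapshot is older than its cadence allows."""
    limit = MAX_AGE_DAYS.get(update_frequency)
    if limit is None:
        return [f"unknown update frequency: {update_frequency}"]
    age_days = (today - retrieved).days
    if age_days > limit:
        return [f"stale snapshot: {age_days} days old, limit is {limit}"]
    return []
```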
Step 4
Load
Phase 3 'load' writes profiles into an in-memory store keyed by geography id. The interface is intentionally db-shaped so a Postgres / KV / S3 loader can replace it without changing the rest of the pipeline.
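One way to picture the db-shaped interface is a small protocol with put and get, plus an in-memory implementation. The names ProfileStore, InMemoryProfileStore, and load are hypothetical; the sketch only shows why a different backend could be swapped in without touching callers.

```python
from typing import Dict, Optional, Protocol


class ProfileStore(Protocol):
    """Minimal storage contract; a Postgres, KV, or S3 loader would implement the same two methods."""

    def put(self, geography_id: str, profile: dict) -> None: ...

    def get(self, geography_id: str) -> Optional[dict]: ...


class InMemoryProfileStore:
    """Keeps profiles in a dictionary keyed by geography id."""

    def __init__(self) -> None:
        self._profiles: Dict[str, dict] = {}

    def put(self, geography_id: str, profile: dict) -> None:
        self._profiles[geography_id] = profile

    def get(self, geography_id: str) -> Optional[dict]:
        return self._profiles.get(geography_id)


def load(store: ProfileStore, profiles: Dict[str, dict]) -> int:
    """Write every profile through the store interface and report how many were written."""
    for geography_id, profile in profiles.items():
        store.put(geography_id, profile)
    return len(profiles)
```

The load function depends only on the protocol, so the rest of the pipeline never needs to know which backend is behind it.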
Why public data
Background from the U.S. Census Bureau on the American Community Survey — the survey program behind the ACS 5-year tables this project uses. Embedded as agency context, not as a partnership or endorsement.
Core concepts
Source snapshots
The unit of work. A snapshot is a serializable record of one source's output for one geography at one point in time. Snapshots can be cached, persisted, replayed, and validated independently.
Normalized market profiles
What the UI consumes. A profile is a per-geography object combining raw metrics, computed signals, source attributions, freshness, and coverage — built deterministically from the snapshot set.
Computed scores
Affordability, supply, price momentum, job market, economic strength, and climate risk are derived from raw inputs using fixed, documented formulas. Each score is marked sourceId="computed" so it is never confused with an observed value.
Confidence levels
Each source carries a confidence rating — high, medium, or experimental — based on coverage, stability, and methodology. Signals leaning on experimental inputs are flagged.
Fixture vs live data
The current pipeline ships official-source fixtures (deterministic JSON snapshots whose shape matches the public release). The UI labels every metric's provenance as 'fixture' or 'live' — never as 'real-time'. Swapping to live data is a per-adapter change.
Future scheduled ETL
The same extract/transform/validate/load contract will drive a scheduled ETL run (Census API → adapter → snapshot store → profile store). The UI does not need to know whether snapshots came from a fixture or a scheduled run — only that they passed validation.
Climate risk — separate from opportunity
The climateRisk signal mirrors FEMA's National Risk Index: higher = more hazard exposure. To keep semantics from blurring, climateRisk is shown on its own (never inverted) and only enters the overall composite as a 0-20 point penalty. The underlying mapping is county-level — Phase 4A does not yet use the tract-level NRI release.
Signal formulas
Affordability signal
computed signal
rentBurden = annualGrossRent / medianHouseholdIncome
score = clamp(round((0.30 − rentBurden) / 0.30 × 100 + 50), 0, 100)
rentBurden = 0.30 → 50. Lower burden → higher score.
Inputs: ACS medianGrossRent, ACS medianHouseholdIncome
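The affordability formula can be sketched as a short function. Two assumptions are made here: annualGrossRent is taken to be the ACS monthly median gross rent multiplied by 12, and round is taken to round halves up (Python's built-in round rounds halves to even, so a helper is used instead). The function and helper names are illustrative.

```python
import math


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def round_half_up(x: float) -> int:
    # Assumed rounding rule; the methodology text only says "round".
    return math.floor(x + 0.5)


def affordability_score(median_gross_rent_monthly: float, median_household_income: float) -> int:
    # Assumption: annual rent is the monthly ACS figure times 12.
    annual_gross_rent = median_gross_rent_monthly * 12
    rent_burden = annual_gross_rent / median_household_income
    return clamp(round_half_up((0.30 - rent_burden) / 0.30 * 100 + 50), 0, 100)
```

With a monthly rent of 1,500 and an income of 60,000 the burden is exactly 0.30, which lands on the midpoint score of 50; a burden of 0.60 or more bottoms out at 0.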
Supply signal
computed signal
intensity = (trailing12PermitTotal / housingUnits) × 1,000
score = clamp(round(intensity / 0.6), 0, 100)
~60 permits per 1,000 existing units per year ≈ 100.
Inputs: Census BPS trailing-12 permits, ACS housing units
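A minimal sketch of the supply formula, assuming round rounds halves up. The function and helper names are illustrative, not the project's code.

```python
import math


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def round_half_up(x: float) -> int:
    # Assumed rounding rule; the methodology text only says "round".
    return math.floor(x + 0.5)


def supply_score(trailing12_permit_total: float, housing_units: float) -> int:
    # Permits per 1,000 existing housing units over the trailing 12 months.
    intensity = trailing12_permit_total / housing_units * 1000
    return clamp(round_half_up(intensity / 0.6), 0, 100)
```

An intensity of 30 permits per 1,000 units maps to 50, and anything at or above 60 per 1,000 is clipped to 100.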
Price momentum signal
computed signal
score = clamp(round(50 + hpiYoYChangePct × 5), 0, 100)
0% YoY → 50. +10% → 100. −10% → 0.
Inputs: FHFA HPI YoY % change
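The price momentum mapping is linear around a midpoint of 50, which a short sketch makes concrete. Rounding halves up is an assumption, and the names are illustrative.

```python
import math


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def round_half_up(x: float) -> int:
    # Assumed rounding rule; the methodology text only says "round".
    return math.floor(x + 0.5)


def price_momentum_score(hpi_yoy_change_pct: float) -> int:
    # Each percentage point of year-over-year HPI change moves the score by 5 points.
    return clamp(round_half_up(50 + hpi_yoy_change_pct * 5), 0, 100)
```

Changes beyond plus or minus 10% are indistinguishable after clamping, so the score saturates in very hot or very cold markets.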
Climate risk signal
computed signal
score = round(femaNriRiskIndex) // 0-100, higher = more hazard exposure
Surfaces FEMA's National Risk Index directly, with no transformation beyond rounding. Direction follows FEMA: higher means more exposure. Kept separate from the positive signals so the semantics don't blur.
Inputs: FEMA NRI naturalHazardRiskIndex (county-level)
Job market signal
computed signal
level = clamp((8.5 − unemploymentRate) / (8.5 − 2.5) × 100, 0, 100)
trend = clamp((1.0 − unemploymentRateChange12m) / 2.0 × 100, 0, 100) (missing → 50)
jobMarket = round(level × 0.7 + trend × 0.3)
Positive 0-100 signal. Lower unemployment and a falling 12-month trend both raise it. Level weighted 70%, trend 30%. ≤2.5% → 100 on level; ≥8.5% → 0. −1.0 ppt YoY → 100 on trend; +1.0 ppt → 0.
Inputs: BLS LAUS unemploymentRate, unemploymentRateChange12m (county-level)
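The two-part job market formula, including the fallback to 50 when the 12-month change is missing, could be written as below. Rounding halves up is an assumption, and the function name is illustrative.

```python
import math
from typing import Optional


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def round_half_up(x: float) -> int:
    # Assumed rounding rule; the methodology text only says "round".
    return math.floor(x + 0.5)


def job_market_score(unemployment_rate: float, change_12m: Optional[float] = None) -> int:
    # Level: 2.5% unemployment or lower scores 100, 8.5% or higher scores 0.
    level = clamp((8.5 - unemployment_rate) / (8.5 - 2.5) * 100, 0, 100)
    # Trend: a missing 12-month change falls back to the neutral midpoint of 50.
    if change_12m is None:
        trend = 50.0
    else:
        trend = clamp((1.0 - change_12m) / 2.0 * 100, 0, 100)
    return round_half_up(level * 0.7 + trend * 0.3)
```

Because the level carries 70% of the weight, a missing trend can move the result by at most 15 points in either direction relative to a perfect or worst-case trend.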
Economic strength signal
computed signal
incomeGrowth = clamp((pcpiGrowthYoY + 2) / 8 × 100, 0, 100) [weight 0.45]
incomeLevel = clamp((perCapitaPersonalIncome − 40000) / 60000 × 100, 0, 100) [weight 0.35]
gdpGrowth = clamp((regionalGdpGrowthYoY + 2) / 8 × 100, 0, 100) [weight 0.20]
If gdpGrowth is missing, renormalize the two remaining weights.
economicStrength = round(weighted sum)
Positive 0-100 signal. Higher per-capita income, faster income growth, and faster regional GDP growth all raise it. Direction follows BEA: higher = stronger regional economic context. NOT investment advice.
Inputs: BEA Regional perCapitaPersonalIncome, perCapitaPersonalIncomeGrowthYoY (written pcpiGrowthYoY in the formula), and (optionally) regionalGdpGrowthYoY (MSA-level)
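The weighted sum and the renormalization step can be sketched as follows. Renormalization is implemented here by dividing by the sum of the weights actually present, which is one reasonable reading of "renormalize the two remaining weights"; rounding halves up is also an assumption, and the names are illustrative.

```python
import math
from typing import List, Optional, Tuple


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def round_half_up(x: float) -> int:
    # Assumed rounding rule; the methodology text only says "round".
    return math.floor(x + 0.5)


def economic_strength_score(
    pcpi_growth_yoy: float,
    per_capita_personal_income: float,
    regional_gdp_growth_yoy: Optional[float] = None,
) -> int:
    # (sub-score, weight) pairs for the components that are present.
    parts: List[Tuple[float, float]] = [
        (clamp((pcpi_growth_yoy + 2) / 8 * 100, 0, 100), 0.45),
        (clamp((per_capita_personal_income - 40000) / 60000 * 100, 0, 100), 0.35),
    ]
    if regional_gdp_growth_yoy is not None:
        parts.append((clamp((regional_gdp_growth_yoy + 2) / 8 * 100, 0, 100), 0.20))
    # Dividing by the total weight renormalizes when GDP growth is missing.
    total_weight = sum(weight for _, weight in parts)
    weighted = sum(score * weight for score, weight in parts)
    return round_half_up(weighted / total_weight)
```

With GDP growth missing, the income growth and income level weights become 0.45 / 0.80 and 0.35 / 0.80 respectively, so the result stays on the same 0-100 scale.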
Overall score
computed signal
positives = avg of available {affordability, supply, priceMomentum, jobMarket, economicStrength}
penalty = (climateRisk / 100) × 20
overall = clamp(positives − penalty, 0, 100)
Average of available positive sub-scores, minus a climate-risk penalty (max 20 points at FEMA risk 100). Missing positives are skipped (not substituted); missing climateRisk means no penalty. Direction stays consistent: higher overall = better risk-adjusted opportunity.
Inputs: Whichever sub-scores are present, plus climateRisk if FEMA NRI is loaded
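A sketch of the composite, showing how missing positives are skipped and how a missing climateRisk means no penalty. Returning None when no positive sub-score is available is an assumption, since the text does not say what happens in that case; the function name is illustrative.

```python
from typing import Dict, Optional


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def overall_score(positives: Dict[str, Optional[float]],
                  climate_risk: Optional[float] = None) -> Optional[float]:
    # Skip missing sub-scores instead of substituting a default.
    available = [v for v in positives.values() if v is not None]
    if not available:
        # Assumption: with no positive sub-scores there is nothing to average.
        return None
    average = sum(available) / len(available)
    # A climate risk of 100 costs the full 20 points; a missing value costs nothing.
    penalty = 0.0 if climate_risk is None else climate_risk / 100 * 20
    return clamp(average - penalty, 0, 100)
```

For example, sub-scores of 60 and 80 with one missing average to 70; a climateRisk of 50 then subtracts 10 points.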
Known signal-formula limitations
Supply score — MSA numerator over city denominator
The supply signal is computed as trailing12PermitTotal / housingUnits × 1,000, then scaled to a 0–100 score. Census BPS reports permits at the MSA level, while Census ACS reports housing units at the city / place level. For cities where the metro is much larger than the city proper, this puts a multi-county permit count over a single-city housing stock, inflating the ratio and clipping the score at the methodology ceiling of 100.
Markets currently at the supply ceiling under this formula include Austin, Dallas, Phoenix, Miami, Tampa, Boston, Washington DC, and Atlanta. Markets where the city proper is large enough to keep the ratio in range (Los Angeles, New York, Chicago, San Diego, San Francisco) score normally. This is a denominator-choice limitation, not a statement that the affected cities have the strongest new-construction supply in the country.
When the supply score is at the ceiling, the city / housing-market / investment-score pages surface a small inline disclosure linking back to this section.
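To show the size of the denominator effect, the sketch below applies the supply formula to the same permit count over two different housing-unit totals. All numbers are hypothetical and do not describe any market named above; rounding halves up is an assumption, and the function name is illustrative.

```python
import math


def unclamped_supply_score(permits: float, housing_units: float) -> int:
    """Supply score before clamping, so the inflation is visible."""
    intensity = permits / housing_units * 1000
    return math.floor(intensity / 0.6 + 0.5)  # assumed round-half-up


# Hypothetical inputs: one MSA-level permit count, two candidate denominators.
msa_permits = 40_000
city_housing_units = 450_000      # city / place level (mismatched boundary)
msa_housing_units = 1_000_000     # MSA level (matched boundary)

mismatched = unclamped_supply_score(msa_permits, city_housing_units)
matched = unclamped_supply_score(msa_permits, msa_housing_units)

# The published score is clamped to 0-100, which hides how far over the ceiling it is.
published_mismatched = min(100, max(0, mismatched))
```

In this made-up case the mismatched ratio would score well above the ceiling before clamping, while the boundary-matched ratio would sit comfortably inside the 0-100 range.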
Affordability score — rent-to-income only
The affordability signal uses Census ACS median gross rent and median household income, which makes it a rent-burden measure. It does not include a home-value-to-income term. Cities with extreme home prices but moderate ACS rent (San Francisco, New York) can score as affordable under this measure even though home-purchase affordability is dramatically worse than the rent burden suggests. This is consistent with how rent burden is reported by HUD and ACS, but readers should not treat a high affordability score as a home-purchase recommendation.
Climate risk — county-level aggregation
FEMA NRI is consumed at the county level in this project, not at the census-tract level that NRI also publishes. The climateRisk signal therefore reflects a county-aggregate expected-annual-loss percentile and does not infer property-level or neighborhood-level risk. Cities whose FEMA / BLS county context differs materially from the city boundary (NYC, Atlanta) surface this via the GeographyScopeNotice. The county-level limitation applies even where the city and county are coterminous (San Francisco, Washington DC).
Mixed-boundary markets
Most supported markets are mapped at a single geography level — a Census place that aligns reasonably well with the city's primary county and MSA. A few markets cannot be cleanly mapped at one level: their fixtures span a Census place, a county-equivalent, and a multi-state MSA, and that boundary mismatch materially affects how individual values should be read.
When a market's source fixtures cross boundaries like this, the city page, report pages, and public JSON all display a Mixed geography scope notice describing which source inputs are city-level, which are county-level, and which are MSA-level. Single-boundary markets render no notice at all. The notice is emitted as an optional geographyScope object in the JSON profile under the same schemaVersion: "1" contract.
See the current pipeline state
The public /data-status page shows the latest ETL run, supported markets, coverage by report, freshness, and any validation warnings.
Methodology version v5.0 · Last updated 2026-05-20
Related transparency
Data status
Latest ETL run, per-source freshness, and known limitations.
Data sources
Every source named, with update cadence and confidence.
Data catalog
Source-attributed fixture datasets, computed signals, coverage, and access notes.
Changelog
Shipped, partial, and planned work, without timeline promises.