
Methodology

ETL pipeline, source snapshots, and computed signals

Transparent end-to-end description of how Housing BuildDesignHub turns public datasets into source-attributed market profiles.

Data integrity notice

Housing BuildDesignHub currently uses source-attributed static fixtures to validate the data model and UI. Metrics are not live and should not be treated as investment, legal, financial, or property-level advice.

The pipeline

Four steps, in order, every run.

[Figure: four numbered nodes labeled Extract, Transform, Validate, and Load, connected by a dashed data-flow line.]
[Figure: abstract map with a central Housing data layer hub connected by spokes to eight labeled circles for Census, HUD, FEMA, NOAA, BLS, BEA, EPA, and NASA.]

Eight independent federal data agencies feed this project's data layer. The diagram is illustrative; each agency publishes its own data on its own cadence under its own terms.

  1. Extract

    Each registered source adapter loads a canonical SourceSnapshot for every supported geography. Snapshots carry an id, source id, dataset name, period, retrieved/release dates, source URL, update frequency, confidence, and the raw metric values themselves.

  2. Transform

    The transform step assembles a NormalizedMarketProfile per geography: metrics keyed by stable ids (population, medianGrossRent, homePriceIndex, …), computed signal scores, an aggregated source attribution roster, freshness info, and a coverage report.

  3. Validate

    Four validators run on the pipeline output: freshness (age vs. cadence), completeness (required metrics per report), geography (required identity fields), and metrics (units present, no impossible values). Findings are warnings — not build failures — and surface on /data-status.

  4. Load

    In Phase 3, the load step writes profiles into an in-memory store keyed by geography id. The interface is intentionally database-shaped so that a Postgres, KV, or S3 loader can replace it without changing the rest of the pipeline.
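The four steps can be sketched as one function. This is a minimal illustration under stated assumptions, not the project's code: `ProfileStore`, `InMemoryProfileStore`, `validate_metrics`, `run_pipeline`, and the dict-based snapshot and profile shapes are all hypothetical names, and only one toy validator (negative values) is shown.

```python
from typing import Optional, Protocol


class ProfileStore(Protocol):
    """Hypothetical database-shaped interface; any put/get backend fits."""

    def put(self, geography_id: str, profile: dict) -> None: ...

    def get(self, geography_id: str) -> Optional[dict]: ...


class InMemoryProfileStore:
    """In-memory loader keyed by geography id."""

    def __init__(self) -> None:
        self._profiles: dict = {}

    def put(self, geography_id: str, profile: dict) -> None:
        self._profiles[geography_id] = profile

    def get(self, geography_id: str) -> Optional[dict]:
        return self._profiles.get(geography_id)


def validate_metrics(profile: dict) -> list:
    """Return warnings instead of raising; only checks for negative values."""
    return [
        f"{profile['geographyId']}: {name} is negative"
        for name, value in profile["metrics"].items()
        if value < 0
    ]


def run_pipeline(adapters: list, geography_ids: list, store: ProfileStore) -> list:
    """Run extract, transform, validate, load for each geography."""
    warnings = []
    for geo_id in geography_ids:
        # Extract: each adapter is a callable returning one snapshot dict.
        snapshots = [adapter(geo_id) for adapter in adapters]
        # Transform: merge snapshot metrics into one profile per geography.
        metrics = {}
        for snapshot in snapshots:
            metrics.update(snapshot["metrics"])
        profile = {"geographyId": geo_id, "metrics": metrics}
        # Validate: findings are collected as warnings, not failures.
        warnings.extend(validate_metrics(profile))
        # Load: the profile is stored even when warnings exist.
        store.put(geo_id, profile)
    return warnings
```

Because `run_pipeline` only depends on the `put`/`get` shape, a persistent store could be passed in place of `InMemoryProfileStore` without touching the other three steps.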

Why public data

Background from the U.S. Census Bureau on the American Community Survey — the survey program behind the ACS 5-year tables this project uses. Embedded as agency context, not as a partnership or endorsement.

Core concepts

Source snapshots

The unit of work. A snapshot is a serializable record of one source's output for one geography at one point in time. Snapshots can be cached, persisted, replayed, and validated independently.
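A snapshot record might look like the sketch below. The field list follows the Extract step's description; the Python names, the explicit `geography_id` field, the string-typed dates, and the JSON helpers are assumptions made for illustration only.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SourceSnapshot:
    """One source's output for one geography at one point in time."""

    id: str
    source_id: str
    dataset_name: str
    geography_id: str      # assumed field: ties the snapshot to its geography
    period: str
    retrieved_date: str    # ISO 8601 date string (assumed representation)
    release_date: str
    source_url: str
    update_frequency: str
    confidence: str        # "high", "medium", or "experimental"
    metrics: dict          # raw metric values keyed by stable metric id


def to_json(snapshot: SourceSnapshot) -> str:
    """Serialize a snapshot so it can be cached, persisted, or replayed."""
    return json.dumps(asdict(snapshot), sort_keys=True)


def from_json(text: str) -> SourceSnapshot:
    """Rebuild a snapshot from its serialized form."""
    return SourceSnapshot(**json.loads(text))
```

A round trip through `to_json` and `from_json` should return an equal record, which is what makes independent replay and validation possible.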

Normalized market profiles

What the UI consumes. A profile is a per-geography object combining raw metrics, computed signals, source attributions, freshness, and coverage — built deterministically from the snapshot set.

Computed scores

Affordability, supply, price momentum, job market, and climate risk are derived from raw inputs using fixed, documented formulas. Each score is marked sourceId="computed" so it is never confused with an observed value.

Confidence levels

Each source carries a confidence rating — high, medium, or experimental — based on coverage, stability, and methodology. Signals leaning on experimental inputs are flagged.
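The flagging rule can be illustrated with a small helper. The function name, the source ids, and the ratings in the usage note are placeholders; the page does not list which sources hold which rating.

```python
CONFIDENCE_LEVELS = ("high", "medium", "experimental")


def is_flagged(input_source_ids: list, confidence_by_source: dict) -> bool:
    """Return True when any input source of a signal is rated experimental."""
    for source_id in input_source_ids:
        rating = confidence_by_source[source_id]
        if rating not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence rating: {rating}")
        if rating == "experimental":
            return True
    return False
```

For example, with placeholder ratings `{"source_a": "high", "source_b": "experimental"}`, a signal built from both sources would be flagged, while one built from `source_a` alone would not.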

Fixture vs live data

The current pipeline ships official-source fixtures (deterministic JSON snapshots whose shape matches the public release). The UI labels every metric's provenance as 'fixture' or 'live' — never as 'real-time'. Swapping to live data is a per-adapter change.

Future scheduled ETL

The same extract/transform/validate/load contract will drive a scheduled ETL run (Census API → adapter → snapshot store → profile store). The UI does not need to know whether snapshots came from a fixture or a scheduled run — only that they passed validation.
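One way to picture the per-adapter swap is a shared adapter contract, sketched below. `SourceAdapter`, `FixtureAdapter`, `LiveAdapter`, and `extract` are hypothetical names; the live adapter takes an injected fetch callable, so no real API call is shown or implied.

```python
from typing import Callable, Protocol


class SourceAdapter(Protocol):
    """Hypothetical contract shared by fixture and live adapters."""

    provenance: str  # "fixture" or "live"

    def load(self, geography_id: str) -> dict: ...


class FixtureAdapter:
    """Serves deterministic snapshots from an in-memory fixture table."""

    provenance = "fixture"

    def __init__(self, fixtures: dict) -> None:
        self._fixtures = fixtures

    def load(self, geography_id: str) -> dict:
        return self._fixtures[geography_id]


class LiveAdapter:
    """Delegates to an injected fetch function (a stand-in for a scheduled run)."""

    provenance = "live"

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch

    def load(self, geography_id: str) -> dict:
        return self._fetch(geography_id)


def extract(adapter: SourceAdapter, geography_id: str) -> dict:
    """Load a snapshot and stamp it with the adapter's provenance label."""
    snapshot = dict(adapter.load(geography_id))
    snapshot["provenance"] = adapter.provenance
    return snapshot
```

Downstream steps see the same snapshot shape either way; only the `provenance` label differs.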

Climate risk — separate from opportunity

The climateRisk signal mirrors FEMA's National Risk Index: higher = more hazard exposure. To keep semantics from blurring, climateRisk is shown on its own (never inverted) and only enters the overall composite as a 0-20 point penalty. The underlying mapping is county-level — Phase 4A does not yet use the tract-level NRI release.

Signal formulas

  • Affordability signal

    computed signal

    rentBurden = annualGrossRent / medianHouseholdIncome; score = clamp(round((0.30 − rentBurden) / 0.30 × 100 + 50), 0, 100)

    rentBurden = 0.30 → 50. Lower burden → higher score.

    Inputs: ACS medianGrossRent, ACS medianHouseholdIncome

  • Supply signal

    computed signal

    intensity = (trailing12PermitTotal / housingUnits) × 1,000; score = clamp(round(intensity / 0.6), 0, 100)

    ~60 permits per 1,000 existing units per year ≈ 100.

    Inputs: Census BPS trailing-12 permits, ACS housing units

  • Price momentum signal

    computed signal

    score = clamp(round(50 + hpiYoYChangePct × 5), 0, 100)

    0% YoY → 50. +10% → 100. −10% → 0.

    Inputs: FHFA HPI YoY % change

  • Climate risk signal

    computed signal

    score = round(femaNriRiskIndex) // 0-100, higher = more hazard exposure

    Surfaces FEMA's National Risk Index directly — no transformation. Direction follows FEMA: higher means more exposure. Kept separate from the positive signals so the semantics don't blur.

    Inputs: FEMA NRI naturalHazardRiskIndex (county-level)

  • Job market signal

    computed signal

    level = clamp((8.5 − unemploymentRate) / (8.5 − 2.5) × 100, 0, 100); trend = clamp((1.0 − unemploymentRateChange12m) / 2.0 × 100, 0, 100) (missing → 50); jobMarket = round(level × 0.7 + trend × 0.3)

    Positive 0-100 signal. Lower unemployment and a falling 12-month trend both raise it. Level weighted 70%, trend 30%. ≤2.5% → 100 on level; ≥8.5% → 0. −1.0 ppt YoY → 100 on trend; +1.0 ppt → 0.

    Inputs: BLS LAUS unemploymentRate, unemploymentRateChange12m (county-level)

  • Economic strength signal

    computed signal

    incomeGrowth = clamp((pcpiGrowthYoY + 2) / 8 × 100, 0, 100) [weight 0.45]; incomeLevel = clamp((perCapitaPersonalIncome − 40000) / 60000 × 100, 0, 100) [weight 0.35]; gdpGrowth = clamp((regionalGdpGrowthYoY + 2) / 8 × 100, 0, 100) [weight 0.20]; if gdpGrowth missing, renormalize the two remaining weights; economicStrength = round(weighted sum)

    Positive 0-100 signal. Higher per-capita income, faster income growth, and faster regional GDP growth all raise it. Direction follows BEA: higher = stronger regional economic context. NOT investment advice.

    Inputs: BEA Regional perCapitaPersonalIncome, perCapitaPersonalIncomeGrowthYoY, and (optionally) regionalGdpGrowthYoY (MSA-level)

  • Overall score

    computed signal

    positives = avg of available {affordability, supply, priceMomentum, jobMarket, economicStrength}; penalty = (climateRisk / 100) × 20; overall = clamp(positives − penalty, 0, 100)

    Average of available positive sub-scores, minus a climate-risk penalty (max 20 points at FEMA risk 100). Missing positives are skipped (not substituted); missing climateRisk means no penalty. Direction stays consistent: higher overall = better risk-adjusted opportunity.

    Inputs: Whichever sub-scores are present, plus climateRisk if FEMA NRI is loaded
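The formulas above can be transcribed into a short sketch. Several details are assumptions rather than statements from this page: `rnd` rounds halves up (the formulas only say round), annualGrossRent is taken as 12 times the monthly ACS median gross rent, and `overall` returns None when no positive sub-score is available. The Python function and parameter names are illustrative.

```python
import math


def clamp(x, lo=0.0, hi=100.0):
    return max(lo, min(hi, x))


def rnd(x):
    """Round to the nearest integer, halves up (assumed rounding mode)."""
    return math.floor(x + 0.5)


def affordability(monthly_gross_rent, median_household_income):
    # Assumption: annualGrossRent = monthly ACS median gross rent x 12.
    rent_burden = monthly_gross_rent * 12 / median_household_income
    return clamp(rnd((0.30 - rent_burden) / 0.30 * 100 + 50))


def supply(trailing12_permit_total, housing_units):
    intensity = trailing12_permit_total / housing_units * 1000
    return clamp(rnd(intensity / 0.6))


def price_momentum(hpi_yoy_change_pct):
    return clamp(rnd(50 + hpi_yoy_change_pct * 5))


def climate_risk(fema_nri_risk_index):
    return rnd(fema_nri_risk_index)  # higher = more hazard exposure


def job_market(unemployment_rate, change_12m=None):
    level = clamp((8.5 - unemployment_rate) / (8.5 - 2.5) * 100)
    # A missing 12-month change falls back to a neutral trend of 50.
    trend = 50.0 if change_12m is None else clamp((1.0 - change_12m) / 2.0 * 100)
    return rnd(level * 0.7 + trend * 0.3)


def economic_strength(pcpi, pcpi_growth_yoy, gdp_growth_yoy=None):
    parts = [
        (clamp((pcpi_growth_yoy + 2) / 8 * 100), 0.45),
        (clamp((pcpi - 40000) / 60000 * 100), 0.35),
    ]
    if gdp_growth_yoy is not None:
        parts.append((clamp((gdp_growth_yoy + 2) / 8 * 100), 0.20))
    # Dividing by the weight total renormalizes when GDP growth is missing.
    total_weight = sum(weight for _, weight in parts)
    return rnd(sum(value * weight for value, weight in parts) / total_weight)


def overall(positives, climate=None):
    present = [p for p in positives if p is not None]
    if not present:
        return None  # assumption: undefined when no positive score exists
    penalty = 0.0 if climate is None else climate / 100 * 20
    return clamp(sum(present) / len(present) - penalty)
```

As a check against the stated anchors: a 30% rent burden gives `affordability` 50, 60 permits per 1,000 units gives `supply` 100, and positives averaging 60 with a climate risk of 100 give an `overall` of 40.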

Known signal-formula limitations

Supply score — MSA numerator over city denominator

The supply signal is computed as trailing12PermitTotal / housingUnits × 1,000 and then scaled to a 0–100 score. Census BPS reports permits at the MSA level, while Census ACS reports housing units at the city / place level. For cities where the metro is much larger than the city proper, this puts a multi-county permit count over a single-city housing stock, inflating the ratio and clipping the score at the methodology ceiling of 100.

Markets currently at the supply ceiling under this formula include Austin, Dallas, Phoenix, Miami, Tampa, Boston, Washington DC, and Atlanta. Markets where the city proper is large enough to keep the ratio in range (Los Angeles, New York, Chicago, San Diego, San Francisco) score normally. This is a denominator-choice limitation, not a statement that the affected cities have the strongest new-construction supply in the country.

When the supply score is at the ceiling, the city / housing-market / investment-score pages surface a small inline disclosure linking back to this section.
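A worked example shows how the mismatch reaches the ceiling. The permit and housing-unit counts below are invented for illustration and do not describe any real market; the MSA-wide denominator is included only for contrast and is not something this methodology states it uses. Rounding halves up is also an assumption.

```python
import math


def supply_score(permits, housing_units):
    """Return (clamped score, unclamped score) for the supply formula."""
    intensity = permits / housing_units * 1000
    raw = math.floor(intensity / 0.6 + 0.5)  # assumed: round halves up
    return max(0, min(100, raw)), raw


# Invented figures: an MSA-wide permit count and two possible denominators.
MSA_PERMITS = 40_000
CITY_HOUSING_UNITS = 400_000
MSA_HOUSING_UNITS = 1_000_000

# MSA numerator over city denominator, as in the current formula.
mixed_score, mixed_raw = supply_score(MSA_PERMITS, CITY_HOUSING_UNITS)

# Same numerator over an MSA-wide denominator, for comparison only.
aligned_score, aligned_raw = supply_score(MSA_PERMITS, MSA_HOUSING_UNITS)

# The ceiling disclosure would apply to the mixed case.
at_ceiling = mixed_score == 100
```

With these figures the mixed ratio yields an unclamped 167 that clips to 100, while the aligned ratio yields 67.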

Affordability score — rent-to-income only

The affordability signal uses Census ACS median gross rent and median household income, which makes it a rent-burden measure. It does not include a home-value-to-income term. Cities with extreme home prices but moderate ACS rent (San Francisco, New York) can score as affordable under this measure even though home-purchase affordability is dramatically worse than the rent burden suggests. This is consistent with how rent burden is reported by HUD and ACS, but readers should not treat a high affordability score as a home-purchase recommendation.

Climate risk — county-level aggregation

This project consumes FEMA NRI at the county level, not at the census-tract level that NRI also publishes. The climateRisk signal therefore reflects a county-aggregate expected-annual-loss percentile and does not infer property-level or neighborhood-level risk. Cities whose FEMA / BLS county context differs materially from the city boundary (NYC, Atlanta) surface this via the GeographyScopeNotice. The county-level limitation applies even where the city and county are coterminous (San Francisco, Washington DC).

Mixed-boundary markets

Most supported markets are mapped at a single geography level — a Census place that aligns reasonably well with the city's primary county and MSA. A few markets cannot be cleanly mapped at one level: their fixtures span a Census place, a county-equivalent, and a multi-state MSA, and that boundary mismatch materially affects how individual values should be read.

When a market's source fixtures cross boundaries like this, the city page, report pages, and public JSON all display a Mixed geography scope notice describing which source inputs are city-level, which are county-level, and which are MSA-level. Single-boundary markets render no notice at all. The notice is emitted as an optional geographyScope object in the JSON profile under the same schemaVersion: "1" contract.
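The optional-object behavior can be sketched as follows. Only `geographyScope` and `schemaVersion: "1"` come from this page; the inner `levels` layout, the `geographyId` key, and the notice wording are hypothetical. The source-to-level assignments in the example follow the levels named elsewhere in this methodology.

```python
from typing import Optional


def scope_notice(profile: dict) -> Optional[str]:
    """Build notice text for mixed-boundary markets; None for single-boundary."""
    scope = profile.get("geographyScope")
    if scope is None:
        return None  # single-boundary markets render no notice
    parts = [
        f"{level}-level: {', '.join(sources)}"
        for level, sources in scope["levels"].items()
        if sources
    ]
    return "Mixed geography scope. " + "; ".join(parts)


# Hypothetical profiles under the same schemaVersion contract.
single_boundary = {"schemaVersion": "1", "geographyId": "example-single"}
mixed_boundary = {
    "schemaVersion": "1",
    "geographyId": "example-mixed",
    "geographyScope": {
        "levels": {
            "city": ["ACS"],
            "county": ["FEMA NRI", "BLS LAUS"],
            "msa": ["Census BPS", "BEA Regional"],
        }
    },
}
```

Because the object is optional, consumers that ignore `geographyScope` keep working against the same schema version.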

See the current pipeline state

The public /data-status page shows the latest ETL run, supported markets, coverage by report, freshness, and any validation warnings.

Methodology version v5.0 · Last updated 2026-05-20
