Master Data Management (MDM)

Entity resolution, golden records, survivorship, stewardship, hierarchies, and consent — all native to Databricks Unity Catalog.

Overview

The MDM portal under /mdm/* is the first open-source Databricks-native MDM. It treats a Delta-backed registry as the source of truth instead of a separate vendor system, so the standard Unity Catalog tooling (lineage, permissions, audit) applies to MDM data without extra integration.

Source: src/mdm.py · src/mdm_store.py · /api/mdm · UI under /mdm/*.

Pages

| URL | Purpose |
| --- | --- |
| /mdm | Overview: current entity counts, recent merge activity, health by domain |
| /mdm/match-merge | Build & run match-merge jobs (entity resolution) |
| /mdm/golden-records | Curated golden records per entity, with field-level provenance |
| /mdm/merge-history | Audit trail of every merge / un-merge with reason and reviewer |
| /mdm/stewardship | Steward queue: pending merges that need human review |
| /mdm/hierarchies | Hierarchy management (legal entity tree, product taxonomy, etc.) |
| /mdm/cross-domain | Cross-domain matching (customer ↔ contact ↔ account) |
| /mdm/profiling | Per-attribute profiling: completeness, uniqueness, validity |
| /mdm/scorecards | DQ scorecards per entity domain |
| /mdm/reference-data | Reference data CRUD (country codes, currency codes, account types) |
| /mdm/consent | Consent management: track GDPR / CCPA consent flags |
| /mdm/negative-match | Negative match library: known-not-the-same pairs to avoid false merges |
| /mdm/audit-log | Append-only audit log of every MDM operation |
| /mdm/templates | Industry templates (Healthcare, Financial, Retail, Manufacturing) |
| /mdm/reports | Pre-built reports for stewardship, DQ, hierarchy completeness |
| /mdm/settings | Match thresholds, survivorship rules, scheduler config |

Entity resolution

Six match-type strategies are bundled in src/mdm.py:

| Match type | Best for |
| --- | --- |
| exact | Hashable IDs (email after normalisation) |
| phonetic | Person names with spelling variants (Soundex / Metaphone) |
| fuzzy_string | Free-text address lines (Jaro-Winkler / token-set ratio) |
| numeric_window | Money / dates within a tolerance |
| geo_distance | Lat/lon within a radius |
| composite | Weighted combination of multiple matchers |
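
To make the match types above concrete, here is a minimal standard-library sketch of how each one could score a pair of values. It is not the code in src/mdm.py: the function signatures are hypothetical, `fuzzy_string` substitutes `difflib.SequenceMatcher` for Jaro-Winkler / token-set ratio, and `phonetic` implements Soundex only.

```python
import math
from difflib import SequenceMatcher


def exact(a: str, b: str) -> float:
    """1.0 when the values are identical after trimming and lower-casing."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def soundex(name: str) -> str:
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    groups = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
    codes = {ch: digit for digit, chars in groups.items() for ch in chars}
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (out + "000")[:4]


def phonetic(a: str, b: str) -> float:
    """1.0 when both names share a non-empty Soundex code."""
    code = soundex(a)
    return 1.0 if code and code == soundex(b) else 0.0


def fuzzy_string(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher stands in for Jaro-Winkler here."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_window(a: float, b: float, tolerance: float) -> float:
    """1.0 when the two numbers differ by at most `tolerance`."""
    return 1.0 if abs(a - b) <= tolerance else 0.0


def haversine_km(p: tuple, q: tuple) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def geo_distance(p: tuple, q: tuple, radius_km: float) -> float:
    """1.0 when the two points lie within `radius_km` of each other."""
    return 1.0 if haversine_km(p, q) <= radius_km else 0.0


def composite(scores: dict, weights: dict) -> float:
    """Weighted average of per-attribute scores (weights need not sum to 1)."""
    total = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total
```

A composite score for a customer pair might then combine, for example, `exact` on email with `phonetic` on surname, weighting the email match more heavily.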

A match-merge run produces three outputs:

  1. High-confidence merges: auto-applied to the golden record (score above auto_merge_threshold)
  2. Steward queue: borderline cases routed to /mdm/stewardship for human review
  3. Rejected pairs: clearly different; written to the negative-match library so they're never re-evaluated
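
The three-way split above amounts to comparing a pair's match score against two cut-offs. The sketch below assumes scores in [0, 1]; `auto_merge_threshold` is the setting named above, while `review_threshold`, both default values, and the inclusive boundary handling are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto_merge_threshold: float = 0.92  # illustrative value, not a documented default
    review_threshold: float = 0.70      # hypothetical lower cut-off for steward review


def route_pair(score: float, t: Thresholds = Thresholds()) -> str:
    """Classify a candidate pair into one of the three match-merge outputs."""
    if score >= t.auto_merge_threshold:
        return "auto_merge"      # applied to the golden record
    if score >= t.review_threshold:
        return "steward_queue"   # routed to /mdm/stewardship
    return "rejected"            # recorded in the negative-match library


# Group a batch of scored pairs by outcome.
scored_pairs = [("c1", "c2", 0.97), ("c3", "c4", 0.81), ("c5", "c6", 0.12)]
outcomes = {}
for left, right, score in scored_pairs:
    outcomes.setdefault(route_pair(score), []).append((left, right))
```

Raising `review_threshold` shrinks the steward queue at the cost of rejecting more borderline pairs outright, so the two values are best tuned together.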

Survivorship rules

When two records merge, survivorship decides which value wins per field. Default rules:

| Strategy | Behaviour |
| --- | --- |
| most_recent | Pick the value with the latest updated_at |
| most_complete | Longest non-null string, highest non-null number |
| source_priority | Pick by source system (CRM > ERP > web) |
| voting | Most-frequent value across all source records |
| aggregate | Sum / max / list (for numerics, multi-value fields) |
| manual | Steward picks |
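
As an illustration of the table above, the following sketch applies four of the strategies to the competing values for a single field. The candidate record shape (`value`, `source`, `updated_at`) and the function signatures are assumptions made for this example, not the schema used by src/mdm.py; `aggregate` and `manual` are omitted.

```python
from collections import Counter
from datetime import date


def most_recent(candidates):
    """Value from the record with the latest updated_at."""
    return max(candidates, key=lambda c: c["updated_at"])["value"]


def most_complete(candidates):
    """Longest non-null string, or the highest non-null number."""
    values = [c["value"] for c in candidates if c["value"] is not None]
    if all(isinstance(v, str) for v in values):
        return max(values, key=len)
    return max(values)


def source_priority(candidates, order=("CRM", "ERP", "web")):
    """Value from the highest-priority source that supplied a non-null value."""
    live = [c for c in candidates if c["value"] is not None]
    return min(live, key=lambda c: order.index(c["source"]))["value"]


def voting(candidates):
    """Most frequent non-null value across all source records."""
    values = [c["value"] for c in candidates if c["value"] is not None]
    return Counter(values).most_common(1)[0][0]


# Three source records disagreeing on a company name (made-up data).
candidates = [
    {"value": "Acme Corp", "source": "ERP", "updated_at": date(2024, 1, 5)},
    {"value": "Acme Corporation", "source": "web", "updated_at": date(2024, 3, 1)},
    {"value": "Acme Corp", "source": "CRM", "updated_at": date(2023, 11, 20)},
]

winners = {
    "most_recent": most_recent(candidates),
    "most_complete": most_complete(candidates),
    "source_priority": source_priority(candidates),
    "voting": voting(candidates),
}
```

With this data, `most_recent` and `most_complete` both select the web value, while `source_priority` and `voting` select "Acme Corp", which is why the rule is chosen per field rather than globally.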

Configure rules per field in the UI at /mdm/settings or via POST /api/mdm/survivorship-rules.

Templates

Industry templates pre-load the entity model (attributes + match rules + survivorship config) for common domains:

  • Healthcare — Patient, Provider, Payer, Encounter
  • Financial — Customer, Account, Transaction, Counterparty
  • Retail — Customer, Product, Order, Address, Loyalty Member
  • Manufacturing — Supplier, Part, Plant, Work Order

Apply via /mdm/templates or POST /api/mdm/templates/apply with { template_id, domain_name }.
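
A minimal sketch of the API call, using only the standard library. The endpoint path and the two body fields come from the line above; the base URL, the example `template_id` and `domain_name` values, and the absence of authentication headers are assumptions to adapt to your deployment.

```python
import json
import urllib.request


def build_apply_template_request(base_url: str, template_id: str, domain_name: str):
    """Build (but do not send) the POST request that applies an industry template."""
    body = json.dumps({"template_id": template_id, "domain_name": domain_name})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/mdm/templates/apply",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and identifiers, for illustration only.
req = build_apply_template_request("http://localhost:8000", "healthcare", "patient_mdm")

# To send it against a running instance:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before anything is changed in the registry.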

Storage layout

All MDM state lives in clone_audit.mdm.*:

| Table | Contents |
| --- | --- |
| entities | One row per entity domain (Customer, Product, …) |
| golden_records | Curated master records with field-level provenance |
| merge_history | Every merge / un-merge with reason and reviewer |
| stewardship_queue | Pending borderline matches |
| hierarchies | Closure-table hierarchy (parent_id, child_id, depth) |
| reference_data | Versioned lookup lists |
| consent | Consent flags per subject + purpose + jurisdiction |

Query directly for custom dashboards; the UI at /mdm/audit-log already exposes the audit table with filters.
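
For readers new to the closure-table layout used by hierarchies, the in-memory sketch below shows how (parent_id, child_id, depth) rows answer "all descendants of X" with a single filter instead of a recursive walk. It assumes that parent_id holds any ancestor (not only the direct parent), that each node has a self-row at depth 0, and that the hierarchy is a tree; the actual table may follow different conventions or carry extra columns.

```python
def build_closure(edges):
    """Expand direct (parent, child) edges into closure rows (parent_id, child_id, depth).

    Assumes an acyclic hierarchy; includes a depth-0 self-row for every node.
    """
    children = {}
    nodes = set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))

    rows = []
    for root in sorted(nodes):
        stack = [(root, 0)]
        while stack:
            node, depth = stack.pop()
            rows.append((root, node, depth))
            stack.extend((c, depth + 1) for c in children.get(node, []))
    return rows


def descendants(rows, ancestor):
    """All strict descendants of `ancestor` with their depth, via one filter."""
    return sorted((child, depth) for parent, child, depth in rows
                  if parent == ancestor and depth > 0)


# A small made-up legal-entity tree.
edges = [("HoldCo", "EMEA"), ("HoldCo", "APAC"), ("EMEA", "Germany")]
rows = build_closure(edges)
subtree = descendants(rows, "HoldCo")
```

Under the same assumptions, the equivalent dashboard query is a plain filter on parent_id against clone_audit.mdm.hierarchies, with depth available for indentation or level limits.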

Related

  • Governance: data dictionary, certifications, change history
  • DSAR: discover personal data across the MDM golden records
  • RTBF: erasure cascades through golden records and merge history
  • Compliance Frameworks: consent management feeds GDPR Article 30 evidence