DEMO DATAArchitecture overview of the MDM pipeline. All data is synthetic.

MDM Pipeline Architecture

How 5 separate ERP systems become one unified customer view

Source Systems

SQL Server

ERP 1 — UK Distributor

erp1.customer, erp1.sales

ERP 2 — US East Coast

erp2.customer, erp2.sales

ERP 3 — US West / Govt

erp3.customer, erp3.sales

ERP 4 — Canada / Mixed

erp4.customer, erp4.sales

ERP 5 — US Southeast MRO

erp5.customer, erp5.sales

SQL Server 2022 Express · 5 schemas · Different column naming per ERP

Extract

Node.js

extract_erp.js

Pulls customer + sales from all 5 ERP schemas

Node.js · mssql + pg drivers · Full extract, drop & recreate

Bronze Layer

PostgreSQL

bronze_erp1

customer, sales — raw column names preserved

bronze_erp2

customer, sales — CustNum, OrdId, etc.

bronze_erp3

customer, sales — customer_number, order_id

bronze_erp4

customer, sales — cust_key, so_num

bronze_erp5

customer, sales — CUST_ID, ORD_NUM

PostgreSQL · 5 schemas · 10 tables · Raw data, no transformations

Silver Layer

dbt Core

silver_erp.erp1_customer → erp5_customer

All normalized to: source_system, customer_name, city, state_province, postal_code, tax_id, email

silver_erp.erp1_sales → erp5_sales

All normalized to: source_system, order_id, source_customer_id, order_date, order_amount

dbt Core · 10 SQL models · Column name normalization · NULL handling for missing fields

Splink Matching

Python + Splink

Probabilistic Record Linkage

Jaro-Winkler on customer names · Exact match on tax_id · Levenshtein on postal_code · Blocking on name, tax_id, city+state

EM Training

Expectation-maximization estimates match/non-match probabilities from the data itself

Clustering

Pairwise predictions clustered at 70% match probability threshold

Python · Splink · DuckDB backend · Fellegi-Sunter model · 333 records → 251 golden

MDM Layer

Golden Records

mdm.customer_golden

251 unified customer records — one canonical name, address, tax ID per entity

mdm.customer_xref

333 rows mapping every source record to its golden customer ID

mdm.customer_clusters

Raw Splink cluster assignments with match metadata

PostgreSQL mdm schema · Best-record selection by name completeness · 82 duplicates resolved (24.6%)

Match Examples

How Splink resolved different name variations into a single golden record

→THE BOEING COMPANYGolden Record

8 records matched

erp2·Boeing Co.

erp2·BOEING CO.

erp3·BOEING COMPANY

erp4·Boeing

erp4·BOEING COMPANY

erp5·THE BOEING COMPANY

erp5·Boeing Co

erp3·THE BOEING COMPANY

→RAYTHEON TECHNOLOGIES CORPGolden Record

6 records matched

erp2·RTX Corporation

erp2·RTX Corp

erp3·RAYTHEON TECHNOLOGIES CORP

erp4·Raytheon Technologies

erp4·Raytheon

erp5·Raytheon Tech

→Lockheed Martin CorporationGolden Record

5 records matched

erp2·Lockheed Martin Corporation

erp3·Lockheed Martin

erp3·LOCKHEED MARTIN CORP

erp4·Lockheed Martin

erp5·Lockheed Martin Corp

Technology Stack

SQL Server 2022

Source ERP databases

Node.js

Data extraction layer

PostgreSQL

Data warehouse (Bronze → MDM)

dbt Core

Silver normalization models

Python + Splink

Probabilistic record linkage

DuckDB

Splink compute backend

Fastify

REST API serving MDM data

Next.js

Dashboard & visualization

Pipeline Metrics

Source ERPs

SQL Server schemas

333

Source Records

Customer records extracted

251

Golden Records

Unique entities identified

24.6%

Dedup Rate

82 duplicates resolved

dbt Models

Silver normalization layer

Blocking Rules

name, tax_id, postal, city+state

70%

Match Threshold

Cluster probability cutoff

922

Sales Orders

Linked to golden customers