DEMO DATAArchitecture overview of the MDM pipeline. All data is synthetic.

MDM Pipeline Architecture

How 5 separate ERP systems become one unified customer view

1

Source Systems

SQL Server

ERP 1 — UK Distributor

erp1.customer, erp1.sales

ERP 2 — US East Coast

erp2.customer, erp2.sales

ERP 3 — US West / Govt

erp3.customer, erp3.sales

ERP 4 — Canada / Mixed

erp4.customer, erp4.sales

ERP 5 — US Southeast MRO

erp5.customer, erp5.sales

SQL Server 2022 Express · 5 schemas · Different column naming per ERP

2

Extract

Node.js

extract_erp.js

Pulls customer + sales from all 5 ERP schemas

Node.js · mssql + pg drivers · Full extract, drop & recreate

3

Bronze Layer

PostgreSQL

bronze_erp1

customer, sales — raw column names preserved

bronze_erp2

customer, sales — CustNum, OrdId, etc.

bronze_erp3

customer, sales — customer_number, order_id

bronze_erp4

customer, sales — cust_key, so_num

bronze_erp5

customer, sales — CUST_ID, ORD_NUM

PostgreSQL · 5 schemas · 10 tables · Raw data, no transformations

4

Silver Layer

dbt Core

silver_erp.erp1_customer → erp5_customer

All normalized to: source_system, customer_name, city, state_province, postal_code, tax_id, email

silver_erp.erp1_sales → erp5_sales

All normalized to: source_system, order_id, source_customer_id, order_date, order_amount

dbt Core · 10 SQL models · Column name normalization · NULL handling for missing fields

5

Splink Matching

Python + Splink

Probabilistic Record Linkage

Jaro-Winkler on customer names · Exact match on tax_id · Levenshtein on postal_code · Blocking on name, tax_id, city+state

EM Training

Expectation-maximization estimates match/non-match probabilities from the data itself

Clustering

Pairwise predictions clustered at 70% match probability threshold

Python · Splink · DuckDB backend · Fellegi-Sunter model · 333 records → 251 golden

6

MDM Layer

Golden Records

mdm.customer_golden

251 unified customer records — one canonical name, address, tax ID per entity

mdm.customer_xref

333 rows mapping every source record to its golden customer ID

mdm.customer_clusters

Raw Splink cluster assignments with match metadata

PostgreSQL mdm schema · Best-record selection by name completeness · 82 duplicates resolved (24.6%)

Match Examples

How Splink resolved different name variations into a single golden record

THE BOEING COMPANYGolden Record
8 records matched
erp2·Boeing Co.
erp2·BOEING CO.
erp3·BOEING COMPANY
erp4·Boeing
erp4·BOEING COMPANY
erp5·THE BOEING COMPANY
erp5·Boeing Co
erp3·THE BOEING COMPANY
RAYTHEON TECHNOLOGIES CORPGolden Record
6 records matched
erp2·RTX Corporation
erp2·RTX Corp
erp3·RAYTHEON TECHNOLOGIES CORP
erp4·Raytheon Technologies
erp4·Raytheon
erp5·Raytheon Tech
Lockheed Martin CorporationGolden Record
5 records matched
erp2·Lockheed Martin Corporation
erp3·Lockheed Martin
erp3·LOCKHEED MARTIN CORP
erp4·Lockheed Martin
erp5·Lockheed Martin Corp

Technology Stack

SQL Server 2022

Source ERP databases

Node.js

Data extraction layer

PostgreSQL

Data warehouse (Bronze → MDM)

dbt Core

Silver normalization models

Python + Splink

Probabilistic record linkage

DuckDB

Splink compute backend

Fastify

REST API serving MDM data

Next.js

Dashboard & visualization

Pipeline Metrics

5

Source ERPs

SQL Server schemas

333

Source Records

Customer records extracted

251

Golden Records

Unique entities identified

24.6%

Dedup Rate

82 duplicates resolved

10

dbt Models

Silver normalization layer

4

Blocking Rules

name, tax_id, postal, city+state

70%

Match Threshold

Cluster probability cutoff

922

Sales Orders

Linked to golden customers