How 5 separate ERP systems become one unified customer view
ERP 1 — UK Distributor
erp1.customer, erp1.sales
ERP 2 — US East Coast
erp2.customer, erp2.sales
ERP 3 — US West / Govt
erp3.customer, erp3.sales
ERP 4 — Canada / Mixed
erp4.customer, erp4.sales
ERP 5 — US Southeast MRO
erp5.customer, erp5.sales
SQL Server 2022 Express · 5 schemas · Different column naming per ERP
extract_erp.js
Pulls customer + sales from all 5 ERP schemas
Node.js · mssql + pg drivers · Full extract, drop & recreate
bronze_erp1
customer, sales — raw column names preserved
bronze_erp2
customer, sales — CustNum, OrdId, etc.
bronze_erp3
customer, sales — customer_number, order_id
bronze_erp4
customer, sales — cust_key, so_num
bronze_erp5
customer, sales — CUST_ID, ORD_NUM
PostgreSQL · 5 schemas · 10 tables · Raw data, no transformations
silver_erp.erp1_customer → erp5_customer
All normalized to: source_system, customer_name, city, state_province, postal_code, tax_id, email
silver_erp.erp1_sales → erp5_sales
All normalized to: source_system, order_id, source_customer_id, order_date, order_amount
dbt Core · 10 SQL models · Column name normalization · NULL handling for missing fields
Probabilistic Record Linkage
Jaro-Winkler on customer names · Exact match on tax_id · Levenshtein on postal_code · Blocking on name, tax_id, city+state
EM Training
Expectation-maximization estimates match/non-match probabilities from the data itself
Clustering
Pairwise predictions clustered at 70% match probability threshold
Python · Splink · DuckDB backend · Fellegi-Sunter model · 333 records → 251 golden
mdm.customer_golden
251 unified customer records — one canonical name, address, tax ID per entity
mdm.customer_xref
333 rows mapping every source record to its golden customer ID
mdm.customer_clusters
Raw Splink cluster assignments with match metadata
PostgreSQL mdm schema · Best-record selection by name completeness · 82 duplicates resolved (24.6%)
How Splink resolved different name variations into a single golden record
SQL Server 2022
Source ERP databases
Node.js
Data extraction layer
PostgreSQL
Data warehouse (Bronze → MDM)
dbt Core
Silver normalization models
Python + Splink
Probabilistic record linkage
DuckDB
Splink compute backend
Fastify
REST API serving MDM data
Next.js
Dashboard & visualization
5
Source ERPs
SQL Server schemas
333
Source Records
Customer records extracted
251
Golden Records
Unique entities identified
24.6%
Dedup Rate
82 duplicates resolved
10
dbt Models
Silver normalization layer
4
Blocking Rules
name, tax_id, postal, city+state
70%
Match Threshold
Cluster probability cutoff
922
Sales Orders
Linked to golden customers