RE: How can I group near duplicate records under a new common field?

SSC-Insane

Points: 23781

August 28, 2015 at 6:11 pm

This 1st section is just setting up the test data and creating a couple of temp tables that'll be needed further down in the script.

The 1st table, #People, will obviously be replaced by you you own production table.

The next two temp tables (#MatchScores & #BestMatches) are just working tables... So, there should be no need to create them as permanent tables.

-- Create the necessary tables

IF OBJECT_ID('tempdb..#People') IS NOT NULL

DROP TABLE #People;

CREATE TABLE #People (

PersonID INT IDENTITY(1,1),

FirstName VARCHAR(20),

LastName VARCHAR(20),

SSN VARCHAR(9),

Address_1 VARCHAR(30),

Address_2 VARCHAR(30),

City VARCHAR(30),

[State] CHAR(2),

Zip CHAR(5),

GroupID INT,

MatchQuality INT

);

IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

DROP TABLE #MatchScores;

CREATE TABLE #MatchScores (

ParentID INT,

ChildID INT,

FirstNameScore INT,

LastNameScore INT,

SSN_Score INT,

Add1_Score INT,

Add2_Score INT,

CityScore INT,

StateScore INT,

ZipScore INT,

TotalScore INT

);

IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

DROP TABLE #BestMatches;

CREATE TABLE #BestMatches (

ParentID INT,

ChildID INT,

BestScore INT,

Lft INT,

Rgt INT

);

INSERT #People (FirstName,LastName,SSN,Address_1,Address_2,City,State,Zip) VALUES

('Bob','Jones','333224444','258 Grants Trail','#3','Bowling Green','KY','45459'),

('Robert','Jones','333224444','258 Grants Trail','Ste 3','Bowling Green','KY','45459'),

('Rob','Jones','333224444','258 Grants Tr.',NULL,'Bowling Green','KY','45459'),

('R.','Jones','333224444','258 Grants Tr.','Ste 3','Bowling Green','KY','45459'),

('Louis','Armstrong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

('L.','Armstrng','258321478','123 Humming Bird Lane',NULL,'Jacksonville','FL','32207'),

('Louie','Armstrong','258321478','123 Humming Bird Lain',NULL,'Jacksonville','FL','32209'),

('Louis','Armstong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33304'),

('M.','Adams','321456987','555 Bay Meadows Dr.','# 42','Ashvlle','NC','33304'),

('Mark','Adams','321456987','555 Baymeadows Drive','Unit 42','Ashville','NC','33305'),

('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33306'),

('Bob','Jones','555229999','4227 Some Place Ln.','#3','Bowling Green','KY','45459'),

('Bob','Jones','555229999','4227 Some Place Ln.','# 3','Bowling Green','KY','45459'),

('Bob','Jones','555292999','4227 SomePlace Ln','Unit 3','Bowling Green','KY','45459'),

('Bob','Jones','555292999','4227 Some Place Ln.','Ste 3','Bowling Green','KY','45459'),

('Louis','Armstrong','147852369','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

('Louis','Armstrong','147852369','123 Baldwin Ave',NULL,'Jacksonville','FL','32209'),

('Louis','Armstrong','147852369','123 Baldwin Anenue',NULL,'Jachsonville','FL','32209'),

('Louis','Armstrong','147852639','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

('Mark','Adam','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

('Mark','Adams','654789231','524 Main St.','Unit 42','Barkersville','NC','33304'),

('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304');

-- Take a look at the base data.

SELECT * FROM #People p;

----------------------------------------

This next section works essentially on the same "triangular join" principal that other's have posted.

The real key here is the assigning of weights. There will be some trial and error here, as you discover that certain attributes are stronger indicators of identity than others.

As it's posted, I kept it very simple, looking only at exact matches. In real life, I've gone as far as splitting the SSN into individual characters and matching the individual characters (giving max points to perfect matches and reduced points for partial matches to allow for transposed characters)... and removing all non-alphanumeric characters from names & addresses to help reduce the impact of typos. Anyway, you get the point... You can make it a simple or complex as you'd like...

WITH ColumnScores AS (-- Assign weights to the various attributes

SELECT

p1.PersonID AS ParentID,

p2.PersonID AS ChildID,

CASE WHEN p1.FirstName = p2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

CASE WHEN p1.LastName = p2.LastName THEN 2 ELSE 0 END AS LastNameScore,

CASE WHEN p1.SSN = p2.SSN THEN 6 ELSE 0 END AS SSN_Score,

CASE WHEN p1.Address_1 = p2.Address_1 THEN 2 ELSE 0 END AS Add1_Score,

CASE WHEN COALESCE(p1.Address_2, '') = COALESCE(p2.Address_2, '') THEN 1 ELSE 0 END AS Add2_Score,

CASE WHEN p1.City = p2.City THEN 1 ELSE 0 END AS CityScore,

CASE WHEN p1.State = p2.State THEN 1 ELSE 0 END AS StateScore,

CASE WHEN p1.Zip = p2.Zip THEN 1 ELSE 0 END AS ZipScore

FROM

#People p1

JOIN #People p2

ON p1.PersonID < p2.PersonID

)

INSERT #MatchScores (ParentID,ChildID,FirstNameScore,LastNameScore,SSN_Score,

Add1_Score,Add2_Score,CityScore,StateScore,ZipScore,TotalScore)

SELECT

cs.ParentID,

cs.ChildID,

cs.FirstNameScore,

cs.LastNameScore,

cs.SSN_Score,

cs.Add1_Score,

cs.Add2_Score,

cs.CityScore,

cs.StateScore,

cs.ZipScore,

cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore AS TotalScore

FROM

ColumnScores cs

WHERE

cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore > 7 -- Set a threashold for total match quality

ORDER BY

cs.ParentID;

Here is what the data in #MatchScores should look like at this point...

ParentID ChildID FirstNameScore LastNameScore SSN_Score Add1_Score Add2_Score CityScore StateScore ZipScore TotalScore

----------- ----------- -------------- ------------- ----------- ----------- ----------- ----------- ----------- ----------- -----------

1 2 0 2 6 2 0 1 1 1 13

1 3 0 2 6 0 0 1 1 1 11

1 4 0 2 6 0 0 1 1 1 11

2 3 0 2 6 0 0 1 1 1 11

2 4 0 2 6 0 1 1 1 1 12

3 4 0 2 6 2 0 1 1 1 13

5 6 0 0 6 0 1 1 1 0 9

5 7 0 2 6 0 1 1 1 1 12

5 8 1 0 6 2 1 1 1 1 13

6 7 0 0 6 0 1 1 1 0 9

6 8 0 0 6 0 1 1 1 0 9

7 8 0 0 6 0 1 1 1 1 10

9 10 0 2 6 0 0 0 1 1 10

9 11 1 2 6 0 1 1 1 0 12

9 12 1 2 6 2 1 1 1 0 14

10 11 0 2 6 0 0 0 1 0 9

10 12 0 2 6 0 0 0 1 0 9

11 12 1 2 6 0 1 1 1 0 12

13 14 1 2 6 2 0 1 1 1 14

13 16 1 2 0 2 0 1 1 1 8

14 16 1 2 0 2 0 1 1 1 8

15 16 1 2 6 0 0 1 1 1 12

17 18 1 2 6 0 1 1 1 1 13

17 19 1 2 6 0 1 0 1 1 12

17 20 1 2 0 2 1 1 1 1 9

18 19 1 2 6 0 1 0 1 1 12

21 22 1 0 6 2 1 1 1 1 13

21 23 1 2 0 2 1 0 1 1 8

21 24 1 2 6 2 1 1 1 1 15

22 24 1 0 6 2 1 1 1 1 13

23 24 1 2 0 2 1 0 1 1 8

Note that the original 24 rows of test data has become 31 rows AND all of the matches have a score of 8 or better...

Lowering the threshold, would have included more rows and raising it would include fewer rows.

My choice to use ">7" was fairly arbitrary. My suggestion would be to start with a lower threshold and raise it gradually until you agree with the final grouping results.

My next post will cover the rest of the script... To be continued...