How can I group near duplicate records under a new common field?

Question

How can I group near duplicate records under a new common field?

craig.bobchin

SSCrazy

Points: 2298
More actions
August 26, 2015 at 4:55 pm

#311114

Hi all,
I've inherited a table of members that has the following structure:
CREATE TABLE [dbo].[dimMember](
[dimMemberId] [int] IDENTITY(1,1) NOT NULL,
[dimSourceSystemId] [int] NOT NULL CONSTRAINT [DF_dimMember_dimSourceSystemId] DEFAULT ((-1)),
[MemberCode] [nvarchar](50) NOT NULL,
[FirstName] [nvarchar](250) NOT NULL,
[LastName] [nvarchar](250) NOT NULL,
[MiddleInitial] [nvarchar](5) NOT NULL,
[dimDayidDOB] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayidDOB] DEFAULT ((-1)),
[dimDayIdDeath] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdDeath] DEFAULT ((99991231)),
[Gender] [nvarchar](10) NOT NULL,
[SubcriberRelationship] [nvarchar](50) NOT NULL,
[SSN] [nvarchar](25) NOT NULL CONSTRAINT [DF_dimMember_SSN] DEFAULT (N'n/a'),
[ExecutionGUID] [uniqueidentifier] NOT NULL,
[RowRecordedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowRecordedDate] DEFAULT (getdate()),
[isCurrent] [bit] NOT NULL CONSTRAINT [DF_dimMember_isCurrent] DEFAULT ((1)),
[SubscriberCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_SubscriberId] DEFAULT (N'n/a'),
[RowUpdatedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowUpdatedDate] DEFAULT (getdate()),
[dimDayIdSubscriberStart] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberStart] DEFAULT ((1)),
[dimDayIdSubscriberEnd] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberEnd] DEFAULT ((99991231)),
[Race] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Race] DEFAULT (N'n/a'),
[Ethnicity] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Ethnicity] DEFAULT (N'n/a'),
[PrimaryLanguage] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_PrimaryLanguage] DEFAULT (N'n/a'),
[CurrentAddress1] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress] DEFAULT (N'n/a'),
[CurrentAddress2] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress2] DEFAULT (N'n/a'),
[CurrentZipCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_CurrentZipCode] DEFAULT (N'n/a'),
CONSTRAINT [PK_dimMember] PRIMARY KEY CLUSTERED
Based on the way the data loads into the table there's a possibility of some records being near duplicates of each other. For example, we can have a member that has records that have the same first name, last name, SSN, but different addresses, membercodes, subscribercode etc... This can happen in pretty much any variation thereof.
What I want to do, is add a new column and use that to group the similar records under based on comparing on several columns. By this I mean, if a member matches 4 of the 7 values below with another member, we would group these:
First Name (1st 3 characters)
Last Name
DOB
CurrentAddress1
MemberCode
SSN
SubscriberCode
I'm at a loss of how to structure the SQL to update the new column in the table. Any help is appreciated.

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 1

Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

craig.bobchin SSCrazy Points: 2298 More actions · Answer 2

Jeff Moden (8/26/2015)
Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.

You're preaching to the choir on that. The team I work on is all pretty new to this company (all joined in the last 6 months) and we all see this as an issue. We didn't have a say in the table creation and are trying our best to fix the existing problems as quickly as we can. That is high on the list of things to fix.

The main issue we have, is that while we can encrypt the SSN on our reporting database, the OLTP system we get the data (written and supported by a 3rd party vendor) from does not encrypt the SSN on their end.

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 3

craig.bobchin (8/26/2015)
Jeff Moden (8/26/2015)
Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.
You're preaching to the choir on that. The team I work on is all pretty new to this company (all joined in the last 6 months) and we all see this as an issue. We didn't have a say in the table creation and are trying our best to fix the existing problems as quickly as we can. That is high on the list of things to fix.
The main issue we have, is that while we can encrypt the SSN on our reporting database, the OLTP system we get the data (written and supported by a 3rd party vendor) from does not encrypt the SSN on their end.

Been there and done that. No fun at all yet lots of fun trying to fix what a whole 'nuther team managed to mess up. And, yeah... 3rd party vendors are a joy, aren't they?

Getting back to your problem... I see that the SSN column is NOT NULL but does have a default. For those that do have an SSN, I'd base the near duplicates on SSN only to determine the near duplicates.

As for the rest of them and the 4 of 7 grouping, I did such a similar thing a couple of years ago and it worked very quickly. I have to see if I can find the code in my archives because it was the only time I had to do such a thing so far. IIRC, it used cascading CTEs but I don't remember the particulars.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 4

I found the code I was looking for and, unfortunately, it solves for a problem quite different that I thought I remembered and doesn't come close to solving this problem.

However, I do have an idea on how to solve this one using a CTE that contains a CROSS JOIN and 6 additive case statements with an outer query that looks for a count of 4 or more and does a group by one of the IDs. I need to test it. How many rows does your original table have in it?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 5

I've done something very similar but it far from fast, especially on larger sets.

The following is a simplified version of what I did in the past but it does illustrate the premise.

(I apologize in advance for the lack of better comments...)

-- Create the necessary tables

IF OBJECT_ID('tempdb..#People') IS NOT NULL

DROP TABLE #People;

CREATE TABLE #People (

PersonID INT IDENTITY(1,1),

FirstName VARCHAR(20),

LastName VARCHAR(20),

SSN VARCHAR(9),

Address_1 VARCHAR(30),

Address_2 VARCHAR(30),

City VARCHAR(30),

[State] CHAR(2),

Zip CHAR(5),

GroupID INT,

MatchQuality INT

);

IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

DROP TABLE #MatchScores;

CREATE TABLE #MatchScores (

ParentID INT,

ChildID INT,

FirstNameScore INT,

LastNameScore INT,

SSN_Score INT,

Add1_Score INT,

Add2_Score INT,

CityScore INT,

StateScore INT,

ZipScore INT,

TotalScore INT

);

IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

DROP TABLE #BestMatches;

CREATE TABLE #BestMatches (

ParentID INT,

ChildID INT,

BestScore INT,

Lft INT,

Rgt INT

);

INSERT #People (FirstName,LastName,SSN,Address_1,Address_2,City,State,Zip) VALUES

('Bob','Jones','333224444','258 Grants Trail','#3','Bowling Green','KY','45459'),

('Robert','Jones','333224444','258 Grants Trail','Ste 3','Bowling Green','KY','45459'),

('Rob','Jones','333224444','258 Grants Tr.',NULL,'Bowling Green','KY','45459'),

('R.','Jones','333224444','258 Grants Tr.','Ste 3','Bowling Green','KY','45459'),

('Louis','Armstrong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

('L.','Armstrng','258321478','123 Humming Bird Lane',NULL,'Jacksonville','FL','32207'),

('Louie','Armstrong','258321478','123 Humming Bird Lain',NULL,'Jacksonville','FL','32209'),

('Louis','Armstong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33304'),

('M.','Adams','321456987','555 Bay Meadows Dr.','# 42','Ashvlle','NC','33304'),

('Mark','Adams','321456987','555 Baymeadows Drive','Unit 42','Ashville','NC','33305'),

('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33306'),

('Bob','Jones','555229999','4227 Some Place Ln.','#3','Bowling Green','KY','45459'),

('Bob','Jones','555229999','4227 Some Place Ln.','# 3','Bowling Green','KY','45459'),

('Bob','Jones','555292999','4227 SomePlace Ln','Unit 3','Bowling Green','KY','45459'),

('Bob','Jones','555292999','4227 Some Place Ln.','Ste 3','Bowling Green','KY','45459'),

('Louis','Armstrong','147852369','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

('Louis','Armstrong','147852369','123 Baldwin Ave',NULL,'Jacksonville','FL','32209'),

('Louis','Armstrong','147852369','123 Baldwin Anenue',NULL,'Jachsonville','FL','32209'),

('Louis','Armstrong','147852639','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

('Mark','Adam','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

('Mark','Adams','654789231','524 Main St.','Unit 42','Barkersville','NC','33304'),

('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304');

-- Take a look at the base data.

SELECT * FROM #People p;

WITH ColumnScores AS (-- Assign weights to the various attributes

SELECT

p1.PersonID AS ParentID,

p2.PersonID AS ChildID,

CASE WHEN p1.FirstName = p2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

CASE WHEN p1.LastName = p2.LastName THEN 2 ELSE 0 END AS LastNameScore,

CASE WHEN p1.SSN = p2.SSN THEN 6 ELSE 0 END AS SSN_Score,

CASE WHEN p1.Address_1 = p2.Address_1 THEN 2 ELSE 0 END AS Add1_Score,

CASE WHEN COALESCE(p1.Address_2, '') = COALESCE(p2.Address_2, '') THEN 1 ELSE 0 END AS Add2_Score,

CASE WHEN p1.City = p2.City THEN 1 ELSE 0 END AS CityScore,

CASE WHEN p1.State = p2.State THEN 1 ELSE 0 END AS StateScore,

CASE WHEN p1.Zip = p2.Zip THEN 1 ELSE 0 END AS ZipScore

FROM

#People p1

JOIN #People p2

ON p1.PersonID < p2.PersonID

)

INSERT #MatchScores (ParentID,ChildID,FirstNameScore,LastNameScore,SSN_Score,

Add1_Score,Add2_Score,CityScore,StateScore,ZipScore,TotalScore)

SELECT

cs.ParentID,

cs.ChildID,

cs.FirstNameScore,

cs.LastNameScore,

cs.SSN_Score,

cs.Add1_Score,

cs.Add2_Score,

cs.CityScore,

cs.StateScore,

cs.ZipScore,

cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore AS TotalScore

FROM

ColumnScores cs

WHERE

cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore > 7 -- Set a threashold for total match quality

ORDER BY

cs.ParentID;

--===============================================================

INSERT #BestMatches (ParentID,ChildID,BestScore)

SELECT

bs.ParentID,

bs.ChildID,

bs.TotalScore AS BestScore

FROM

(SELECT DISTINCT ms1.ChildID

FROM #MatchScores ms1) c

CROSS APPLY (

SELECT TOP 1

*

FROM #MatchScores ms2

WHERE c.ChildID = ms2.ChildID

ORDER BY ms2.TotalScore DESC

) bs;

INSERT #BestMatches (ParentID,ChildID,BestScore)

SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)

UNION ALL

SELECT DISTINCT

0 AS ParentID,

ms.ParentID AS ChildID,

100 AS BestScore

FROM #MatchScores ms

WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE ms.ParentID = bm.ChildID);

--==========================================================================

-- Use nested sets to create the groups

EXEC dbo.CreateNestedSets-- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

@TableName = '#BestMatches',

@ChildColName = 'ChildID',

@ParentColName = 'ParentID',

@LeftColName = 'Lft',

@RightColName = 'Rgt',

@DeBug = 0;

DELETE #BestMatches WHERE ParentID IS NULL;

--==========================================================================

UPDATE p SET

p.GroupID = bm.GroupID,

p.MatchQuality = bm.BestScore

FROM

#People p

JOIN (

SELECT

Groups.ChildID AS GroupID,

bm2.ChildID,

bm2.BestScore

FROM (

SELECT

bm1.ChildID,

bm1.Lft,

bm1.Rgt

FROM

#BestMatches bm1

WHERE

bm1.ParentID = 0

) Groups

JOIN #BestMatches bm2

ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt

) bm

ON p.PersonID = bm.ChildID;

SELECT * FROM #People p;

The results...

PersonID FirstName LastName SSN Address_1 Address_2 City State Zip GroupID MatchQuality

----------- -------------------- -------------------- --------- ------------------------------ ------------------------------ ------------------------------ ----- ----- ----------- ------------

1 Bob Jones 333224444 258 Grants Trail #3 Bowling Green KY 45459 1 100

2 Robert Jones 333224444 258 Grants Trail Ste 3 Bowling Green KY 45459 1 13

3 Rob Jones 333224444 258 Grants Tr. NULL Bowling Green KY 45459 1 11

4 R. Jones 333224444 258 Grants Tr. Ste 3 Bowling Green KY 45459 1 13

5 Louis Armstrong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 100

6 L. Armstrng 258321478 123 Humming Bird Lane NULL Jacksonville FL 32207 5 9

7 Louie Armstrong 258321478 123 Humming Bird Lain NULL Jacksonville FL 32209 5 12

8 Louis Armstong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 13

9 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33304 9 100

10 M. Adams 321456987 555 Bay Meadows Dr. # 42 Ashvlle NC 33304 9 10

11 Mark Adams 321456987 555 Baymeadows Drive Unit 42 Ashville NC 33305 9 12

12 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33306 9 14

13 Bob Jones 555229999 4227 Some Place Ln. #3 Bowling Green KY 45459 13 100

14 Bob Jones 555229999 4227 Some Place Ln. # 3 Bowling Green KY 45459 13 14

15 Bob Jones 555292999 4227 SomePlace Ln Unit 3 Bowling Green KY 45459 15 100

16 Bob Jones 555292999 4227 Some Place Ln. Ste 3 Bowling Green KY 45459 15 12

17 Louis Armstrong 147852369 123 Baldwin Ave. NULL Jacksonville FL 32209 17 100

18 Louis Armstrong 147852369 123 Baldwin Ave NULL Jacksonville FL 32209 17 13

19 Louis Armstrong 147852369 123 Baldwin Anenue NULL Jachsonville FL 32209 17 12

20 Louis Armstrong 147852639 123 Baldwin Ave. NULL Jacksonville FL 32209 17 9

21 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 100

22 Mark Adam 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 13

23 Mark Adams 654789231 524 Main St. Unit 42 Barkersville NC 33304 21 8

24 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 15

Note... Setting the weights is the important part. We spent a lot of time tweaking the various values and the total threshold to get the best possible matches based on our own data...

The proc that I use to set the left and right bowers is attached a .txt file.

HTH,

Jason

craig.bobchin SSCrazy Points: 2298 More actions · Answer 6

Thanks. That looks very similar to what I'm looking for. I'll give it a try.

Another though I had was doing a fuzzy lookup in SSIS when I load the table. I'll try both ways and see which works best.

craig.bobchin SSCrazy Points: 2298 More actions · Answer 7

Jeff Moden (8/26/2015)
I found the code I was looking for and, unfortunately, it solves for a problem quite different that I thought I remembered and doesn't come close to solving this problem.
However, I do have an idea on how to solve this one using a CTE that contains a CROSS JOIN and 6 additive case statements with an outer query that looks for a count of 4 or more and does a group by one of the IDs. I need to test it. How many rows does your original table have in it?

We only have about 30k records, so not an overwhelming about of records.

ScottPletcher SSC Guru Points: 100947 More actions · Answer 8

To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.

SELECT m1.dimMemberId, m2.dimMemberId

FROM dimMember m1

INNER JOIN dimMember m2 ON

m1.dimMemberId < m2.dimMemberId AND

CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +

CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +

CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +

CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +

CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +

CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +

CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END

>= 4

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

craig.bobchin SSCrazy Points: 2298 More actions · Answer 9

ScottPletcher (8/27/2015)
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4

This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 10

craig.bobchin (8/27/2015)
ScottPletcher (8/27/2015)
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4
This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.

Scott did it pretty much as I described including being able to eliminate roughly half the work by using a Triangular Join instead of a Full (Square) Cartesian Join.

The question now is, what do you mean by a "common value"?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

craig.bobchin SSCrazy Points: 2298 More actions · Answer 11

Jeff Moden (8/27/2015)
craig.bobchin (8/27/2015)
ScottPletcher (8/27/2015)
To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.
SELECT m1.dimMemberId, m2.dimMemberId
FROM dimMember m1
INNER JOIN dimMember m2 ON
m1.dimMemberId < m2.dimMemberId AND
CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +
CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +
CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +
CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +
CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +
CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +
CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END
>= 4
This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.
Scott did it pretty much as I described including being able to eliminate roughly half the work by using a Triangular Join instead of a Full (Square) Cartesian Join.
The question now is, what do you mean by a "common value"?

By Common Value I mean the following:

This is a real example of the data I'm looking at the data is changed but the N/As are correct:

MemberIDSubscriberCodeMembercodeSSNFirstnamelastnameAddress

123U12345N/A999-999-9999JohnDoe123 main St Los Angles CA

124U12346NAN/AJohnDoe123 main St Los Angles CA

5764U12346x0987N/AJohnDoeN/A

1126U12348x0987999-999-9999n/an/an/a

12n/aNA999-999-9999JohnDoe427 main St Los Angles CA

The Common Value would be a new Unique ID for that group of records (normal Single records would also get a unique one)

MemberIDSubscriberCodeMembercodeSSNFirstnamelastnameAddressNewID (Common Value)

123U12345N/A999-999-9999JohnDoe123 main St Los Angles CACH1234

124U12346NAN/AJohnDoe123 main St Los Angles CACH1234

5764U12346x0987N/AJohnDoeN/ACH1234

1126U12348x0987999-999-9999n/an/an/aCH1234

12n/aNA999-999-9999JohnDoe427 main St Los Angles CACH1234

3254H12342345111-111-1112JaneSmithn/aCH2398

The last record in this is a regular record and has it's own unique NewID(Common Value) while in the group above, all the associated records have the same NewID (Common Value)

I had trouble formatting the data into a table, but that is the idea.

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 12

The code I posted earlier shows how to handle the "how to assign a common value to a new field based on the matches" problem.

The idea is to find the "best" single parent/child match that exists for any given "ChildID". (Note: every row is a "child" of every preceding row and every row is a "parent" of every row that follows.) It's just a matter of finding the best parent for each child.

The rows that don't have parents (there are no rows with, the PersonID in the ChildID position, with a score high enough to meet the match threshold) end up being the "GroupID" rows. Simply put, they are the 1st occurrence of a unique individual.

This causes a natural adjacency list type hierarchy, which can be converted to a nested set hierarchy. (Easily accomplished thanks to Jeff Moden[/url])

Once the Left & Right values are updated, they are used to define a "Group".

For example...

PersonID = 1 isn't the child of any other rows so it becomes the start of a group.

PersonID = 2 matches to PersonID = 1 with a score of 13 (above the threshold) so it will be a part of GroupID = 1

PersonID = 3 is matched equally to both 1 & 2 with a score of 11 (again above the threshold) making it part of Group 1. (note... It doesn't matter which of the "parents" is selected. They are equally good.)

PersonID = 4 is matched to 1,2,&3 but the best match is to 3 with a score of 13 (above the threshold) again making it part of Group 1.

PersonID = 5 is matched to 1,2,3&4 but none of those scores are high enough to meet the threshold, so it becomes the 1st member of Group 5.

... and so on...

Applying the Lft & Rht values looks something like this...

ParentIDChildIDBestScoreLftRgt

0110029

121334

131158

341367

051001017

Everything between 2 & 9 is part of Group 1

Everything between 10 & 17 is part of group 5...

Jayanth_Kurup SSC-Insane Points: 22967 More actions · Answer 13

I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

Check out this video [/url]

All you might need is to schedule the package to run frequently.

Jayanth Kurup[/url]

craig.bobchin SSCrazy Points: 2298 More actions · Answer 14

Jayanth_Kurup (8/27/2015)
I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.
Check out this video [/url]
All you might need is to schedule the package to run frequently.

Yes I'm looking at Fuzzy Grouping in SSIS as well. In fact that was my first idea. One of my co-workers was thinking an SQL Stored Proc might be a better alternative so I'm investigating both. I do have a package built, but I'm stuck on how to assign the new column to the groupings. I've not had a need to use either fuzzy grouping and aggregate tasks in SSIS before this.

I'll take a look at the video You suggested and see if it answers my questions.