How can I group near duplicate records under a new common field?

  • Hi all,

    I've inherited a table of members that has the following structure:

    CREATE TABLE [dbo].[dimMember](

    [dimMemberId] [int] IDENTITY(1,1) NOT NULL,

    [dimSourceSystemId] [int] NOT NULL CONSTRAINT [DF_dimMember_dimSourceSystemId] DEFAULT ((-1)),

    [MemberCode] [nvarchar](50) NOT NULL,

    [FirstName] [nvarchar](250) NOT NULL,

    [LastName] [nvarchar](250) NOT NULL,

    [MiddleInitial] [nvarchar](5) NOT NULL,

    [dimDayidDOB] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayidDOB] DEFAULT ((-1)),

    [dimDayIdDeath] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdDeath] DEFAULT ((99991231)),

    [Gender] [nvarchar](10) NOT NULL,

    [SubcriberRelationship] [nvarchar](50) NOT NULL,

    [SSN] [nvarchar](25) NOT NULL CONSTRAINT [DF_dimMember_SSN] DEFAULT (N'n/a'),

    [ExecutionGUID] [uniqueidentifier] NOT NULL,

    [RowRecordedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowRecordedDate] DEFAULT (getdate()),

    [isCurrent] [bit] NOT NULL CONSTRAINT [DF_dimMember_isCurrent] DEFAULT ((1)),

    [SubscriberCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_SubscriberId] DEFAULT (N'n/a'),

    [RowUpdatedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowUpdatedDate] DEFAULT (getdate()),

    [dimDayIdSubscriberStart] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberStart] DEFAULT ((1)),

    [dimDayIdSubscriberEnd] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberEnd] DEFAULT ((99991231)),

    [Race] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Race] DEFAULT (N'n/a'),

    [Ethnicity] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Ethnicity] DEFAULT (N'n/a'),

    [PrimaryLanguage] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_PrimaryLanguage] DEFAULT (N'n/a'),

    [CurrentAddress1] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress] DEFAULT (N'n/a'),

    [CurrentAddress2] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress2] DEFAULT (N'n/a'),

    [CurrentZipCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_CurrentZipCode] DEFAULT (N'n/a'),

    CONSTRAINT [PK_dimMember] PRIMARY KEY CLUSTERED

    Based on the way the data loads into the table there's a possibility of some records being near duplicates of each other. For example, we can have a member that has records that have the same first name, last name, SSN, but different addresses, membercodes, subscribercode etc... This can happen in pretty much any variation thereof.

    What I want to do, is add a new column and use that to group the similar records under based on comparing on several columns. By this I mean, if a member matches 4 of the 7 values below with another member, we would group these:

    First Name (1st 3 characters)

    Last Name

    DOB

    CurrentAddress1

    MemberCode

    SSN

    SubscriberCode

    I'm at a loss of how to structure the SQL to update the new column in the table. Any help is appreciated.

  • Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/26/2015)


    Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.

    You're preaching to the choir on that. The team I work on is all pretty new to this company (all joined in the last 6 months) and we all see this as an issue. We didn't have a say in the table creation and are trying our best to fix the existing problems as quickly as we can. That is high on the list of things to fix.

    The main issue we have, is that while we can encrypt the SSN on our reporting database, the OLTP system we get the data (written and supported by a 3rd party vendor) from does not encrypt the SSN on their end.

  • craig.bobchin (8/26/2015)


    Jeff Moden (8/26/2015)


    Off topic a bit, I'm not sure why anyone in the right mind would store an SSN in a Unicode data type and, unless the fact that it's encrypted had just not been posted, why anyone if the right mind would store unencrypted SSNs. To be absolutely blunt, you have all the PII that hackers need in one table including unencrypted SSNs and if you get broken into and that information gets away from the company, the company can be sued into oblivion. If anyone in your company thinks (obvious play on words there) that it's ok to store SSNs in clear text, then ask them to enter all of their information, including their real SSN, into the table as a sign of good faith. I'll just bet no one takes you up on that and for good reason. It's not right and it's not safe.

    You're preaching to the choir on that. The team I work on is all pretty new to this company (all joined in the last 6 months) and we all see this as an issue. We didn't have a say in the table creation and are trying our best to fix the existing problems as quickly as we can. That is high on the list of things to fix.

    The main issue we have, is that while we can encrypt the SSN on our reporting database, the OLTP system we get the data (written and supported by a 3rd party vendor) from does not encrypt the SSN on their end.

    Been there and done that. No fun at all yet lots of fun trying to fix what a whole 'nuther team managed to mess up. And, yeah... 3rd party vendors are a joy, aren't they?

    Getting back to your problem... I see that the SSN column is NOT NULL but does have a default. For those that do have an SSN, I'd base the near duplicates on SSN only to determine the near duplicates.

    As for the rest of them and the 4 of 7 grouping, I did such a similar thing a couple of years ago and it worked very quickly. I have to see if I can find the code in my archives because it was the only time I had to do such a thing so far. IIRC, it used cascading CTEs but I don't remember the particulars.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I found the code I was looking for and, unfortunately, it solves for a problem quite different that I thought I remembered and doesn't come close to solving this problem.

    However, I do have an idea on how to solve this one using a CTE that contains a CROSS JOIN and 6 additive case statements with an outer query that looks for a count of 4 or more and does a group by one of the IDs. I need to test it. How many rows does your original table have in it?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've done something very similar but it far from fast, especially on larger sets.

    The following is a simplified version of what I did in the past but it does illustrate the premise.

    (I apologize in advance for the lack of better comments...)

    -- Create the necessary tables

    IF OBJECT_ID('tempdb..#People') IS NOT NULL

    DROP TABLE #People;

    CREATE TABLE #People (

    PersonID INT IDENTITY(1,1),

    FirstName VARCHAR(20),

    LastName VARCHAR(20),

    SSN VARCHAR(9),

    Address_1 VARCHAR(30),

    Address_2 VARCHAR(30),

    City VARCHAR(30),

    [State] CHAR(2),

    Zip CHAR(5),

    GroupID INT,

    MatchQuality INT

    );

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    FirstNameScore INT,

    LastNameScore INT,

    SSN_Score INT,

    Add1_Score INT,

    Add2_Score INT,

    CityScore INT,

    StateScore INT,

    ZipScore INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    INSERT #People (FirstName,LastName,SSN,Address_1,Address_2,City,State,Zip) VALUES

    ('Bob','Jones','333224444','258 Grants Trail','#3','Bowling Green','KY','45459'),

    ('Robert','Jones','333224444','258 Grants Trail','Ste 3','Bowling Green','KY','45459'),

    ('Rob','Jones','333224444','258 Grants Tr.',NULL,'Bowling Green','KY','45459'),

    ('R.','Jones','333224444','258 Grants Tr.','Ste 3','Bowling Green','KY','45459'),

    ('Louis','Armstrong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

    ('L.','Armstrng','258321478','123 Humming Bird Lane',NULL,'Jacksonville','FL','32207'),

    ('Louie','Armstrong','258321478','123 Humming Bird Lain',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

    ('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33304'),

    ('M.','Adams','321456987','555 Bay Meadows Dr.','# 42','Ashvlle','NC','33304'),

    ('Mark','Adams','321456987','555 Baymeadows Drive','Unit 42','Ashville','NC','33305'),

    ('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33306'),

    ('Bob','Jones','555229999','4227 Some Place Ln.','#3','Bowling Green','KY','45459'),

    ('Bob','Jones','555229999','4227 Some Place Ln.','# 3','Bowling Green','KY','45459'),

    ('Bob','Jones','555292999','4227 SomePlace Ln','Unit 3','Bowling Green','KY','45459'),

    ('Bob','Jones','555292999','4227 Some Place Ln.','Ste 3','Bowling Green','KY','45459'),

    ('Louis','Armstrong','147852369','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstrong','147852369','123 Baldwin Ave',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstrong','147852369','123 Baldwin Anenue',NULL,'Jachsonville','FL','32209'),

    ('Louis','Armstrong','147852639','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

    ('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

    ('Mark','Adam','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

    ('Mark','Adams','654789231','524 Main St.','Unit 42','Barkersville','NC','33304'),

    ('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304');

    -- Take a look at the base data.

    SELECT * FROM #People p;

    WITH ColumnScores AS (-- Assign weights to the various attributes

    SELECT

    p1.PersonID AS ParentID,

    p2.PersonID AS ChildID,

    CASE WHEN p1.FirstName = p2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN p1.LastName = p2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN p1.SSN = p2.SSN THEN 6 ELSE 0 END AS SSN_Score,

    CASE WHEN p1.Address_1 = p2.Address_1 THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN COALESCE(p1.Address_2, '') = COALESCE(p2.Address_2, '') THEN 1 ELSE 0 END AS Add2_Score,

    CASE WHEN p1.City = p2.City THEN 1 ELSE 0 END AS CityScore,

    CASE WHEN p1.State = p2.State THEN 1 ELSE 0 END AS StateScore,

    CASE WHEN p1.Zip = p2.Zip THEN 1 ELSE 0 END AS ZipScore

    FROM

    #People p1

    JOIN #People p2

    ON p1.PersonID < p2.PersonID

    )

    INSERT #MatchScores (ParentID,ChildID,FirstNameScore,LastNameScore,SSN_Score,

    Add1_Score,Add2_Score,CityScore,StateScore,ZipScore,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.CityScore,

    cs.StateScore,

    cs.ZipScore,

    cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

    cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

    cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore > 7 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    --===============================================================

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)

    UNION ALL

    SELECT DISTINCT

    0 AS ParentID,

    ms.ParentID AS ChildID,

    100 AS BestScore

    FROM #MatchScores ms

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE ms.ParentID = bm.ChildID);

    --==========================================================================

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets-- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL;

    --==========================================================================

    UPDATE p SET

    p.GroupID = bm.GroupID,

    p.MatchQuality = bm.BestScore

    FROM

    #People p

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM (

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt

    ) bm

    ON p.PersonID = bm.ChildID;

    SELECT * FROM #People p;

    The results...

    PersonID FirstName LastName SSN Address_1 Address_2 City State Zip GroupID MatchQuality

    ----------- -------------------- -------------------- --------- ------------------------------ ------------------------------ ------------------------------ ----- ----- ----------- ------------

    1 Bob Jones 333224444 258 Grants Trail #3 Bowling Green KY 45459 1 100

    2 Robert Jones 333224444 258 Grants Trail Ste 3 Bowling Green KY 45459 1 13

    3 Rob Jones 333224444 258 Grants Tr. NULL Bowling Green KY 45459 1 11

    4 R. Jones 333224444 258 Grants Tr. Ste 3 Bowling Green KY 45459 1 13

    5 Louis Armstrong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 100

    6 L. Armstrng 258321478 123 Humming Bird Lane NULL Jacksonville FL 32207 5 9

    7 Louie Armstrong 258321478 123 Humming Bird Lain NULL Jacksonville FL 32209 5 12

    8 Louis Armstong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 13

    9 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33304 9 100

    10 M. Adams 321456987 555 Bay Meadows Dr. # 42 Ashvlle NC 33304 9 10

    11 Mark Adams 321456987 555 Baymeadows Drive Unit 42 Ashville NC 33305 9 12

    12 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33306 9 14

    13 Bob Jones 555229999 4227 Some Place Ln. #3 Bowling Green KY 45459 13 100

    14 Bob Jones 555229999 4227 Some Place Ln. # 3 Bowling Green KY 45459 13 14

    15 Bob Jones 555292999 4227 SomePlace Ln Unit 3 Bowling Green KY 45459 15 100

    16 Bob Jones 555292999 4227 Some Place Ln. Ste 3 Bowling Green KY 45459 15 12

    17 Louis Armstrong 147852369 123 Baldwin Ave. NULL Jacksonville FL 32209 17 100

    18 Louis Armstrong 147852369 123 Baldwin Ave NULL Jacksonville FL 32209 17 13

    19 Louis Armstrong 147852369 123 Baldwin Anenue NULL Jachsonville FL 32209 17 12

    20 Louis Armstrong 147852639 123 Baldwin Ave. NULL Jacksonville FL 32209 17 9

    21 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 100

    22 Mark Adam 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 13

    23 Mark Adams 654789231 524 Main St. Unit 42 Barkersville NC 33304 21 8

    24 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 15

    Note... Setting the weights is the important part. We spent a lot of time tweaking the various values and the total threshold to get the best possible matches based on our own data...

    The proc that I use to set the left and right bowers is attached a .txt file.

    HTH,

    Jason

  • Thanks. That looks very similar to what I'm looking for. I'll give it a try.

    Another though I had was doing a fuzzy lookup in SSIS when I load the table. I'll try both ways and see which works best.

  • Jeff Moden (8/26/2015)


    I found the code I was looking for and, unfortunately, it solves for a problem quite different that I thought I remembered and doesn't come close to solving this problem.

    However, I do have an idea on how to solve this one using a CTE that contains a CROSS JOIN and 6 additive case statements with an outer query that looks for a count of 4 or more and does a group by one of the IDs. I need to test it. How many rows does your original table have in it?

    We only have about 30k records, so not an overwhelming about of records.

  • To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.

    SELECT m1.dimMemberId, m2.dimMemberId

    FROM dimMember m1

    INNER JOIN dimMember m2 ON

    m1.dimMemberId < m2.dimMemberId AND

    CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +

    CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +

    CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +

    CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +

    CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +

    CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +

    CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END

    >= 4

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (8/27/2015)


    To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.

    SELECT m1.dimMemberId, m2.dimMemberId

    FROM dimMember m1

    INNER JOIN dimMember m2 ON

    m1.dimMemberId < m2.dimMemberId AND

    CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +

    CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +

    CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +

    CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +

    CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +

    CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +

    CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END

    >= 4

    This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.

  • craig.bobchin (8/27/2015)


    ScottPletcher (8/27/2015)


    To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.

    SELECT m1.dimMemberId, m2.dimMemberId

    FROM dimMember m1

    INNER JOIN dimMember m2 ON

    m1.dimMemberId < m2.dimMemberId AND

    CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +

    CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +

    CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +

    CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +

    CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +

    CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +

    CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END

    >= 4

    This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.

    Scott did it pretty much as I described including being able to eliminate roughly half the work by using a Triangular Join instead of a Full (Square) Cartesian Join.

    The question now is, what do you mean by a "common value"?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (8/27/2015)


    craig.bobchin (8/27/2015)


    ScottPletcher (8/27/2015)


    To me it seems fairly easy to match single rows. Maybe you can just output those matches to a table and then consolidate the like ids using that result.

    SELECT m1.dimMemberId, m2.dimMemberId

    FROM dimMember m1

    INNER JOIN dimMember m2 ON

    m1.dimMemberId < m2.dimMemberId AND

    CASE WHEN LEFT(m1.FirstName, 3) = LEFT(m2.FirstName, 3) THEN 1 ELSE 0 END +

    CASE WHEN m1.LastName = m2.LastName THEN 1 ELSE 0 END +

    CASE WHEN m1.DOB = m2.DOB THEN 1 ELSE 0 END +

    CASE WHEN m1.CurrentAddress1 = m2.CurrentAddress1 THEN 1 ELSE 0 END +

    CASE WHEN m1.MemberCode = m2.MemberCode THEN 1 ELSE 0 END +

    CASE WHEN m1.SSN = m2.SSN THEN 1 ELSE 0 END +

    CASE WHEN m1.SubscriberCode = m2.SubscriberCode THEN 1 ELSE 0 END

    >= 4

    This is pretty close to what I'm looking for. I added your join login in the select as a new column called Score. Now I just need to figure out how to assign a common value to a new field based on the matches.

    Scott did it pretty much as I described including being able to eliminate roughly half the work by using a Triangular Join instead of a Full (Square) Cartesian Join.

    The question now is, what do you mean by a "common value"?

    By Common Value I mean the following:

    This is a real example of the data I'm looking at the data is changed but the N/As are correct:

    MemberIDSubscriberCodeMembercodeSSNFirstnamelastnameAddress

    123U12345N/A999-999-9999JohnDoe123 main St Los Angles CA

    124U12346NAN/AJohnDoe123 main St Los Angles CA

    5764U12346x0987N/AJohnDoeN/A

    1126U12348x0987999-999-9999n/an/an/a

    12n/aNA999-999-9999JohnDoe427 main St Los Angles CA

    The Common Value would be a new Unique ID for that group of records (normal Single records would also get a unique one)

    MemberIDSubscriberCodeMembercodeSSNFirstnamelastnameAddressNewID (Common Value)

    123U12345N/A999-999-9999JohnDoe123 main St Los Angles CACH1234

    124U12346NAN/AJohnDoe123 main St Los Angles CACH1234

    5764U12346x0987N/AJohnDoeN/ACH1234

    1126U12348x0987999-999-9999n/an/an/aCH1234

    12n/aNA999-999-9999JohnDoe427 main St Los Angles CACH1234

    3254H12342345111-111-1112JaneSmithn/aCH2398

    The last record in this is a regular record and has it's own unique NewID(Common Value) while in the group above, all the associated records have the same NewID (Common Value)

    I had trouble formatting the data into a table, but that is the idea.

  • The code I posted earlier shows how to handle the "how to assign a common value to a new field based on the matches" problem.

    The idea is to find the "best" single parent/child match that exists for any given "ChildID". (Note: every row is a "child" of every preceding row and every row is a "parent" of every row that follows.) It's just a matter of finding the best parent for each child.

    The rows that don't have parents (there are no rows with, the PersonID in the ChildID position, with a score high enough to meet the match threshold) end up being the "GroupID" rows. Simply put, they are the 1st occurrence of a unique individual.

    This causes a natural adjacency list type hierarchy, which can be converted to a nested set hierarchy. (Easily accomplished thanks to Jeff Moden[/url])

    Once the Left & Right values are updated, they are used to define a "Group".

    For example...

    PersonID = 1 isn't the child of any other rows so it becomes the start of a group.

    PersonID = 2 matches to PersonID = 1 with a score of 13 (above the threshold) so it will be a part of GroupID = 1

    PersonID = 3 is matched equally to both 1 & 2 with a score of 11 (again above the threshold) making it part of Group 1. (note... It doesn't matter which of the "parents" is selected. They are equally good.)

    PersonID = 4 is matched to 1,2,&3 but the best match is to 3 with a score of 13 (above the threshold) again making it part of Group 1.

    PersonID = 5 is matched to 1,2,3&4 but none of those scores are high enough to meet the threshold, so it becomes the 1st member of Group 5.

    ... and so on...

    Applying the Lft & Rht values looks something like this...

    ParentIDChildIDBestScoreLftRgt

    0110029

    121334

    131158

    341367

    051001017

    Everything between 2 & 9 is part of Group 1

    Everything between 10 & 17 is part of group 5...

  • I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

    Check out this video [/url]

    All you might need is to schedule the package to run frequently.

    Jayanth Kurup[/url]

  • Jayanth_Kurup (8/27/2015)


    I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

    Check out this video [/url]

    All you might need is to schedule the package to run frequently.

    Yes I'm looking at Fuzzy Grouping in SSIS as well. In fact that was my first idea. One of my co-workers was thinking an SQL Stored Proc might be a better alternative so I'm investigating both. I do have a package built, but I'm stuck on how to assign the new column to the groupings. I've not had a need to use either fuzzy grouping and aggregate tasks in SSIS before this.

    I'll take a look at the video You suggested and see if it answers my questions.

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply