How can I group near duplicate records under a new common field?

  • Jayanth_Kurup (8/27/2015)


    I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

    Check out this video [/url]

    All you might need is to schedule the package to run frequently.

    Jayanth,

    I watched the video, and it does give me some more information. However how would I handle it in my situation where I have multiple columns that I need to match on and not all of them have valid nor consistent data between records?

    Do I need multiple fuzzy grouping transforms or do I do it in one?

    How is the canonical field value determined?

    How do I assign a new key unique to the to the grouping?

  • Jason A. Long (8/27/2015)


    The code I posted earlier shows how to handle the "how to assign a common value to a new field based on the matches" problem.

    The idea is to find the "best" single parent/child match that exists for any given "ChildID". (Note: every row is a "child" of every preceding row and every row is a "parent" of every row that follows.) It's just a matter of finding the best parent for each child.

    The rows that don't have parents (there are no rows with, the PersonID in the ChildID position, with a score high enough to meet the match threshold) end up being the "GroupID" rows. Simply put, they are the 1st occurrence of a unique individual.

    This causes a natural adjacency list type hierarchy, which can be converted to a nested set hierarchy. (Easily accomplished thanks to Jeff Moden[/url])

    Once the Left & Right values are updated, they are used to define a "Group".

    For example...

    PersonID = 1 isn't the child of any other rows so it becomes the start of a group.

    PersonID = 2 matches to PersonID = 1 with a score of 13 (above the threshold) so it will be a part of GroupID = 1

    PersonID = 3 is matched equally to both 1 & 2 with a score of 11 (again above the threshold) making it part of Group 1. (note... It doesn't matter which of the "parents" is selected. They are equally good.)

    PersonID = 4 is matched to 1,2,&3 but the best match is to 3 with a score of 13 (above the threshold) again making it part of Group 1.

    PersonID = 5 is matched to 1,2,3&4 but none of those scores are high enough to meet the threshold, so it becomes the 1st member of Group 5.

    ... and so on...

    Applying the Lft & Rht values looks something like this...

    ParentIDChildIDBestScoreLftRgt

    0110029

    121334

    131158

    341367

    051001017

    Everything between 2 & 9 is part of Group 1

    Everything between 10 & 17 is part of group 5...

    Jason,

    I modified your code for our table/fields and it doesn't quite work. I set the threshold to >=4 and of the 30k records in the table 8818 came back as Nulls in both the GroupID and MatchQuality fields. 19377 had a -1 in group ID and MatchQuality between 4 and 100.

    The 2000 had positive integers as GroupID with some grouping correctly under a group ID and others not grouping correctly.

    Not sure how I can tweak this to fix it.

  • craig.bobchin (8/28/2015)


    Jason,

    I modified your code for our table/fields and it doesn't quite work. I set the threshold to >=4 and of the 30k records in the table 8818 came back as Nulls in both the GroupID and MatchQuality fields. 19377 had a -1 in group ID and MatchQuality between 4 and 100.

    The 2000 had positive integers as GroupID with some grouping correctly under a group ID and others not grouping correctly.

    Not sure how I can tweak this to fix it.

    Give me a few minutes... I'll break up the original code into more digestible chunks and add better comments. Hopefully that'll help you to see where you made a wrong turn.

    If that doesn't work, it may be necessary to post your actual table structure along with some test data (just make sure you aren't posting peoples real information).

    BRB...

  • This 1st section is just setting up the test data and creating a couple of temp tables that'll be needed further down in the script.

    The 1st table, #People, will obviously be replaced by you you own production table.

    The next two temp tables (#MatchScores & #BestMatches) are just working tables... So, there should be no need to create them as permanent tables.

    -- Create the necessary tables

    IF OBJECT_ID('tempdb..#People') IS NOT NULL

    DROP TABLE #People;

    CREATE TABLE #People (

    PersonID INT IDENTITY(1,1),

    FirstName VARCHAR(20),

    LastName VARCHAR(20),

    SSN VARCHAR(9),

    Address_1 VARCHAR(30),

    Address_2 VARCHAR(30),

    City VARCHAR(30),

    [State] CHAR(2),

    Zip CHAR(5),

    GroupID INT,

    MatchQuality INT

    );

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    FirstNameScore INT,

    LastNameScore INT,

    SSN_Score INT,

    Add1_Score INT,

    Add2_Score INT,

    CityScore INT,

    StateScore INT,

    ZipScore INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    INSERT #People (FirstName,LastName,SSN,Address_1,Address_2,City,State,Zip) VALUES

    ('Bob','Jones','333224444','258 Grants Trail','#3','Bowling Green','KY','45459'),

    ('Robert','Jones','333224444','258 Grants Trail','Ste 3','Bowling Green','KY','45459'),

    ('Rob','Jones','333224444','258 Grants Tr.',NULL,'Bowling Green','KY','45459'),

    ('R.','Jones','333224444','258 Grants Tr.','Ste 3','Bowling Green','KY','45459'),

    ('Louis','Armstrong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

    ('L.','Armstrng','258321478','123 Humming Bird Lane',NULL,'Jacksonville','FL','32207'),

    ('Louie','Armstrong','258321478','123 Humming Bird Lain',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstong','258321478','123 Humming Bird Ln.',NULL,'Jacksonville','FL','32209'),

    ('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33304'),

    ('M.','Adams','321456987','555 Bay Meadows Dr.','# 42','Ashvlle','NC','33304'),

    ('Mark','Adams','321456987','555 Baymeadows Drive','Unit 42','Ashville','NC','33305'),

    ('Mark','Adams','321456987','555 Baymeadows Dr.','Unit 42','Ashville','NC','33306'),

    ('Bob','Jones','555229999','4227 Some Place Ln.','#3','Bowling Green','KY','45459'),

    ('Bob','Jones','555229999','4227 Some Place Ln.','# 3','Bowling Green','KY','45459'),

    ('Bob','Jones','555292999','4227 SomePlace Ln','Unit 3','Bowling Green','KY','45459'),

    ('Bob','Jones','555292999','4227 Some Place Ln.','Ste 3','Bowling Green','KY','45459'),

    ('Louis','Armstrong','147852369','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstrong','147852369','123 Baldwin Ave',NULL,'Jacksonville','FL','32209'),

    ('Louis','Armstrong','147852369','123 Baldwin Anenue',NULL,'Jachsonville','FL','32209'),

    ('Louis','Armstrong','147852639','123 Baldwin Ave.',NULL,'Jacksonville','FL','32209'),

    ('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

    ('Mark','Adam','654789321','524 Main St.','Unit 42','Bakersville','NC','33304'),

    ('Mark','Adams','654789231','524 Main St.','Unit 42','Barkersville','NC','33304'),

    ('Mark','Adams','654789321','524 Main St.','Unit 42','Bakersville','NC','33304');

    -- Take a look at the base data.

    SELECT * FROM #People p;

    ----------------------------------------

    This next section works essentially on the same "triangular join" principal that other's have posted.

    The real key here is the assigning of weights. There will be some trial and error here, as you discover that certain attributes are stronger indicators of identity than others.

    As it's posted, I kept it very simple, looking only at exact matches. In real life, I've gone as far as splitting the SSN into individual characters and matching the individual characters (giving max points to perfect matches and reduced points for partial matches to allow for transposed characters)... and removing all non-alphanumeric characters from names & addresses to help reduce the impact of typos. Anyway, you get the point... You can make it a simple or complex as you'd like...

    WITH ColumnScores AS (-- Assign weights to the various attributes

    SELECT

    p1.PersonID AS ParentID,

    p2.PersonID AS ChildID,

    CASE WHEN p1.FirstName = p2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN p1.LastName = p2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN p1.SSN = p2.SSN THEN 6 ELSE 0 END AS SSN_Score,

    CASE WHEN p1.Address_1 = p2.Address_1 THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN COALESCE(p1.Address_2, '') = COALESCE(p2.Address_2, '') THEN 1 ELSE 0 END AS Add2_Score,

    CASE WHEN p1.City = p2.City THEN 1 ELSE 0 END AS CityScore,

    CASE WHEN p1.State = p2.State THEN 1 ELSE 0 END AS StateScore,

    CASE WHEN p1.Zip = p2.Zip THEN 1 ELSE 0 END AS ZipScore

    FROM

    #People p1

    JOIN #People p2

    ON p1.PersonID < p2.PersonID

    )

    INSERT #MatchScores (ParentID,ChildID,FirstNameScore,LastNameScore,SSN_Score,

    Add1_Score,Add2_Score,CityScore,StateScore,ZipScore,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.CityScore,

    cs.StateScore,

    cs.ZipScore,

    cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

    cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    cs.FirstNameScore + cs.LastNameScore + cs.SSN_Score + cs.Add1_Score +

    cs.Add2_Score + cs.CityScore + cs.StateScore + cs.ZipScore > 7 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    Here is what the data in #MatchScores should look like at this point...

    ParentID ChildID FirstNameScore LastNameScore SSN_Score Add1_Score Add2_Score CityScore StateScore ZipScore TotalScore

    ----------- ----------- -------------- ------------- ----------- ----------- ----------- ----------- ----------- ----------- -----------

    1 2 0 2 6 2 0 1 1 1 13

    1 3 0 2 6 0 0 1 1 1 11

    1 4 0 2 6 0 0 1 1 1 11

    2 3 0 2 6 0 0 1 1 1 11

    2 4 0 2 6 0 1 1 1 1 12

    3 4 0 2 6 2 0 1 1 1 13

    5 6 0 0 6 0 1 1 1 0 9

    5 7 0 2 6 0 1 1 1 1 12

    5 8 1 0 6 2 1 1 1 1 13

    6 7 0 0 6 0 1 1 1 0 9

    6 8 0 0 6 0 1 1 1 0 9

    7 8 0 0 6 0 1 1 1 1 10

    9 10 0 2 6 0 0 0 1 1 10

    9 11 1 2 6 0 1 1 1 0 12

    9 12 1 2 6 2 1 1 1 0 14

    10 11 0 2 6 0 0 0 1 0 9

    10 12 0 2 6 0 0 0 1 0 9

    11 12 1 2 6 0 1 1 1 0 12

    13 14 1 2 6 2 0 1 1 1 14

    13 16 1 2 0 2 0 1 1 1 8

    14 16 1 2 0 2 0 1 1 1 8

    15 16 1 2 6 0 0 1 1 1 12

    17 18 1 2 6 0 1 1 1 1 13

    17 19 1 2 6 0 1 0 1 1 12

    17 20 1 2 0 2 1 1 1 1 9

    18 19 1 2 6 0 1 0 1 1 12

    21 22 1 0 6 2 1 1 1 1 13

    21 23 1 2 0 2 1 0 1 1 8

    21 24 1 2 6 2 1 1 1 1 15

    22 24 1 0 6 2 1 1 1 1 13

    23 24 1 2 0 2 1 0 1 1 8

    Note that the original 24 rows of test data has become 31 rows AND all of the matches have a score of 8 or better...

    Lowering the threshold, would have included more rows and raising it would include fewer rows.

    My choice to use ">7" was fairly arbitrary. My suggestion would be to start with a lower threshold and raise it gradually until you agree with the final grouping results.

    My next post will cover the rest of the script... To be continued...

  • Ok... Now is when we start refining the results that landed in #MatchScores. Again, the idea is the we want the single best parent for each child and those that don't have a good parent, become the beginning of their own new group.

    The following section simply grabs the distinct list of child values and their best matched parents. Note, that it is quite possible that a given child mat have 2 equally good parent candidates. In this case it doesn't matter which one is selected, we only need one of them.

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    At this point, the data in #BestMatches should look like this...

    ParentID ChildID BestScore Lft Rgt

    ----------- ----------- ----------- ----------- -----------

    1 2 13 NULL NULL

    2 3 11 NULL NULL

    3 4 13 NULL NULL

    5 6 9 NULL NULL

    5 7 12 NULL NULL

    5 8 13 NULL NULL

    9 10 10 NULL NULL

    9 11 12 NULL NULL

    9 12 14 NULL NULL

    13 14 14 NULL NULL

    15 16 12 NULL NULL

    17 18 13 NULL NULL

    18 19 12 NULL NULL

    17 20 9 NULL NULL

    21 22 13 NULL NULL

    21 23 8 NULL NULL

    21 24 15 NULL NULL

    If you're paying attention, you'll note that there are only 17 rows of data, which means we're missing 7 from our original dataset. Don't worry, we'll take care of that in the next step...

    ----------------------------------------------------------------

    At this point we have all of the children in the #BestMatches table but we still need to get the remaining 7 rows into the table...

    The reason the 7 rows were omitted in the place is due to the fact that they don't have a suitable parent in the data... So we'll also create a fictional "master parent" row to act as a temporary parent for the parent-less...

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)-- Creates the master parent row... Note the values being assigned...

    UNION ALL

    SELECT DISTINCT-- This will get the remaining "parent-less" rows. Note that the ParentID is set to 0 to match the Child = 0 of the master parent.

    0 AS ParentID,

    ms.ParentID AS ChildID,

    100 AS BestScore

    FROM #MatchScores ms

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE ms.ParentID = bm.ChildID);

    Important note on the "Master Parent" section... The ParentID MUST be set to NULL for the dbo.CreateNestedSets proc to work correctly.

    Also, the ParentID if the remaining 7 rows is set to 0 to match the ChildID of 0 on the master parent... It not imperative that you use 0, they just need to match and it has to be a value that doesn't exist in #People.PersonID.

    The reason we need to create the Master Parent is due to the fact that, while we want multiple, small hierarchies, the dbo.CreateNestedSets proc can only handle a singe hierarchy. The Master Parent simply provides temporary "common root" for processing.

    Anyway, this what #BestMatches should look like now... (the original 17 + 7 parent-less + 1 master parent)

    ParentID ChildID BestScore Lft Rgt

    ----------- ----------- ----------- ----------- -----------

    1 2 13 NULL NULL

    2 3 11 NULL NULL

    3 4 13 NULL NULL

    5 6 9 NULL NULL

    5 7 12 NULL NULL

    5 8 13 NULL NULL

    9 10 10 NULL NULL

    9 11 12 NULL NULL

    9 12 14 NULL NULL

    13 14 14 NULL NULL

    15 16 12 NULL NULL

    17 18 13 NULL NULL

    18 19 12 NULL NULL

    17 20 9 NULL NULL

    21 22 13 NULL NULL

    21 23 8 NULL NULL

    21 24 15 NULL NULL

    NULL 0 100 NULL NULL

    0 1 100 NULL NULL

    0 5 100 NULL NULL

    0 9 100 NULL NULL

    0 13 100 NULL NULL

    0 15 100 NULL NULL

    0 17 100 NULL NULL

    0 21 100 NULL NULL

    I'll cover the rest in just one more post... To be continued...

  • Ok... So this is where the real magic happens...

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets-- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL;-- We no longer need the master parent, so it gets deleted.

    There's lot's of good information about how nested sets work so I won't get into it here but I would encourage you to (at the very least) read Jeff's article Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets[/url], as it provided the basis for the dbo.CreateNestedSets proc.

    And... Love him or hate him... Joe Celko has some awesome articles on the subject...

    Now that the Lft & Rgt values have been set, and the master parent deleted, the #BestMatches table should look like this... (ORDER BY Lft)

    ParentID ChildID BestScore Lft Rgt

    ----------- ----------- ----------- ----------- -----------

    0 1 100 2 9

    1 2 13 3 8

    2 3 11 4 7

    3 4 13 5 6

    0 5 100 10 17

    5 6 9 11 12

    5 7 12 13 14

    5 8 13 15 16

    0 9 100 18 25

    9 10 10 19 20

    9 11 12 21 22

    9 12 14 23 24

    0 13 100 26 29

    13 14 14 27 28

    0 15 100 30 33

    15 16 12 31 32

    0 17 100 34 41

    17 18 13 35 38

    18 19 12 36 37

    17 20 9 39 40

    0 21 100 42 49

    21 22 13 43 44

    21 23 8 45 46

    21 24 15 47 48

    If you're not at all familiar with nested set hierarchies or how they work, take a few minutes to read the following Wikipedia article https://en.wikipedia.org/wiki/Nested_set_model, so that you have a basic understanding of how it works.

    ---------------------------------------------------

    The last & final step is to use the data from #BestMatches to update the #People table...

    UPDATE p SET

    p.GroupID = bm.GroupID,

    p.MatchQuality = bm.BestScore

    FROM

    #People p

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM (-- Derived table that pulls out just the "Parent" rows. Their PersonID will become the groups GroupID.

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2-- This is where the "Child" rows are brought into scope...

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt-- This where the power of nested sets really shines! Any row that has a Lft value (or right, both work) that's between

    -- the Grpous.Lft and Groups.Rgt is a member of that group. It's as simple as that.

    ) bm

    ON p.PersonID = bm.ChildID;

    Now that #People has been updated, lets have a look...

    PersonID FirstName LastName SSN Address_1 Address_2 City State Zip GroupID MatchQuality

    ----------- -------------------- -------------------- --------- ------------------------------ ------------------------------ ------------------------------ ----- ----- ----------- ------------

    1 Bob Jones 333224444 258 Grants Trail #3 Bowling Green KY 45459 1 100

    2 Robert Jones 333224444 258 Grants Trail Ste 3 Bowling Green KY 45459 1 13

    3 Rob Jones 333224444 258 Grants Tr. NULL Bowling Green KY 45459 1 11

    4 R. Jones 333224444 258 Grants Tr. Ste 3 Bowling Green KY 45459 1 13

    5 Louis Armstrong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 100

    6 L. Armstrng 258321478 123 Humming Bird Lane NULL Jacksonville FL 32207 5 9

    7 Louie Armstrong 258321478 123 Humming Bird Lain NULL Jacksonville FL 32209 5 12

    8 Louis Armstong 258321478 123 Humming Bird Ln. NULL Jacksonville FL 32209 5 13

    9 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33304 9 100

    10 M. Adams 321456987 555 Bay Meadows Dr. # 42 Ashvlle NC 33304 9 10

    11 Mark Adams 321456987 555 Baymeadows Drive Unit 42 Ashville NC 33305 9 12

    12 Mark Adams 321456987 555 Baymeadows Dr. Unit 42 Ashville NC 33306 9 14

    13 Bob Jones 555229999 4227 Some Place Ln. #3 Bowling Green KY 45459 13 100

    14 Bob Jones 555229999 4227 Some Place Ln. # 3 Bowling Green KY 45459 13 14

    15 Bob Jones 555292999 4227 SomePlace Ln Unit 3 Bowling Green KY 45459 15 100

    16 Bob Jones 555292999 4227 Some Place Ln. Ste 3 Bowling Green KY 45459 15 12

    17 Louis Armstrong 147852369 123 Baldwin Ave. NULL Jacksonville FL 32209 17 100

    18 Louis Armstrong 147852369 123 Baldwin Ave NULL Jacksonville FL 32209 17 13

    19 Louis Armstrong 147852369 123 Baldwin Anenue NULL Jachsonville FL 32209 17 12

    20 Louis Armstrong 147852639 123 Baldwin Ave. NULL Jacksonville FL 32209 17 9

    21 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 100

    22 Mark Adam 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 13

    23 Mark Adams 654789231 524 Main St. Unit 42 Barkersville NC 33304 21 8

    24 Mark Adams 654789321 524 Main St. Unit 42 Bakersville NC 33304 21 15

    Scroll all the way over to the right to see how everything was matched up and the quality of the match.

    Also note... That row 15 split away (taking row 16 with it) for form a new group rather than joining rows 13 & 14.

    To the human eye, it's apparent that 13, 14, 15 & 16 are the same person, but the transposition in the SSN and other inconsistencies were enough to make SQL Server "think" they are two separate individuals.

    I did that intentionally to illustrate the necessity of analyzing your data and setting score values and the threshold accordingly.

    Anyway, I hold this helps.

    If you're still having problems let me know. If Erika hasn't blown my house down, I'll try to help.

  • craig.bobchin (8/28/2015)


    Jayanth_Kurup (8/27/2015)


    I think your looking at the problem the wrong way. using t-sql to do fuzzy lookups is kind of tricky since the db is not really suited to doing pattern matches. Sure you can write really complicated code but it still wont be bullet proof. Have to explored using the Fuzzy grouping task in SSIS.

    Check out this video [/url]

    All you might need is to schedule the package to run frequently.

    Jayanth,

    I watched the video, and it does give me some more information. However how would I handle it in my situation where I have multiple columns that I need to match on and not all of them have valid nor consistent data between records?

    Do I need multiple fuzzy grouping transforms or do I do it in one?

    How is the canonical field value determined?

    How do I assign a new key unique to the to the grouping?

    If you are looking to build a dictionary then you could probably do it in one fuzzy grouping task. I havent tried multiple columns but I'll try it out and post something on the blog.

    The canonical field value is the output of the Fuzzy Grouping task and is determined on 2 criteria. Similarity and Confidence. Similarity dictates how similar two words are for example

    James Smith and James Smiht will have a high similarity ( you can set this values as shown in the video) . Confidence is where SQL tells you how confident it feels that the two words are related e.g Road and Rd or Saint and St. this is not controlled by the user.

    Once you do fuzzy grouping , you can use Fuzzy lookup to perform the mathc on the columns you need.

    Jayanth Kurup[/url]

  • Jayanth & Jason, I am out of town this weekend, and will give your suggestions a try when I get back to the office on Monday. Thanks

  • Jason A. Long (8/28/2015)


    craig.bobchin (8/28/2015)


    Jason,

    I modified your code for our table/fields and it doesn't quite work. I set the threshold to >=4 and of the 30k records in the table 8818 came back as Nulls in both the GroupID and MatchQuality fields. 19377 had a -1 in group ID and MatchQuality between 4 and 100.

    The 2000 had positive integers as GroupID with some grouping correctly under a group ID and others not grouping correctly.

    Not sure how I can tweak this to fix it.

    Give me a few minutes... I'll break up the original code into more digestible chunks and add better comments. Hopefully that'll help you to see where you made a wrong turn.

    If that doesn't work, it may be necessary to post your actual table structure along with some test data (just make sure you aren't posting peoples real information).

    BRB...

    I'd posted the actual table structure in my 1st post, but here it is again for Convience.

    CREATE TABLE [dbo].[dimMember](

    [dimMemberId] [int] IDENTITY(1,1) NOT NULL,

    [dimSourceSystemId] [int] NOT NULL CONSTRAINT [DF_dimMember_dimSourceSystemId] DEFAULT ((-1)),

    [MemberCode] [nvarchar](50) NOT NULL,

    [FirstName] [nvarchar](250) NOT NULL,

    [LastName] [nvarchar](250) NOT NULL,

    [MiddleInitial] [nvarchar](5) NOT NULL,

    [dimDayidDOB] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayidDOB] DEFAULT ((-1)),

    [dimDayIdDeath] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdDeath] DEFAULT ((99991231)),

    [Gender] [nvarchar](10) NOT NULL,

    [SubcriberRelationship] [nvarchar](50) NOT NULL,

    [SSN] [nvarchar](25) NOT NULL CONSTRAINT [DF_dimMember_SSN] DEFAULT (N'n/a'),

    [ExecutionGUID] [uniqueidentifier] NOT NULL,

    [RowRecordedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowRecordedDate] DEFAULT (getdate()),

    [isCurrent] [bit] NOT NULL CONSTRAINT [DF_dimMember_isCurrent] DEFAULT ((1)),

    [SubscriberCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_SubscriberId] DEFAULT (N'n/a'),

    [RowUpdatedDate] [datetime] NOT NULL CONSTRAINT [DF_dimMember_RowUpdatedDate] DEFAULT (getdate()),

    [dimDayIdSubscriberStart] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberStart] DEFAULT ((1)),

    [dimDayIdSubscriberEnd] [int] NOT NULL CONSTRAINT [DF_dimMember_dimDayIdSubscriberEnd] DEFAULT ((99991231)),

    [Race] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Race] DEFAULT (N'n/a'),

    [Ethnicity] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_Ethnicity] DEFAULT (N'n/a'),

    [PrimaryLanguage] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_PrimaryLanguage] DEFAULT (N'n/a'),

    [CurrentAddress1] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress] DEFAULT (N'n/a'),

    [CurrentAddress2] [nvarchar](250) NOT NULL CONSTRAINT [DF_dimMember_CurrentAddress2] DEFAULT (N'n/a'),

    [CurrentZipCode] [nvarchar](50) NOT NULL CONSTRAINT [DF_dimMember_CurrentZipCode] DEFAULT (N'n/a'),

    CONSTRAINT [PK_dimMember] PRIMARY KEY CLUSTERED

    And here are 50 real records with SSNs, Names and Addresses cleaned.

    dimMemberIddimSourceSystemIdMemberCodeSSN_cleanSubscriberCodeFirstNameLastNameCurrentAddress1CurrentAddress2dimDayidDOB

    271979000000000An/a131270013DWIGHTDAVIS13541 MCGyvern/a19520719

    31333000000000An/a131270013DWIGHTDAVISn/an/a19520719

    101873001188830An/a83450002ValerieCarsonn/an/a19260224

    4029610002328534A893682519U1080380201LorraineMurphy7553 JEAN STn/a19430617

    10477326951540185011698112430004TimJohnstonn/an/a19721221

    50893003227819A39018178783370010AbrahamNashn/an/a19320728

    4875510003423059A960870920U1080652401MalcolmChapmann/an/a19470212

    261669003523491An/a143410021CamilleHernandez50339 DRAKE DRn/a19430909

    12483003523491An/a143410021CamilleHernandezn/an/a19430909

    35973005162758D688404262070110001DoyleDiazn/an/a19231219

    48633005164084A11701687181220010BradleyHaynesn/an/a19130414

    56693007129199C1625782583102450005MarshaWatkinsn/an/a19550519

    11833374460274934833093122020001AustinCurtisn/an/a19671007

    282249008267625An/a142550001StuartBurton632 Whitewood DR# 19419340627

    54623008267625An/a142550001StuartBurtonn/an/a19340627

    33233008300514An/a62140007WayneTorresn/an/a19410621

    45513009140867A11122473582420016WallaceOwenn/an/a19270323

    32223010096833D672071570081220019GlenStonen/an/a19141010

    32253010101062D54117567580860001KathleenGarcian/an/a19210728

    62013010300959A19917303991210009JuneFieldsn/an/a19390425

    101593011265529A20080431470240007PeteChambersn/an/a19100812

    123703011707399A21190858181910007SpencerRogersn/an/a19850625

    4893310012806835M149806974U1080170301MelodyNicholsn/an/a19340122

    60353013142597A918319918102710014EdithRiveran/an/a19210221

    274139013760369Mn/a132050013LarryByrd6120 MIDDLEBELT 610n/a19270616

    36103013760369Mn/a132050013LarryByrdn/an/a19270616

    103923014072336C1n/a103480006ArthurDeann/an/a19530901

    50603014243998An/a111460001LeoGuzmann/an/a19301127

    17713015248827C1199160226103270003KimFletchern/an/a19580609

    264409015248827C1n/a103270003KimFletcher1274 LIBRARY STREETn/a19580609

    315939015406256An/a143360013TerenceLloyd11391 NOTTINGHAM RDn/a19420409

    120473015406256An/a143360013TerenceLloydn/an/a19420409

    110163015761904M10019973070430007HarrietBarbern/an/a19210829

    46843017166029D7139045616101390014MartinWebbn/an/a19240418

    53073018189654A20760687183010018AndrewDelgadon/an/a19210108

    103993019125304C1n/a111730022FredaPricen/an/a19490821

    1763019267292A124653445102770004CarolynWagnern/an/a19330426

    89063019521356An/a112720015RonTaylorn/an/a19590407

    273169020469264Mn/a131150001SheliaMorales13101 FREUD APT 309n/a19420428

    33843020469264Mn/a131150001SheliaMoralesn/an/a19420428

    318049021146794D2n/a141180006ErnestQuinn307 SALLIOTTE RDAPT 20519450708

    125013021146794D2n/a141180006ErnestQuinnn/an/a19450708

    6093021181540D93463842081190001NealJenkinsn/an/a19210310

    318229021401723An/a111730011GarrettMaxwell26601 COERLIDGE WAYn/a19520228

    125343021401723An/a111730011GarrettMaxwelln/an/a19520228

    573023181017A69289280073650020SoniaLittlen/an/a19210929

    124063023305600A61962504081500003NoelMartinezn/an/a19420627

    290289023626262An/a132120009MarcellaFranklin16031 BEECH DALY RDTRLR 13719811213

    69383023626262An/a132120009MarcellaFranklinn/an/a19811213

    4191910023626262A190816359U1080337801MarcellaFranklin16031 BEECH DALY RD TRLR 137n/a19811213

  • Please format your test data in the form an insert statement. I can't do anything with in the form it's in now.

  • Jason A. Long (8/31/2015)


    Please format your test data in the form an insert statement. I can't do anything with in the form it's in now.

    Sorry, Here it is:

    INSERT INTO [dbo].[dimMember]

    ([dimSourceSystemId]

    ,[MemberCode]

    ,[SSN]

    ,[SubscriberCode]

    ,[FirstName]

    ,[LastName]

    ,[CurrentAddress1]

    ,[CurrentAddress2]

    ,[dimDayidDOB]

    )

    VALUES

    ('9','000000000A','n/a','131270013','Miguel','Gardner','13541 MCGyver','n/a','19520719'),

    ('3','000000000A','n/a','131270013','Miguel','Gardner','n/a','n/a','19520719'),

    ('3','001188830A','n/a','83450002','Valerie','Carson','n/a','n/a','19260224'),

    ('10','002328534A','893682519','U1080380201','Lorraine','Murphy','7553 JEAN ST','n/a','19430617'),

    ('3','26951540','185011698','112430004','Tim','Johnston','n/a','n/a','19721221'),

    ('3','003227819A','390181787','83370010','Abraham','Nash','n/a','n/a','19320728'),

    ('10','003423059A','960870920','U1080652401','Malcolm','Chapman','n/a','n/a','19470212'),

    ('9','003523491A','n/a','143410021','Camille','Hernandez','50339 DRAKE DR','n/a','19430909'),

    ('3','003523491A','n/a','143410021','Camille','Hernandez','n/a','n/a','19430909'),

    ('3','005162758D6','884042620','70110001','Doyle','Diaz','n/a','n/a','19231219'),

    ('3','005164084A','117016871','81220010','Bradley','Haynes','n/a','n/a','19130414'),

    ('3','007129199C1','625782583','102450005','Marsha','Watkins','n/a','n/a','19550519'),

    ('3','74460274','934833093','122020001','Austin','Curtis','n/a','n/a','19671007'),

    ('9','008267625A','n/a','142550001','Stuart','Burton','632 Whitewood DR','# 194','19340627'),

    ('3','008267625A','n/a','142550001','Stuart','Burton','n/a','n/a','19340627'),

    ('3','008300514A','n/a','62140007','Wayne','Torres','n/a','n/a','19410621'),

    ('3','009140867A','111224735','82420016','Wallace','Owen','n/a','n/a','19270323'),

    ('3','010096833D6','720715700','81220019','Glen','Stone','n/a','n/a','19141010'),

    ('3','010101062D','541175675','80860001','Kathleen','Garcia','n/a','n/a','19210728'),

    ('3','010300959A','199173039','91210009','June','Fields','n/a','n/a','19390425'),

    ('3','011265529A','200804314','70240007','Pete','Chambers','n/a','n/a','19100812'),

    ('3','011707399A','211908581','81910007','Spencer','Rogers','n/a','n/a','19850625'),

    ('10','012806835M','149806974','U1080170301','Melody','Nichols','n/a','n/a','19340122'),

    ('3','013142597A','918319918','102710014','Edith','Rivera','n/a','n/a','19210221'),

    ('9','013760369M','n/a','132050013','Larry','Byrd','6120 MIDDLEBELT 610','n/a','19270616'),

    ('3','013760369M','n/a','132050013','Larry','Byrd','n/a','n/a','19270616'),

    ('3','014072336C1','n/a','103480006','Arthur','Dean','n/a','n/a','19530901'),

    ('3','014243998A','n/a','111460001','Leo','Guzman','n/a','n/a','19301127'),

    ('3','015248827C1','199160226','103270003','Kim','Fletcher','n/a','n/a','19580609'),

    ('9','015248827C1','n/a','103270003','Kim','Fletcher','1274 LIBRARY STREET','n/a','19580609'),

    ('9','015406256A','n/a','143360013','Terence','Lloyd','11391 NOTTINGHAM RD','n/a','19420409'),

    ('3','015406256A','n/a','143360013','Terence','Lloyd','n/a','n/a','19420409'),

    ('3','015761904M','100199730','70430007','Harriet','Barber','n/a','n/a','19210829'),

    ('3','017166029D7','139045616','101390014','Martin','Webb','n/a','n/a','19240418'),

    ('3','018189654A','207606871','83010018','Andrew','Delgado','n/a','n/a','19210108'),

    ('3','019125304C1','n/a','111730022','Freda','Price','n/a','n/a','19490821'),

    ('3','019267292A','124653445','102770004','Carolyn','Wagner','n/a','n/a','19330426'),

    ('3','019521356A','n/a','112720015','Ron','Taylor','n/a','n/a','19590407'),

    ('9','020469264M','n/a','131150001','Shelia','Morales','13101 FREUD APT 309','n/a','19420428'),

    ('3','020469264M','n/a','131150001','Shelia','Morales','n/a','n/a','19420428'),

    ('9','021146794D2','n/a','141180006','Ernest','Quinn','307 SALLIOTTE RD','APT 205','19450708'),

    ('3','021146794D2','n/a','141180006','Ernest','Quinn','n/a','n/a','19450708'),

    ('3','021181540D','934638420','81190001','Neal','Jenkins','n/a','n/a','19210310'),

    ('9','021401723A','n/a','111730011','Garrett','Maxwell','26601 COERLIDGE WAY','n/a','19520228'),

    ('3','021401723A','n/a','111730011','Garrett','Maxwell','n/a','n/a','19520228'),

    ('3','023181017A','692892800','73650020','Sonia','Little','n/a','n/a','19210929'),

    ('3','023305600A','619625040','81500003','Noel','Martinez','n/a','n/a','19420627'),

    ('9','023626262A','n/a','132120009','Marcella','Franklin','16031 BEECH DALY RD','TRLR 137','19811213'),

    ('3','023626262A','n/a','132120009','Marcella','Franklin','n/a','n/a','19811213'),

    ('10','023626262A','190816359','U1080337801','Marcella','Franklin','16031 BEECH DALY RD TRLR 137','n/a','19811213'),

  • See what you think of this...

    -- Add the two new columns to dimMember...

    --ALTER TABLE dbo.dimMember ADD GroupID INT;

    --ALTER TABLE dbo.dimMember ADD MatchQuality INT;

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    MemberCodeScore INT,

    FirstNameScore INT,

    LastNameScore INT,

    DOB_Score INT,

    SSN_Score INT,

    SubscriberCodeScore INT,

    Add1_Score INT,

    Add2_Score INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    -- Take a look at the base data.

    --SELECT * FROM dbo.dimMember dm;

    WITH ColumnScores AS (-- Assign weights to the various attributes

    SELECT

    dm1.dimMemberId AS ParentID,

    dm2.dimMemberId AS ChildID,

    CASE WHEN dm1.MemberCode = dm2.MemberCode THEN 3 ELSE 0 END AS MemberCodeScore,

    CASE WHEN dm1.FirstName = dm2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN dm1.LastName = dm2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN dm1.dimDayidDOB = dm2.dimDayidDOB THEN 1 ELSE 0 END AS DOB_Score,

    CASE WHEN NULLIF(dm1.SSN, 'n/a') = NULLIF(dm2.SSN, 'n/a') THEN 6 ELSE 0 END AS SSN_Score, -- Using NULLIF to prevent "n/a" from causing valid matches.

    CASE WHEN dm1.SubscriberCode = dm2.SubscriberCode THEN 1 ELSE 0 END AS SubscriberCodeScore,

    CASE WHEN NULLIF(dm1.CurrentAddress1, 'n/a') = NULLIF(dm2.CurrentAddress1, 'n/a') THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN NULLIF(dm1.CurrentAddress2, 'n/a') = NULLIF(dm2.CurrentAddress2, 'n/a') THEN 1 ELSE 0 END AS Add2_Score

    FROM

    dbo.dimMember dm1

    JOIN dbo.dimMember dm2

    ON dm1.dimMemberId < dm2.dimMemberId

    )

    INSERT #MatchScores (ParentID,ChildID,MemberCodeScore,FirstNameScore,LastNameScore,SSN_Score,SubscriberCodeScore,

    Add1_Score,Add2_Score,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.MemberCodeScore,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.SubscriberCodeScore,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score > 4 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    --===============================================================

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)

    UNION ALL

    SELECT DISTINCT

    0 AS ParentID,

    dm.dimMemberId AS ChildID,

    100 AS BestScore

    FROM dbo.dimMember dm-- Change from the original script... Pulling the "Parent-less" from dimMember instead of #BestMatches.

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE dm.dimMemberId = bm.ChildID);

    --==========================================================================

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets-- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL;

    --==========================================================================

    UPDATE dm SET

    dm.GroupID = bm.GroupID,

    dm.MatchQuality = bm.BestScore

    FROM

    dbo.dimMember dm

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM (

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt

    ) bm

    ON dm.dimMemberId = bm.ChildID;

    SELECT * FROM dbo.dimMember dm

    See the attached spreadsheet for the results...

  • Jason A. Long (8/31/2015)


    See what you think of this...

    -- Add the two new columns to dimMember...

    --ALTER TABLE dbo.dimMember ADD GroupID INT;

    --ALTER TABLE dbo.dimMember ADD MatchQuality INT;

    IF OBJECT_ID('tempdb..#MatchScores') IS NOT NULL

    DROP TABLE #MatchScores;

    CREATE TABLE #MatchScores (

    ParentID INT,

    ChildID INT,

    MemberCodeScore INT,

    FirstNameScore INT,

    LastNameScore INT,

    DOB_Score INT,

    SSN_Score INT,

    SubscriberCodeScore INT,

    Add1_Score INT,

    Add2_Score INT,

    TotalScore INT

    );

    IF OBJECT_ID('tempdb..#BestMatches') IS NOT NULL

    DROP TABLE #BestMatches;

    CREATE TABLE #BestMatches (

    ParentID INT,

    ChildID INT,

    BestScore INT,

    Lft INT,

    Rgt INT

    );

    -- Take a look at the base data.

    --SELECT * FROM dbo.dimMember dm;

    WITH ColumnScores AS (-- Assign weights to the various attributes

    SELECT

    dm1.dimMemberId AS ParentID,

    dm2.dimMemberId AS ChildID,

    CASE WHEN dm1.MemberCode = dm2.MemberCode THEN 3 ELSE 0 END AS MemberCodeScore,

    CASE WHEN dm1.FirstName = dm2.FirstName THEN 1 ELSE 0 END AS FirstNameScore,

    CASE WHEN dm1.LastName = dm2.LastName THEN 2 ELSE 0 END AS LastNameScore,

    CASE WHEN dm1.dimDayidDOB = dm2.dimDayidDOB THEN 1 ELSE 0 END AS DOB_Score,

    CASE WHEN NULLIF(dm1.SSN, 'n/a') = NULLIF(dm2.SSN, 'n/a') THEN 6 ELSE 0 END AS SSN_Score, -- Using NULLIF to prevent "n/a" from causing valid matches.

    CASE WHEN dm1.SubscriberCode = dm2.SubscriberCode THEN 1 ELSE 0 END AS SubscriberCodeScore,

    CASE WHEN NULLIF(dm1.CurrentAddress1, 'n/a') = NULLIF(dm2.CurrentAddress1, 'n/a') THEN 2 ELSE 0 END AS Add1_Score,

    CASE WHEN NULLIF(dm1.CurrentAddress2, 'n/a') = NULLIF(dm2.CurrentAddress2, 'n/a') THEN 1 ELSE 0 END AS Add2_Score

    FROM

    dbo.dimMember dm1

    JOIN dbo.dimMember dm2

    ON dm1.dimMemberId < dm2.dimMemberId

    )

    INSERT #MatchScores (ParentID,ChildID,MemberCodeScore,FirstNameScore,LastNameScore,SSN_Score,SubscriberCodeScore,

    Add1_Score,Add2_Score,TotalScore)

    SELECT

    cs.ParentID,

    cs.ChildID,

    cs.MemberCodeScore,

    cs.FirstNameScore,

    cs.LastNameScore,

    cs.SSN_Score,

    cs.SubscriberCodeScore,

    cs.Add1_Score,

    cs.Add2_Score,

    cs.MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score AS TotalScore

    FROM

    ColumnScores cs

    WHERE

    MemberCodeScore + cs.FirstNameScore + cs.LastNameScore + cs.DOB_Score + cs.SSN_Score + SubscriberCodeScore + cs.Add1_Score +

    cs.Add2_Score > 4 -- Set a threashold for total match quality

    ORDER BY

    cs.ParentID;

    --===============================================================

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT

    bs.ParentID,

    bs.ChildID,

    bs.TotalScore AS BestScore

    FROM

    (SELECT DISTINCT ms1.ChildID

    FROM #MatchScores ms1) c

    CROSS APPLY (

    SELECT TOP 1

    *

    FROM #MatchScores ms2

    WHERE c.ChildID = ms2.ChildID

    ORDER BY ms2.TotalScore DESC

    ) bs;

    INSERT #BestMatches (ParentID,ChildID,BestScore)

    SELECT ParentID,ChildID,BestScore FROM (VALUES (NULL, 0, 100)) x (ParentID,ChildID,BestScore)

    UNION ALL

    SELECT DISTINCT

    0 AS ParentID,

    dm.dimMemberId AS ChildID,

    100 AS BestScore

    FROM dbo.dimMember dm-- Change from the original script... Pulling the "Parent-less" from dimMember instead of #BestMatches.

    WHERE NOT EXISTS (SELECT 1 FROM #BestMatches bm WHERE dm.dimMemberId = bm.ChildID);

    --==========================================================================

    -- Use nested sets to create the groups

    EXEC dbo.CreateNestedSets-- This proc is simply a dynamic version of Jeff Modens "Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets" script.

    @TableName = '#BestMatches',

    @ChildColName = 'ChildID',

    @ParentColName = 'ParentID',

    @LeftColName = 'Lft',

    @RightColName = 'Rgt',

    @DeBug = 0;

    DELETE #BestMatches WHERE ParentID IS NULL;

    --==========================================================================

    UPDATE dm SET

    dm.GroupID = bm.GroupID,

    dm.MatchQuality = bm.BestScore

    FROM

    dbo.dimMember dm

    JOIN (

    SELECT

    Groups.ChildID AS GroupID,

    bm2.ChildID,

    bm2.BestScore

    FROM (

    SELECT

    bm1.ChildID,

    bm1.Lft,

    bm1.Rgt

    FROM

    #BestMatches bm1

    WHERE

    bm1.ParentID = 0

    ) Groups

    JOIN #BestMatches bm2

    ON bm2.Lft BETWEEN Groups.Lft AND Groups.Rgt

    ) bm

    ON dm.dimMemberId = bm.ChildID;

    SELECT * FROM dbo.dimMember dm

    See the attached spreadsheet for the results...

    Thanks Jason, I do actually have the fields in the table, I just copied the earlier version of the create table script for dimMembers.

    I'll check the code against the earlier version tomorrow morning. The spreadsheet data looks good. I've been running the original version of the script today with various thresholds and as expected getting wildly varied results.

    One thing I'm seeing is that of the 30k rows in the table almost 50% have either a -1 or a NULL in the GroupID field. I'm not sure what the deal is with that.

  • craig.bobchin (8/31/2015)


    Thanks Jason, I do actually have the fields in the table, I just copied the earlier version of the create table script for dimMembers.

    I'll check the code against the earlier version tomorrow morning. The spreadsheet data looks good. I've been running the original version of the script today with various thresholds and as expected getting wildly varied results.

    One thing I'm seeing is that of the 30k rows in the table almost 50% have either a -1 or a NULL in the GroupID field. I'm not sure what the deal is with that.

    No problem... I did see where my original code was getting NULLs based on your sample data. I found the issue and correct that in the latest code. I have absolutely no idea how you're getting -1 for anything. The GroupID is based on existing dimMemberIDs... So unless dimMemberID = -1 exists in your data, that shouldn't be possible...

    In any case, take the new version of the code for a spin and see if it gets you a little closer.

Viewing 15 posts - 16 through 30 (of 33 total)

You must be logged in to reply to this topic. Login to reply