stored procedure for tagging same contact information

  • Hi,

    I'm stumped at how to create a stored procedure that would group and identify contact information.

    I have a table with the following columns

    Name

    Phone

    Email

    Group

    Example of desired output

    Name Phone Email Group

    Alex 123 abc@domain 1

    Alex 456 def@domain 1

    Brian 123 ghi@domain 1

    Carlos 890 ghi@domain 1

    Diana 777 hij@domain 2

    first and second row belong to same group because they have similar name

    third row is part of first group because phone number matches one of the phones of the first contact

    fourth row is part of first group because email matches the email address in row 3

    fifth row is part of a new group because it doesn't match any of the columns

  • -- Set up some sample data to test against

    -- This is not part of the solution

    -- Note: the ID column is required, you could generate it on-the-fly using ROW_NUMBER()

    SELECT ID, [Name], Phone, Email, [Group]

    INTO #Temp

    FROM (VALUES

    (1, 'Alex', 123, 'abc@domain', 1),

    (2, 'Alex', 456, 'def@domain', 1),

    (3, 'Brian', 123, 'ghi@domain', 1),

    (4, 'Carlos', 890, 'ghi@domain', 1),

    (5, 'Diana', 777, 'hij@domain', 2)

    ) d (ID, [Name], Phone, Email, [Group])

    -- This is a solution

    -- You will have to renumber the Grp column using DENSE_RANK.

    ;WITH Firstpass AS (

    SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

    Grp = CASE WHEN t1.ID < x.ID OR x.ID IS NULL THEN t1.ID ELSE x.ID END

    FROM #Temp t1

    OUTER APPLY (

    SELECT ID = MIN(ID)

    FROM #Temp t2

    WHERE t2.ID < t1.ID AND t2.[Name] = t1.[Name]

    ) x

    ),

    Secondpass AS (

    SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

    Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

    FROM Firstpass t1

    OUTER APPLY (

    SELECT Grp = MIN(Grp)

    FROM Firstpass t2

    WHERE t2.Grp < t1.Grp AND (t2.Phone = t1.Phone)

    ) x

    )

    SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

    Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

    FROM Secondpass t1

    OUTER APPLY (

    SELECT Grp = MIN(Grp)

    FROM Secondpass t2

    WHERE t2.Grp < t1.Grp AND (t2.Email = t1.Email)

    ) x

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Here's a slightly different version, replacing table reads with table spools:

    WITH

    Firstpass AS (SELECT ID, [Name], Phone, Email, [Group],

    grp1 = MIN(ID) OVER(PARTITION BY [Name]) FROM #Temp),

    Secondpass AS (SELECT ID, [Name], Phone, Email, [Group],

    grp2 = MIN(grp1) OVER(PARTITION BY Phone) FROM Firstpass),

    Thirdpass AS (SELECT ID, [Name], Phone, Email, [Group],

    grp3 = MIN(grp2) OVER(PARTITION BY Email) FROM Secondpass)

    SELECT ID, [Name], Phone, Email, [Group],

    FinalGroup = DENSE_RANK() OVER(ORDER BY grp3) FROM Thirdpass

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:

    1. Sort the whole list by name

    2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)

    3. Sort the whole list by phone

    4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records

    5. Sort the whole list by email

    5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records

    Thanks

  • raymond.wee.823 (9/2/2016)


    Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:

    1. Sort the whole list by name

    2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)

    3. Sort the whole list by phone

    4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records

    5. Sort the whole list by email

    5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records

    Thanks

    Try the second method I posted, Raymond. In my test harness it runs through 20,000 rows in less than a second.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • raymond.wee.823 (9/2/2016)


    Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:

    1. Sort the whole list by name

    2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)

    3. Sort the whole list by phone

    4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records

    5. Sort the whole list by email

    5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records

    Thanks

    If you break out those chained CTEs into #temp tables it takes 3s for 20,000 rows:

    DROP TABLE #Firstpass

    DROP TABLE #Secondpass

    SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

    Grp = CASE WHEN t1.ID < x.ID OR x.ID IS NULL THEN t1.ID ELSE x.ID END

    INTO #Firstpass

    FROM #Temp t1

    OUTER APPLY (

    SELECT ID = MIN(ID)

    FROM #Temp t2

    WHERE t2.ID < t1.ID AND t2.[Name] = t1.[Name]

    ) x

    SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

    Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

    INTO #Secondpass

    FROM #Firstpass t1

    OUTER APPLY (

    SELECT Grp = MIN(Grp)

    FROM #Firstpass t2

    WHERE t2.Grp < t1.Grp AND (t2.Phone = t1.Phone)

    ) x

    SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

    Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

    FROM #Secondpass t1

    OUTER APPLY (

    SELECT Grp = MIN(Grp)

    FROM #Secondpass t2

    WHERE t2.Grp < t1.Grp AND (t2.Email = t1.Email)

    ) x

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • What should the result be if a row matches two or more established groups? For example, there is a 'Smith' group and a 'Jones' group; then, you get a row that has the name Smith, but a phone number already associated with the Jones group.

    Can a row belong to more than one group?

    If not, is there a priority to the columns for assigning it to one group?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/2/2016)


    What should the result be if a row matches two or more established groups? For example, there is a 'Smith' group and a 'Jones' group; then, you get a row that has the name Smith, but a phone number already associated with the Jones group.

    Can a row belong to more than one group?

    If not, is there a priority to the columns for assigning it to one group?

    As I understand it Bob, the Smith group join the Jones group (or vice versa, depending on which has the lowest group number).

    I've seen this requirement before - identification of "Family" or "Household" in marketing databases.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • So you have to try all possible matches and just take the one with the lowest group number?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (9/5/2016)


    So you have to try all possible matches and just take the one with the lowest group number?

    That's a fair description of what the code does - but to make it clear, it matches each column sequentially.

    Group by name, assign the lowest ID in the partition to each row within it (as grp1). Singletons will have grp1 equal to their ID.

    Group by phone assigning the lowest [grp1] to each row in a partition as [grp2].

    Group by email, assign the lowest [grp2] to each row in a partition, as [grp3], the final group.

    Best demonstrated with some code:

    DROP TABLE #Temp

    SELECT ID, [Name], Phone, Email, [Group]

    INTO #Temp

    FROM (VALUES

    (1, 'Alex', 123, 'abc@domain', 1),

    (2, 'Alex', 456, 'def@domain', 1),

    (3, 'Brian', 123, 'ghi@domain', 1),

    (4, 'Carlos', 890, 'ghi@domain', 1),

    (5, 'Diana', 777, 'hij@domain', 2),

    (6, 'Dave', 123, 'hijkl@domain', 1),

    (7, 'Robert', 123, 'xptl@domain', 1),

    (8, 'Diane', 778, 'hij@domain', 2)

    ) d (ID, [Name], Phone, Email, [Group]);

    WITH

    Firstpass AS (SELECT ID, [Name], Phone, Email, [Group],

    grp1 = MIN(ID) OVER(PARTITION BY [Name]) FROM #Temp),

    Secondpass AS (SELECT ID, [Name], Phone, Email, [Group], grp1,

    grp2 = MIN(grp1) OVER(PARTITION BY Phone) FROM Firstpass),

    Thirdpass AS (SELECT ID, [Name], Phone, Email, [Group], grp1, grp2,

    grp3 = MIN(grp2) OVER(PARTITION BY Email) FROM Secondpass)

    SELECT ID, [Name], Phone, Email, [Group], grp1, grp2, grp3,

    FinalGroup = DENSE_RANK() OVER(ORDER BY grp3) FROM Thirdpass

    ORDER BY ID

    Edit: this type of problem is frequently encountered (and hence solved) in marketing - I wonder if there's a name for it?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply