stored procedure for tagging same contact information

Question

Post reply

stored procedure for tagging same contact information

raymond.wee.823

Grasshopper

Points: 18
More actions
September 1, 2016 at 11:48 pm

#312672

Hi,
I'm stumped at how to create a stored procedure that would group and identify contact information.
I have a table with the following columns
Name
Phone
Email
Group
Example of desired output
Name Phone Email Group
Alex 123 abc@domain 1
Alex 456 def@domain 1
Brian 123 ghi@domain 1
Carlos 890 ghi@domain 1
Diana 777 hij@domain 2
first and second row belong to same group because they have similar name
third row is part of first group because phone number matches one of the phones of the first contact
fourth row is part of first group because email matches the email address in row 3
fifth row is part of a new group because it doesn't match any of the columns

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

-- Set up some sample data to test against

-- This is not part of the solution

-- Note: the ID column is required, you could generate it on-the-fly using ROW_NUMBER()

SELECT ID, [Name], Phone, Email, [Group]

INTO #Temp

FROM (VALUES

(1, 'Alex', 123, 'abc@domain', 1),

(2, 'Alex', 456, 'def@domain', 1),

(3, 'Brian', 123, 'ghi@domain', 1),

(4, 'Carlos', 890, 'ghi@domain', 1),

(5, 'Diana', 777, 'hij@domain', 2)

) d (ID, [Name], Phone, Email, [Group])

-- This is a solution

-- You will have to renumber the Grp column using DENSE_RANK.

;WITH Firstpass AS (

SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

Grp = CASE WHEN t1.ID < x.ID OR x.ID IS NULL THEN t1.ID ELSE x.ID END

FROM #Temp t1

OUTER APPLY (

SELECT ID = MIN(ID)

FROM #Temp t2

WHERE t2.ID < t1.ID AND t2.[Name] = t1.[Name]

) x

),

Secondpass AS (

SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

FROM Firstpass t1

OUTER APPLY (

SELECT Grp = MIN(Grp)

FROM Firstpass t2

WHERE t2.Grp < t1.Grp AND (t2.Phone = t1.Phone)

) x

)

SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

FROM Secondpass t1

OUTER APPLY (

SELECT Grp = MIN(Grp)

FROM Secondpass t2

WHERE t2.Grp < t1.Grp AND (t2.Email = t1.Email)

) x

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 2

Here's a slightly different version, replacing table reads with table spools:

WITH

Firstpass AS (SELECT ID, [Name], Phone, Email, [Group],

grp1 = MIN(ID) OVER(PARTITION BY [Name]) FROM #Temp),

Secondpass AS (SELECT ID, [Name], Phone, Email, [Group],

grp2 = MIN(grp1) OVER(PARTITION BY Phone) FROM Firstpass),

Thirdpass AS (SELECT ID, [Name], Phone, Email, [Group],

grp3 = MIN(grp2) OVER(PARTITION BY Email) FROM Secondpass)

SELECT ID, [Name], Phone, Email, [Group],

FinalGroup = DENSE_RANK() OVER(ORDER BY grp3) FROM Thirdpass

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

raymond.wee.823 Grasshopper Points: 18 More actions · Answer 3

Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:

1. Sort the whole list by name

2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)

3. Sort the whole list by phone

4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records

5. Sort the whole list by email

5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records

Thanks

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 4

raymond.wee.823 (9/2/2016)
Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:
1. Sort the whole list by name
2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)
3. Sort the whole list by phone
4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records
5. Sort the whole list by email
5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records
Thanks

Try the second method I posted, Raymond. In my test harness it runs through 20,000 rows in less than a second.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 5

raymond.wee.823 (9/2/2016)
Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:
1. Sort the whole list by name
2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)
3. Sort the whole list by phone
4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records
5. Sort the whole list by email
5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records
Thanks

If you break out those chained CTEs into #temp tables it takes 3s for 20,000 rows:

DROP TABLE #Firstpass

DROP TABLE #Secondpass

SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

Grp = CASE WHEN t1.ID < x.ID OR x.ID IS NULL THEN t1.ID ELSE x.ID END

INTO #Firstpass

FROM #Temp t1

OUTER APPLY (

SELECT ID = MIN(ID)

FROM #Temp t2

WHERE t2.ID < t1.ID AND t2.[Name] = t1.[Name]

) x

SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

INTO #Secondpass

FROM #Firstpass t1

OUTER APPLY (

SELECT Grp = MIN(Grp)

FROM #Firstpass t2

WHERE t2.Grp < t1.Grp AND (t2.Phone = t1.Phone)

) x

SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],

Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END

FROM #Secondpass t1

OUTER APPLY (

SELECT Grp = MIN(Grp)

FROM #Secondpass t2

WHERE t2.Grp < t1.Grp AND (t2.Email = t1.Email)

) x

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 6

What should the result be if a row matches two or more established groups? For example, there is a 'Smith' group and a 'Jones' group; then, you get a row that has the name Smith, but a phone number already associated with the Jones group.

Can a row belong to more than one group?

If not, is there a priority to the columns for assigning it to one group?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 7

The Dixie Flatline (9/2/2016)
What should the result be if a row matches two or more established groups? For example, there is a 'Smith' group and a 'Jones' group; then, you get a row that has the name Smith, but a phone number already associated with the Jones group.
Can a row belong to more than one group?
If not, is there a priority to the columns for assigning it to one group?

As I understand it Bob, the Smith group join the Jones group (or vice versa, depending on which has the lowest group number).

I've seen this requirement before - identification of "Family" or "Household" in marketing databases.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

The Dixie Flatline SSC Guru Points: 53354 More actions · Answer 8

So you have to try all possible matches and just take the one with the lowest group number?

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 9

The Dixie Flatline (9/5/2016)
So you have to try all possible matches and just take the one with the lowest group number?

That's a fair description of what the code does - but to make it clear, it matches each column sequentially.

Group by name, assign the lowest ID in the partition to each row within it (as grp1). Singletons will have grp1 equal to their ID.

Group by phone assigning the lowest [grp1] to each row in a partition as [grp2].

Group by email, assign the lowest [grp2] to each row in a partition, as [grp3], the final group.

Best demonstrated with some code:

DROP TABLE #Temp

SELECT ID, [Name], Phone, Email, [Group]

INTO #Temp

FROM (VALUES

(1, 'Alex', 123, 'abc@domain', 1),

(2, 'Alex', 456, 'def@domain', 1),

(3, 'Brian', 123, 'ghi@domain', 1),

(4, 'Carlos', 890, 'ghi@domain', 1),

(5, 'Diana', 777, 'hij@domain', 2),

(6, 'Dave', 123, 'hijkl@domain', 1),

(7, 'Robert', 123, 'xptl@domain', 1),

(8, 'Diane', 778, 'hij@domain', 2)

) d (ID, [Name], Phone, Email, [Group]);

WITH

Firstpass AS (SELECT ID, [Name], Phone, Email, [Group],

grp1 = MIN(ID) OVER(PARTITION BY [Name]) FROM #Temp),

Secondpass AS (SELECT ID, [Name], Phone, Email, [Group], grp1,

grp2 = MIN(grp1) OVER(PARTITION BY Phone) FROM Firstpass),

Thirdpass AS (SELECT ID, [Name], Phone, Email, [Group], grp1, grp2,

grp3 = MIN(grp2) OVER(PARTITION BY Email) FROM Secondpass)

SELECT ID, [Name], Phone, Email, [Group], grp1, grp2, grp3,

FinalGroup = DENSE_RANK() OVER(ORDER BY grp3) FROM Thirdpass

ORDER BY ID

Edit: this type of problem is frequently encountered (and hence solved) in marketing - I wonder if there's a name for it?

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden