September 1, 2016 at 11:48 pm
Hi,
I'm stumped at how to create a stored procedure that would group and identify contact information.
I have a table with the following columns
Name
Phone
Group
Example of desired output
Name Phone Email Group
Alex 123 abc@domain 1
Alex 456 def@domain 1
Brian 123 ghi@domain 1
Carlos 890 ghi@domain 1
Diana 777 hij@domain 2
first and second row belong to same group because they have similar name
third row is part of first group because phone number matches one of the phones of the first contact
fourth row is part of first group because email matches the email address in row 3
fifth row is part of a new group because it doesn't match any of the columns
September 2, 2016 at 3:58 am
-- Set up some sample data to test against
-- This is not part of the solution
-- Note: the ID column is required, you could generate it on-the-fly using ROW_NUMBER()
SELECT ID, [Name], Phone, Email, [Group]
INTO #Temp
FROM (VALUES
(1, 'Alex', 123, 'abc@domain', 1),
(2, 'Alex', 456, 'def@domain', 1),
(3, 'Brian', 123, 'ghi@domain', 1),
(4, 'Carlos', 890, 'ghi@domain', 1),
(5, 'Diana', 777, 'hij@domain', 2)
) d (ID, [Name], Phone, Email, [Group])
-- This is a solution
-- You will have to renumber the Grp column using DENSE_RANK.
;WITH Firstpass AS (
SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],
Grp = CASE WHEN t1.ID < x.ID OR x.ID IS NULL THEN t1.ID ELSE x.ID END
FROM #Temp t1
OUTER APPLY (
SELECT ID = MIN(ID)
FROM #Temp t2
WHERE t2.ID < t1.ID AND t2.[Name] = t1.[Name]
) x
),
Secondpass AS (
SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],
Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END
FROM Firstpass t1
OUTER APPLY (
SELECT Grp = MIN(Grp)
FROM Firstpass t2
WHERE t2.Grp < t1.Grp AND (t2.Phone = t1.Phone)
) x
)
SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],
Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END
FROM Secondpass t1
OUTER APPLY (
SELECT Grp = MIN(Grp)
FROM Secondpass t2
WHERE t2.Grp < t1.Grp AND (t2.Email = t1.Email)
) x
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 2, 2016 at 4:51 am
Here's a slightly different version, replacing table reads with table spools:
WITH
Firstpass AS (SELECT ID, [Name], Phone, Email, [Group],
grp1 = MIN(ID) OVER(PARTITION BY [Name]) FROM #Temp),
Secondpass AS (SELECT ID, [Name], Phone, Email, [Group],
grp2 = MIN(grp1) OVER(PARTITION BY Phone) FROM Firstpass),
Thirdpass AS (SELECT ID, [Name], Phone, Email, [Group],
grp3 = MIN(grp2) OVER(PARTITION BY Email) FROM Secondpass)
SELECT ID, [Name], Phone, Email, [Group],
FinalGroup = DENSE_RANK() OVER(ORDER BY grp3) FROM Thirdpass
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 2, 2016 at 5:01 am
Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:
1. Sort the whole list by name
2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)
3. Sort the whole list by phone
4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records
5. Sort the whole list by email
5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records
Thanks
September 2, 2016 at 5:24 am
raymond.wee.823 (9/2/2016)
Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:1. Sort the whole list by name
2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)
3. Sort the whole list by phone
4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records
5. Sort the whole list by email
5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records
Thanks
Try the second method I posted, Raymond. In my test harness it runs through 20,000 rows in less than a second.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 2, 2016 at 5:40 am
raymond.wee.823 (9/2/2016)
Hi, it looked promising. However, I tried it with 10000 records and it took 10 minutes just to go through the first 1000 records so it is a performance issue. I have a solution in mind however just not sure how to put it in stored procedure. Here goes in pseudocode:1. Sort the whole list by name
2. ignoring blanks, set the group Id for matching names (this should result into same names grouped together)
3. Sort the whole list by phone
4. ignoring blank or 0 for phone, if the group id for the records with matching phone are all 0, use a new group id, otherwise set all group ids to min(group id) of the matching records
5. Sort the whole list by email
5. ignoring blank entries for email address, if the group id for the records with matching phone are all 0s, use a new group id, otherwise set all group ids to min(group id) of the matching records
Thanks
If you break out those chained CTEs into #temp tables it takes 3s for 20,000 rows:
DROP TABLE #Firstpass
DROP TABLE #Secondpass
SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],
Grp = CASE WHEN t1.ID < x.ID OR x.ID IS NULL THEN t1.ID ELSE x.ID END
INTO #Firstpass
FROM #Temp t1
OUTER APPLY (
SELECT ID = MIN(ID)
FROM #Temp t2
WHERE t2.ID < t1.ID AND t2.[Name] = t1.[Name]
) x
SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],
Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END
INTO #Secondpass
FROM #Firstpass t1
OUTER APPLY (
SELECT Grp = MIN(Grp)
FROM #Firstpass t2
WHERE t2.Grp < t1.Grp AND (t2.Phone = t1.Phone)
) x
SELECT t1.ID, t1.[Name], t1.Phone, t1.Email, t1.[Group],
Grp = CASE WHEN t1.Grp < x.Grp OR x.Grp IS NULL THEN t1.Grp ELSE x.Grp END
FROM #Secondpass t1
OUTER APPLY (
SELECT Grp = MIN(Grp)
FROM #Secondpass t2
WHERE t2.Grp < t1.Grp AND (t2.Email = t1.Email)
) x
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 2, 2016 at 6:22 am
What should the result be if a row matches two or more established groups? For example, there is a 'Smith' group and a 'Jones' group; then, you get a row that has the name Smith, but a phone number already associated with the Jones group.
Can a row belong to more than one group?
If not, is there a priority to the columns for assigning it to one group?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
September 5, 2016 at 1:48 am
The Dixie Flatline (9/2/2016)
What should the result be if a row matches two or more established groups? For example, there is a 'Smith' group and a 'Jones' group; then, you get a row that has the name Smith, but a phone number already associated with the Jones group.Can a row belong to more than one group?
If not, is there a priority to the columns for assigning it to one group?
As I understand it Bob, the Smith group join the Jones group (or vice versa, depending on which has the lowest group number).
I've seen this requirement before - identification of "Family" or "Household" in marketing databases.
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
September 5, 2016 at 1:02 pm
So you have to try all possible matches and just take the one with the lowest group number?
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
September 6, 2016 at 1:44 am
The Dixie Flatline (9/5/2016)
So you have to try all possible matches and just take the one with the lowest group number?
That's a fair description of what the code does - but to make it clear, it matches each column sequentially.
Group by name, assign the lowest ID in the partition to each row within it (as grp1). Singletons will have grp1 equal to their ID.
Group by phone assigning the lowest [grp1] to each row in a partition as [grp2].
Group by email, assign the lowest [grp2] to each row in a partition, as [grp3], the final group.
Best demonstrated with some code:
DROP TABLE #Temp
SELECT ID, [Name], Phone, Email, [Group]
INTO #Temp
FROM (VALUES
(1, 'Alex', 123, 'abc@domain', 1),
(2, 'Alex', 456, 'def@domain', 1),
(3, 'Brian', 123, 'ghi@domain', 1),
(4, 'Carlos', 890, 'ghi@domain', 1),
(5, 'Diana', 777, 'hij@domain', 2),
(6, 'Dave', 123, 'hijkl@domain', 1),
(7, 'Robert', 123, 'xptl@domain', 1),
(8, 'Diane', 778, 'hij@domain', 2)
) d (ID, [Name], Phone, Email, [Group]);
WITH
Firstpass AS (SELECT ID, [Name], Phone, Email, [Group],
grp1 = MIN(ID) OVER(PARTITION BY [Name]) FROM #Temp),
Secondpass AS (SELECT ID, [Name], Phone, Email, [Group], grp1,
grp2 = MIN(grp1) OVER(PARTITION BY Phone) FROM Firstpass),
Thirdpass AS (SELECT ID, [Name], Phone, Email, [Group], grp1, grp2,
grp3 = MIN(grp2) OVER(PARTITION BY Email) FROM Secondpass)
SELECT ID, [Name], Phone, Email, [Group], grp1, grp2, grp3,
FinalGroup = DENSE_RANK() OVER(ORDER BY grp3) FROM Thirdpass
ORDER BY ID
Edit: this type of problem is frequently encountered (and hence solved) in marketing - I wonder if there's a name for it?
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 10 posts - 1 through 10 (of 10 total)
You must be logged in to reply to this topic. Login to reply