Linking rows based on matching fields

  • I have a sub-set of customers :

    CREATE TABLE #temptable ( [ID] int, [Mobile] nvarchar(20), [DOB] date, [Email] nvarchar(50), [PostCode] nvarchar(10) )
    INSERT INTO #temptable
    VALUES
    ( 1, N'078', N'1970-01-01T00:00:00', N'bob@bob.com', N'L1 3RT' ),
    ( 2, N'079', N'1981-01-02T00:00:00', N'bob@bob.com', N'L1 3RT' ),
    ( 3, N'034', N'1986-05-03T00:00:00', N'bert@aol.com', N'CH1 3ER' ),
    ( 4, N'078', N'1970-01-01T00:00:00', N'bob2@bob.com', N'L1 3RT1' )

    I need to detect associations between customers - this is done by matching customers where any 2 of the 4 fields (Mobile, DOB, Email, PostCode) match.

    The record size is approx 2.5m.

    Everything I've tried to date is too slow, for instance self-joins on pairs of fields.   I've not tried cursors as I think they'll be a no-no too.

    The result I'm after is a table with 2 columns :   ID, RelatedID   (Each ID can have 1 or many RelatedIDs)

    For instance the result on the sample data would be :


    ID         RelatedID
    1           1
    1           2
    1           4
    2           1
    3           3
    4           1
    4           4

        
    I wondered what the optimal way of doing this is.  To be honest I'm beginning to think it's not even up SQL's street.

    I wondered if there's anything in T-SQL 2016 that comes in handy in this respect?
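
    The matching rule can be pinned down outside SQL first. Below is a minimal brute-force sketch in Python (illustration only - an O(n²) pairwise scan like this is exactly what's too slow at 2.5m rows) showing what "any 2 of the 4 fields" means on the sample data, with self-matches left out:

```python
from itertools import combinations

# Sample customers: (ID, Mobile, DOB, Email, PostCode)
rows = [
    (1, '078', '1970-01-01', 'bob@bob.com',  'L1 3RT'),
    (2, '079', '1981-01-02', 'bob@bob.com',  'L1 3RT'),
    (3, '034', '1986-05-03', 'bert@aol.com', 'CH1 3ER'),
    (4, '078', '1970-01-01', 'bob2@bob.com', 'L1 3RT1'),
]

pairs = []
for a, b in combinations(rows, 2):
    # Count how many of the 4 fields agree; 2 or more => related
    matches = sum(x == y for x, y in zip(a[1:], b[1:]))
    if matches >= 2:
        pairs += [(a[0], b[0]), (b[0], a[0])]  # the relation is symmetric

print(sorted(pairs))  # [(1, 2), (1, 4), (2, 1), (4, 1)]
```

    On the sample rows only (1, 2) and (1, 4) share two or more fields (Email/PostCode and Mobile/DOB respectively), each reported in both directions.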

  • Well, I can get you down to:

    Scan count 12, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Here's the query:

    CREATE TABLE #temptable (
        ID int,
        Mobile nvarchar(20),
        DOB date,
        Email nvarchar(50),
        PostCode nvarchar(10)
    );
    INSERT INTO #temptable (ID, Mobile, DOB, Email, PostCode)
    VALUES
            (1, N'078', N'1970-01-01T00:00:00', N'bob@bob.com', N'L1 3RT'),
            (2, N'079', N'1981-01-02T00:00:00', N'bob@bob.com', N'L1 3RT'),
            (3, N'034', N'1986-05-03T00:00:00', N'bert@aol.com', N'CH1 3ER'),
            (4, N'078', N'1970-01-01T00:00:00', N'bob2@bob.com', N'L1 3RT1');

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT T.ID, T2.ID AS RelatedID, 'Mobile / DOB' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Mobile = T.Mobile
            AND T2.DOB = T.DOB
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'Mobile / Email' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Mobile = T.Mobile
            AND T2.Email = T.Email
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'Mobile / PostCode' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Mobile = T.Mobile
            AND T2.PostCode = T.PostCode
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'DOB / Email' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.DOB = T.DOB
            AND T2.Email = T.Email
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'DOB / PostCode' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.DOB = T.DOB
            AND T2.PostCode = T.PostCode
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'Email / PostCode' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Email = T.Email
            AND T2.PostCode = T.PostCode
    ORDER BY ID, RelatedID;

    DROP TABLE #temptable;

    It covers all 6 possible pairs of 2 fields, which would appear to be the only necessary combinations.   The larger that number, the worse things get.  Let me know if that's any improvement over what you have.
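
    That 6 is just C(4,2), the number of ways to choose 2 fields out of 4. A quick check (purely illustrative - the fifth field name below is made up) confirms it, and shows how the branch count grows if more fields are ever added:

```python
from itertools import combinations

fields = ['Mobile', 'DOB', 'Email', 'PostCode']

# Every unordered pair of fields corresponds to one UNION branch above
pairs = list(combinations(fields, 2))
print(len(pairs))                                       # 6
print(len(list(combinations(fields + ['Phone2'], 2))))  # 10
```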

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Just another idea: unpivot the columns you want to compare and count the number of matches per pair of IDs. At least it makes it easier to add more columns to compare.

    WITH ColumnValues AS
    (
      SELECT
       ID, ColumnID, ColumnValue
      FROM
       #temptable
      CROSS APPLY
      (
       VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
      ) CA(ColumnID, ColumnValue)
    )
    SELECT
      CV1.ID, CV2.ID
    FROM
      ColumnValues CV1
    JOIN
      ColumnValues CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
    GROUP BY
      CV1.ID, CV2.ID
    HAVING
      COUNT(*) > 1
    ORDER BY
      CV1.ID, CV2.ID;

    It's probably a good idea to persist the outcome of the CTE ColumnValues in a temporary table and put an index on it.

    SELECT
      ID, ColumnID, ColumnValue
    INTO
      #t
    FROM
      #temptable
    CROSS APPLY
    (
      VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
    ) CA(ColumnID, ColumnValue);

    CREATE CLUSTERED INDEX PK_T ON #t(ColumnID, ColumnValue);

    SELECT
      CV1.ID, CV2.ID
    FROM
      #t CV1
    JOIN
      #t CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
    GROUP BY
      CV1.ID, CV2.ID
    HAVING
      COUNT(*) > 1
    ORDER BY
      CV1.ID, CV2.ID;
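
    If it helps to see why the HAVING COUNT(*) > 1 does the work, here's the same unpivot-and-count idea sketched in Python (an illustration of the logic, not a substitute for the T-SQL): each (ColumnID, ColumnValue) bucket acts as a tiny self-join, and an ID pair sharing two or more values lands in two or more buckets. Like the query, it keeps self-pairs such as (1, 1).

```python
from collections import defaultdict
from itertools import product

rows = [
    (1, '078', '1970-01-01', 'bob@bob.com',  'L1 3RT'),
    (2, '079', '1981-01-02', 'bob@bob.com',  'L1 3RT'),
    (3, '034', '1986-05-03', 'bert@aol.com', 'CH1 3ER'),
    (4, '078', '1970-01-01', 'bob2@bob.com', 'L1 3RT1'),
]

# "Unpivot": bucket IDs by (ColumnID, ColumnValue), mirroring #t
buckets = defaultdict(list)
for r in rows:
    for col_id, value in enumerate(r[1:], start=1):
        buckets[(col_id, value)].append(r[0])

# Self-join within each bucket, then the HAVING COUNT(*) > 1 filter
pair_counts = defaultdict(int)
for ids in buckets.values():
    for id1, id2 in product(ids, repeat=2):
        pair_counts[(id1, id2)] += 1

related = sorted(p for p, n in pair_counts.items() if n > 1)
print(related)
# [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (3, 3), (4, 1), (4, 4)]
```

    Note the output includes (2, 2) as well as the other self-pairs, which matches the P.S. below about the sample output.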

    P.S.  Since (1,1), (3,3) and (4,4) are part of the expected output, I guess (2,2) is missing in your sample output.

  • Nice query, Peter.   Only 2 scans and 2 logical reads.  Here's a slightly modified version that can also tell you which field pairs were involved:


    WITH ColumnValues AS (

        SELECT ID, ColumnID, ColumnName, ColumnValue
        FROM #temptable
            CROSS APPLY
                (
                VALUES    (1, 'Mobile', CAST(Mobile AS nvarchar(50))),
                        (2, 'DOB', CAST(DOB AS nvarchar(50))),
                        (3, 'Email', Email),
                        (4, 'PostCode', CAST(PostCode AS nvarchar(50)))
                ) AS CA(ColumnID, ColumnName, ColumnValue)
    )
    SELECT CV1.ID, CV2.ID AS RelatedID, MIN(CV1.ColumnName) + ' / ' + MAX(CV2.ColumnName) AS RelatedFields
    FROM ColumnValues AS CV1
        JOIN ColumnValues AS CV2
            ON CV2.ColumnID = CV1.ColumnID
            AND CV2.ColumnValue = CV1.ColumnValue
            AND CV2.ID <> CV1.ID
    GROUP BY CV1.ID, CV2.ID
    HAVING COUNT(*) > 1
    ORDER BY CV1.ID, CV2.ID;

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • With 4 rows as test input, the number of scans and logical reads are a little bit meaningless. The version with the persisted CTE ColumnValues resulted in 17 scans (1 table scan on #t and 16 index lookups). With 2.5m rows as the real input (resulting in 10m rows in #t), I expect two clustered index scans on #t and a merge join followed by a sort for the aggregation. Since the row size of #t is very small, I guess this might be a relatively fast solution. But again, just guessing.

  • I'll test queries and let you know the stats.

    Lee

  • Thanks for providing the solutions.  It's interesting to see how the UNIONs can be removed with CROSS APPLY (good learning for the future).  sgmunson, nice way to deliver the match type - thanks 🙂

    I persisted the CTE and added the clustered index (nice optimisation - I will note it for the future).   With 2.5m rows the query was too expensive - it ran for over 2 hours before I killed it. This matches my earlier experience.

    There's been a change in requirements which let me reduce the data set to 65K rows.  After un-pivoting this grows to 260K.

    Running the equi-join was very fast with the smaller dataset.

    SQL Server parse and compile time:
     CPU time = 10 ms, elapsed time = 10 ms.
    Table '#t_'. Scan count 10, logical reads 4160, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 22076, logical reads 429364, physical reads 0, read-ahead reads 553, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 20, logical reads 4984, physical reads 423, read-ahead reads 4561, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 6155 ms, elapsed time = 2956 ms.

    Cheers.

  • leehbi - Friday, February 17, 2017 9:19 AM


    Before you post again, please read a book on basic RDBMS, so you won't post garbage. A table has to have a key, but there is no such thing as a generic "id" in RDBMS; it has to be the identifier of something in particular. The ANSI/ISO standard SQL does not recognize the "T" separator in temporal data (frankly, I think that is a flaw and I should have voted against it). You do know that an identifier can never be numeric? You do not do any math on identifiers. I hope 🙁
    CREATE TABLE Customers
    (mobile_something CHAR(3) NOT NULL,
    birth_date DATE NOT NULL,
    customers_email CHAR(256) NOT NULL PRIMARY KEY,
    post_code VARCHAR(10) NOT NULL);

    Did you miss the first week of your database class? The definition of a primary key? It has to be made up of attributes in the table and not a physical count of things in the storage of the table. We also have a DATE data type in SQL that you failed to use in the insertion statement. Here is an attempt to correct what you did:

    INSERT INTO Customers
    VALUES
    ('078', '1970-01-01', 'bob@bob.com', 'L1 3RT'),
    ('079', '1981-01-02', 'bob@bob.com', 'L1 3RT'), -- error
    ('034', '1986-05-03', 'bert@aol.com', 'CH1 3ER'),
    ('078', '1970-01-01', 'bob2@bob.com', 'L1 3RT1');

    The first two lines of the insertion statement are wrong; they have a duplicate primary key, which is the email address. This should have been handled with an update of some kind that could change the birth date. We really cannot fix the fact that you have a bad system with improper performance.
    >>I need to detect associations between customers - this is done by matching customers where any 2 fields of [sic] the 4 (Mobile, birth_date, Email, PostCode) match. <<
    Do you understand the fundamental concept of RDBMS? Rows in the table have a key which makes them unique. You should have been scrubbing your data before you put it in the table to guarantee that you have actual real records. What you are doing is the way we used to do it with mag tapes and punch cards: throw a lot of bad data in a file, sort it, and then scan for pairs as possible duplicates. We do not do that anymore.

    >>The record [sic: rows are not records!] size is approx 2.5m. <<
    Please learn the difference between rows and records; it is one of your major problems. Also, what do you mean the size of a record is 2.5m?
    >>Everything I've tried to date is too slow, for instance self-joins on pairs of fields [sic]. <<

    Do you understand the problem is that you are trying to clean up the mess after you make it? Do not let the bad data get in in the first place.

    >>I've not tried cursors as think this will be a no no too. <<
    Oh yes! This is like walking onto a vegan website and saying that you want to eat fried babies. :crying::crying:

    Please post DDL and follow ANSI/ISO standards when asking for help. 
