Linking rows based on matching fields

  • I have a sub-set of customers :

    CREATE TABLE #temptable ( [ID] int, [Mobile] nvarchar(20), [DOB] date, [Email] nvarchar(50), [PostCode] nvarchar(10) )
    INSERT INTO #temptable
    VALUES
    ( 1, N'078', N'1970-01-01T00:00:00', N'bob@bob.com', N'L1 3RT' ),
    ( 2, N'079', N'1981-01-02T00:00:00', N'bob@bob.com', N'L1 3RT' ),
    ( 3, N'034', N'1986-05-03T00:00:00', N'bert@aol.com', N'CH1 3ER' ),
    ( 4, N'078', N'1970-01-01T00:00:00', N'bob2@bob.com', N'L1 3RT1' )

    I need to detect associations between customers - this is done by matching customers where any 2 of the 4 fields (Mobile, DOB, Email, PostCode) match.

    The record size is approx 2.5m.

    Everything I've tried to date is too slow, for instance self-joins on pairs of fields.   I've not tried cursors as I think they'll be a no-no too.

    The result I'm after is a table with 2 columns :   ID, RelatedID   (Each ID can have 1 or many RelatedIDs)

    For instance the result on the sample data would be :


    ID         RelatedID
    1           1
    1           2
    1           4
    2           1
    3           3
    4           1
    4           4

        
    I wondered what the optimal way of doing this is.  To be honest I'm beginning to think it's not even up SQL's street.

    I wondered if there's anything in T-SQL 2016 that comes in handy in this respect?
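
    The matching rule can be pinned down outside SQL first. Below is a minimal brute-force sketch in Python (illustration only - an O(n²) pairwise scan like this is exactly what's too slow at 2.5m rows) showing what "any 2 of the 4 fields" means on the sample data, with self-matches left out:

```python
from itertools import combinations

# Sample customers: (ID, Mobile, DOB, Email, PostCode)
rows = [
    (1, '078', '1970-01-01', 'bob@bob.com',  'L1 3RT'),
    (2, '079', '1981-01-02', 'bob@bob.com',  'L1 3RT'),
    (3, '034', '1986-05-03', 'bert@aol.com', 'CH1 3ER'),
    (4, '078', '1970-01-01', 'bob2@bob.com', 'L1 3RT1'),
]

pairs = []
for a, b in combinations(rows, 2):
    # Count how many of the 4 fields agree; 2 or more => related
    matches = sum(x == y for x, y in zip(a[1:], b[1:]))
    if matches >= 2:
        pairs += [(a[0], b[0]), (b[0], a[0])]  # the relation is symmetric

print(sorted(pairs))  # [(1, 2), (1, 4), (2, 1), (4, 1)]
```

    On the sample rows only (1, 2) and (1, 4) share two or more fields (Email/PostCode and Mobile/DOB respectively), each reported in both directions.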

  • Well, I can get you down to:

    Scan count 12, logical reads 12, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Here's the query:

    CREATE TABLE #temptable (
        ID int,
        Mobile nvarchar(20),
        DOB date,
        Email nvarchar(50),
        PostCode nvarchar(10)
    );
    INSERT INTO #temptable (ID, Mobile, DOB, Email, PostCode)
    VALUES
            (1, N'078', N'1970-01-01T00:00:00', N'bob@bob.com', N'L1 3RT'),
            (2, N'079', N'1981-01-02T00:00:00', N'bob@bob.com', N'L1 3RT'),
            (3, N'034', N'1986-05-03T00:00:00', N'bert@aol.com', N'CH1 3ER'),
            (4, N'078', N'1970-01-01T00:00:00', N'bob2@bob.com', N'L1 3RT1');

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT T.ID, T2.ID AS RelatedID, 'Mobile / DOB' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Mobile = T.Mobile
            AND T2.DOB = T.DOB
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'Mobile / Email' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Mobile = T.Mobile
            AND T2.Email = T.Email
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'Mobile / PostCode' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Mobile = T.Mobile
            AND T2.PostCode = T.PostCode
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'DOB / Email' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.DOB = T.DOB
            AND T2.Email = T.Email
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'DOB / PostCode' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.DOB = T.DOB
            AND T2.PostCode = T.PostCode
    UNION
    SELECT T.ID, T2.ID AS RelatedID, 'Email / PostCode' AS RelatedFields
    FROM #temptable AS T
        INNER JOIN #temptable AS T2
            ON T.ID <> T2.ID
            AND T2.Email = T.Email
            AND T2.PostCode = T.PostCode
    ORDER BY ID, RelatedID;

    DROP TABLE #temptable;

    It covers all 6 possible pairs of 2 fields, which would appear to be the only necessary combinations.   The larger that number, the worse things get.  Let me know if that's any improvement over what you have.
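
    That 6 is just C(4,2), the number of ways to choose 2 fields out of 4. A quick check (purely illustrative - the fifth field name below is made up) confirms it, and shows how the branch count grows if more fields are ever added:

```python
from itertools import combinations

fields = ['Mobile', 'DOB', 'Email', 'PostCode']

# Every unordered pair of fields corresponds to one UNION branch above
pairs = list(combinations(fields, 2))
print(len(pairs))                                       # 6
print(len(list(combinations(fields + ['Phone2'], 2))))  # 10
```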

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Just another idea: unpivot the columns you want to compare and count the number of matches per pair of IDs. At least it makes it easier to add more columns to compare.

    WITH ColumnValues AS
    (
      SELECT
       ID, ColumnID, ColumnValue
      FROM
       #temptable
      CROSS APPLY
      (
       VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
      ) CA(ColumnID, ColumnValue)
    )
    SELECT
      CV1.ID, CV2.ID
    FROM
      ColumnValues CV1
    JOIN
      ColumnValues CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
    GROUP BY
      CV1.ID, CV2.ID
    HAVING
      COUNT(*) > 1
    ORDER BY
      CV1.ID, CV2.ID;

    It's probably a good idea to persist the outcome of the CTE ColumnValues in a temporary table and put an index on it.

    SELECT
      ID, ColumnID, ColumnValue
    INTO
      #t
    FROM
      #temptable
    CROSS APPLY
    (
      VALUES (1, CAST(Mobile AS NVARCHAR(50))), (2, CAST(DOB AS NVARCHAR(50))), (3, Email), (4, CAST(PostCode AS NVARCHAR(50)))
    ) CA(ColumnID, ColumnValue);

    CREATE CLUSTERED INDEX PK_T ON #t(ColumnID, ColumnValue);

    SELECT
      CV1.ID, CV2.ID
    FROM
      #t CV1
    JOIN
      #t CV2 ON CV2.ColumnID = CV1.ColumnID AND CV2.ColumnValue = CV1.ColumnValue
    GROUP BY
      CV1.ID, CV2.ID
    HAVING
      COUNT(*) > 1
    ORDER BY
      CV1.ID, CV2.ID;
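
    If it helps to see why the HAVING COUNT(*) > 1 does the work, here's the same unpivot-and-count idea sketched in Python (an illustration of the logic, not a substitute for the T-SQL): each (ColumnID, ColumnValue) bucket acts as a tiny self-join, and an ID pair sharing two or more values lands in two or more buckets. Like the query, it keeps self-pairs such as (1, 1).

```python
from collections import defaultdict
from itertools import product

rows = [
    (1, '078', '1970-01-01', 'bob@bob.com',  'L1 3RT'),
    (2, '079', '1981-01-02', 'bob@bob.com',  'L1 3RT'),
    (3, '034', '1986-05-03', 'bert@aol.com', 'CH1 3ER'),
    (4, '078', '1970-01-01', 'bob2@bob.com', 'L1 3RT1'),
]

# "Unpivot": bucket IDs by (ColumnID, ColumnValue), mirroring #t
buckets = defaultdict(list)
for r in rows:
    for col_id, value in enumerate(r[1:], start=1):
        buckets[(col_id, value)].append(r[0])

# Self-join within each bucket, then the HAVING COUNT(*) > 1 filter
pair_counts = defaultdict(int)
for ids in buckets.values():
    for id1, id2 in product(ids, repeat=2):
        pair_counts[(id1, id2)] += 1

related = sorted(p for p, n in pair_counts.items() if n > 1)
print(related)
# [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2), (3, 3), (4, 1), (4, 4)]
```

    Note the output includes (2, 2) as well as the other self-pairs, which matches the P.S. below about the sample output.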

    P.S.  Since (1,1), (3,3) and (4,4) are part of the expected output, I guess (2,2) is missing in your sample output.

  • Nice query, Peter.   Only 2 scans and 2 logical reads.  Here's a slightly modified version that can also tell you which field pairs were involved:


    WITH ColumnValues AS (

        SELECT ID, ColumnID, ColumnName, ColumnValue
        FROM #temptable
            CROSS APPLY
                (
                VALUES    (1, 'Mobile', CAST(Mobile AS nvarchar(50))),
                        (2, 'DOB', CAST(DOB AS nvarchar(50))),
                        (3, 'Email', Email),
                        (4, 'PostCode', CAST(PostCode AS nvarchar(50)))
                ) AS CA(ColumnID, ColumnName, ColumnValue)
    )
    SELECT CV1.ID, CV2.ID AS RelatedID, MIN(CV1.ColumnName) + ' / ' + MAX(CV2.ColumnName) AS RelatedFields
    FROM ColumnValues AS CV1
        JOIN ColumnValues AS CV2
            ON CV2.ColumnID = CV1.ColumnID
            AND CV2.ColumnValue = CV1.ColumnValue
            AND CV2.ID <> CV1.ID
    GROUP BY CV1.ID, CV2.ID
    HAVING COUNT(*) > 1
    ORDER BY CV1.ID, CV2.ID;

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • With 4 rows as test input, the number of scans and logical reads are a little bit meaningless. The version with the persisted CTE ColumnValues resulted in 17 scans (1 table scan on #t and 16 index lookups). With 2.5m rows as the real input (resulting in 10m rows in #t), I expect two clustered index scans on #t and a merge join followed by a sort for the aggregation. Since the row size of #t is very small, I guess this might be a relatively fast solution. But again, just guessing.

  • I'll test queries and let you know the stats.

    Lee

  • Thanks for providing the solutions.  It's interesting to see how the UNIONs can be removed with CROSS APPLY (good learning for the future).  sgmunson, nice way to deliver the match type - thanks 🙂

    I persisted the CTE and added the clustered index (nice optimisation - I will note it for the future).   With 2.5m rows the query was too expensive - it ran for over 2 hours before I killed it. This matches my earlier experience.

    There's been a change in requirements which let me reduce the data set to 65K rows.  After un-pivoting this grows to 260K.

    Running the equi-join was very fast with the smaller dataset.

    SQL Server parse and compile time:
     CPU time = 10 ms, elapsed time = 10 ms.
    Table '#t_'. Scan count 10, logical reads 4160, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 22076, logical reads 429364, physical reads 0, read-ahead reads 553, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Workfile'. Scan count 20, logical reads 4984, physical reads 423, read-ahead reads 4561, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
     CPU time = 6155 ms, elapsed time = 2956 ms.

    Cheers.

  • leehbi - Friday, February 17, 2017 9:19 AM


    Before you post again, please read a book on basic RDBMS, so you won't post garbage. A table has to have a key, but there is no such thing as a generic "id" in RDBMS; it has to be the identifier of something in particular. The ANSI/ISO standard SQL does not recognize the "T" separator in temporal data (frankly, I think that is a flaw and I should have voted against it). You do know that an identifier can never be numeric? You do not do any math on identifiers. I hope 🙁
    CREATE TABLE Customers
    (mobile_something CHAR(3) NOT NULL,
    birth_date DATE NOT NULL,
    customers_email CHAR(256) NOT NULL PRIMARY KEY,
    post_code VARCHAR(10) NOT NULL);

    Did you miss the first week of your database class? The definition of a primary key? It has to be made up of attributes in the table and not a physical count of things in the storage of the table. We also have a DATE data type in SQL that you failed to use in the insertion statement. Here is an attempt to correct what you did:

    INSERT INTO Customers
    VALUES
    ('078', '1970-01-01', 'bob@bob.com', 'L1 3RT'),
    ('079', '1981-01-02', 'bob@bob.com', 'L1 3RT'), -- error
    ('034', '1986-05-03', 'bert@aol.com', 'CH1 3ER'),
    ('078', '1970-01-01', 'bob2@bob.com', 'L1 3RT1');

    The first two lines of the insertion statement are wrong; they have a duplicate primary key, which is the email address. This should have been handled with an update of some kind that could change the birth date. We really cannot fix the fact that you have a bad system with improper performance.
    >>I need to detect associations between customers - this is done by matching customers where any 2 fields of [sic] the 4 (Mobile, birth_date, Email, PostCode) match. <<
    Do you understand the fundamental concept of RDBMS? Rows in the table have a key which makes them unique. You should have been scrubbing your data before you put it in the table to guarantee that you have actual real records. What you are doing is the way we used to do it with mag tapes and punch cards: throw a lot of bad data in a file, sort it, and then scan for pairs as possible duplicates. We do not do that anymore.

    >>The record [sic: rows are not records!] size is approx 2.5m. <<
    Please learn the difference between rows and records; it is one of your major problems. Also, what do you mean the size of a record is 2.5m?
    >>Everything I've tried to date is too slow, for instance self-joins on pairs of fields [sic]. <<

    Do you understand the problem is that you are trying to clean up the mess after you make it? Do not let the bad data get in in the first place.

    >>I've not tried cursors as think this will be a no no too. <<
    Oh yes! This is like walking onto a vegan website and saying that you want to eat fried babies. :crying::crying:

    Please post DDL and follow ANSI/ISO standards when asking for help. 
