Joining 2 Tables with Or condition

  • Hello,

    I want to join 2 tables, #test_table_1 with #test_table_2. Then I want to compare EmailAddress field on #test_table_1 with the following fields on #test_table_2, EmailAddress, Email1, and Email2.

    The join condition from #test_table_1 should compare against these 3 fields, and if any of the fields match, return only the matching row from #test_table_1

    So the result should be:

    EmailAddress FirstName LastName

    c@b.com Jack Quick

    Below is some sample data to assist. Naturally there will be many rows in each table.

    create table #test_table_1

    (EmailAddress varchar(100),FirstName varchar(100),LastName varchar(100))

    create table #test_table_2

    (EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100))

    insert into #test_table_1

    select 'c@b.com' as EmailAddress, 'Jack' as FirstName, 'Quick' as LastName

    insert into #test_table_2

    select 'a@b.com' as EmailAddress, 'b@b.com' as Email1, 'c@b.com' as Email2

    HTH. Thanks in advance..

  • create this query:

    SELECT EmailAddress, EMail1 AS Mail1

    FROM #Test_table_2

    UNION ALL

    SELECT EmailAddress, EMail2

    FROM #Test_table_2

    then join that to whatever other table you need.

    Normalization is a beautiful thing.

  • Sorry that doesn't really help me. Can you put the complete query please?

    Data won't be identical in both tables, that's why I'm trying to compare and identify when a match occurs.

  • Scratch all that.

    You have a basic normalization problem. If you had a single column/field in your table for e-mail address, then it would probably be a trivial question. So you have two options: either fix the table structure, or create a UNION query to return something that looks like a normal table structure. Then you can join your two tables on a single column and you're home free.

    Querying tables that aren't normalized is a serious nightmare. I would fix that first.

  • Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.

  • Quick suggestion, unpivot the second table and join the results to the first table.

    😎

    create table #test_table_2

    (EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100))

    INSERT INTO #test_table_1

    SELECT 'c@b.com' as EmailAddress, 'Jack' as FirstName, 'Quick' as LastName

    insert into #test_table_2

    select 'a@b.com' as EmailAddress, 'b@b.com' as Email1, 'c@b.com' as Email2

    SELECT

    X.CID

    ,X.EMAIL

    ,T1.FirstName

    ,T1.LastName

    FROM #test_table_2 T2

    CROSS APPLY

    (

    SELECT 1, T2.EmailAddress UNION ALL

    SELECT 2, T2.Email1 UNION ALL

    SELECT 3, T2.Email2

    )AS X(CID,EMAIL)

    INNER JOIN #test_table_1 T1

    ON X.EMAIL = T1.EmailAddress;

    Output

    CID EMAIL FirstName LastName

    ----- --------- ----------- ---------

    3 c@b.com Jack Quick

  • VegasL (1/9/2016)


    Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.

    If this is coming from some impoprted source, then you can normalize it. Just load the data as is in a staging table, then validate and massage until it is in the right shape. (Preferably load SSIS for this, that tool is much better equipped for the job).

    If the database design is fixed and you cannot influence it, then I pity you. For this specific problem you can probably use something liek the below (though I would not expect too much in the performance department);

    SELECT t1.EmailAdress, t1.FirstName, t1.LastName

    FROM #test_table1 AS t1

    WHERE EXISTS

    (SELECT *

    FROM #test_table2 AS t2

    WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

    But with tables designed this way, you can expect to keep running into problems that require clumsy workarounds in query, and that wreck performance.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (1/10/2016)


    VegasL (1/9/2016)


    Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.

    If this is coming from some impoprted source, then you can normalize it. Just load the data as is in a staging table, then validate and massage until it is in the right shape. (Preferably load SSIS for this, that tool is much better equipped for the job).

    If the database design is fixed and you cannot influence it, then I pity you. For this specific problem you can probably use something liek the below (though I would not expect too much in the performance department);

    SELECT t1.EmailAdress, t1.FirstName, t1.LastName

    FROM #test_table1 AS t1

    WHERE EXISTS

    (SELECT *

    FROM #test_table2 AS t2

    WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

    But with tables designed this way, you can expect to keep running into problems that require clumsy workarounds in query, and that wreck performance.

    Be very careful here as this will perform very badly, in fact it will scan the second table for every row in the first table. As it works fine with very small number of rows this has the potentials of being a database time bomb!

    😎

    To demonstrate the difference, here is a simple test harness

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'tempdb..#test_table_1') IS NOT NULL DROP TABLE #test_table_1;

    create table #test_table_1

    (X_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,EmailAddress varchar(100),FirstName varchar(100),LastName varchar(100));

    IF OBJECT_ID(N'tempdb..#test_table_2') IS NOT NULL DROP TABLE #test_table_2;

    create table #test_table_2

    (X_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100));

    DECLARE @SAMPLE_SIZE INT = 10000;

    DECLARE @VARIANCE INT = 5000;

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6)

    INSERT INTO #test_table_1 (EmailAddress,FirstName,LastName)

    SELECT

    CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

    ,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0)

    ,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0)

    FROM NUMS NM

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6)

    insert into #test_table_2 (EmailAddress,Email1,Email2)

    SELECT

    CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

    ,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

    ,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

    FROM NUMS NM;

    CREATE NONCLUSTERED INDEX NCLIDX_TMP_TEST_1_EMAIL_INCL_FIRST_LAST ON #test_table_1 (EmailAddress ASC) INCLUDE (FirstName,LastName);

    CREATE NONCLUSTERED INDEX NCLIDX_TMP_TEST_1_EMAIL_1 ON #test_table_2 (EmailAddress ASC) INCLUDE (Email1,Email2);

    DECLARE @timer TABLE (T_TEXT VARCHAR(20) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));

    DECLARE @VARCHAR_BUCKET_01 VARCHAR(100) = '';

    DECLARE @VARCHAR_BUCKET_02 VARCHAR(100) = '';

    DECLARE @VARCHAR_BUCKET_03 VARCHAR(100) = '';

    INSERT INTO @timer(T_TEXT) VALUES('UNPIVOT');

    SELECT

    @VARCHAR_BUCKET_01 = X.EMAIL

    ,@VARCHAR_BUCKET_02 = T1.FirstName

    ,@VARCHAR_BUCKET_03 = T1.LastName

    FROM #test_table_2 T2

    CROSS APPLY

    (

    SELECT T2.EmailAddress UNION ALL

    SELECT T2.Email1 UNION ALL

    SELECT T2.Email2

    )AS X(EMAIL)

    INNER JOIN #test_table_1 T1

    ON X.EMAIL = T1.EmailAddress;

    INSERT INTO @timer(T_TEXT) VALUES('UNPIVOT');

    INSERT INTO @timer(T_TEXT) VALUES('WHERE EXISTS');

    SELECT

    @VARCHAR_BUCKET_01 = t1.EmailAddress

    ,@VARCHAR_BUCKET_02 = t1.FirstName

    ,@VARCHAR_BUCKET_03 = t1.LastName

    FROM #test_table_1 AS t1

    WHERE EXISTS

    (SELECT *

    FROM #test_table_2 AS t2

    WHERE t1.EmailAddress IN (t2.EmailAddress, t2.Email1, t2.Email2));

    INSERT INTO @timer(T_TEXT) VALUES('WHERE EXISTS');

    SELECT

    T.T_TEXT

    ,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

    FROM @timer T

    GROUP BY T.T_TEXT

    ORDER BY DURATION;

    Results (2nd gen i5 laptop)

    T_TEXT DURATION

    -------------------- -----------

    UNPIVOT 53003

    WHERE EXISTS 4352242

  • I'm pretty sure I already mentioned that the performance will be dramatic.

    Also, if the table grows significantly I would index all three columns. I expect a plan with three seeks per row of the first table in that case. Still not well-performing, but that is almost always impossible to achieve when you have to work with unnormalized data.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Thank Eirikur Eiriksson & Hugo & everyone else..appreciate you're help. Performance at this stage is a non issue.

    Hugo, I got Incorrect syntax near ';' when I tried:

    SELECT t1.EmailAdress, t1.FirstName, t1.LastName

    FROM #test_table1 AS t1

    WHERE EXISTS

    (SELECT *

    FROM #test_table2 AS t2

    WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

  • Hugo Kornelis (1/10/2016)


    I'm pretty sure I already mentioned that the performance will be dramatic.

    Also, if the table grows significantly I would index all three columns. I expect a plan with three seeks per row of the first table in that case. Still not well-performing, but that is almost always impossible to achieve when you have to work with unnormalized data.

    You did mention the poor performance but I felt the warning wasn't strong enough.;-)

    😎

    The unpivot method on a set of 1,000,000 rows with a match ratio of 1:2 returned in less than 5 seconds, I killed the WHERE EXISTS after it had been running for 15 minutes.

    Adding a separate index on all three columns is unlikely to help as the server will most likely use a full scan on the second table and a lazy spool into the outer part of a nested loop join to the first table, as single compound index of all three columns is more likely to be used but doesn't help the performance though.

  • VegasL (1/10/2016)


    Thank Hugo & everyone else..appreciate you're help. Performance at this stage is a non issue.

    Hugo, I got Incorrect syntax near ';' when I tried:

    SELECT t1.EmailAdress, t1.FirstName, t1.LastName

    FROM #test_table1 AS t1

    WHERE EXISTS

    (SELECT *

    FROM #test_table2 AS t2

    WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

    DON'T USE THIS METHOD!

    😎

    Why use code that can and potentially will cause problems when you have the opportunity to do things properly in the first place? Just does not make sense.

  • VegasL (1/10/2016)


    Hugo, I got Incorrect syntax near ';' when I tried:

    The syntax error is a missing closing parenthesis.

    However, Eirikur is right: his method performs better (or probably I should say less bad) then the one I posted. I just did a few tests based on his feedback and test code and even adding the indexes I hoped would help did not help. At all.

    I posted my version because I think that that code is clearer and easier to understand and maintain, but I must now ask you to please only use it if you are very sure that the amount of data will never be more than a few handful of rows. If there is even the remotest chance that the tables will grow to a significant size, then please use the code Eirikur posted.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Piling on, if for some reasons you don't want to use the unpivoting method then using three separate WHERE IN clauses is a much better option, it will use indices on the three columns and it will only scan them once. Although it's slower than the unpivot by the factor of 2, it's still much much faster than the WHERE EXISTS method.

    😎

    SELECT

    t1.EmailAddress

    ,t1.FirstName

    ,t1.LastName

    FROM #test_table_1 AS t1

    WHERE EmailAddress IN

    (SELECT t2.EmailAddress

    FROM #test_table_2 AS t2)

    OR EmailAddress IN

    (SELECT t2.Email1

    FROM #test_table_2 AS t2)

    OR EmailAddress IN

    (SELECT t2.Email2

    FROM #test_table_2 AS t2);

  • Thanks Hugo & Eirikur Eiriksson again. The query below seems a bit easier to understand. Is there anyway to join tables T1 with T2, where EmailAddress from T1 is joined to either t2.emailaddress,t2.email1, or t2.email3 ? I want to be able to at same time put values in initial select query, what has values from t2, using the where in /or subquery logic you have below.

    SELECT

    t1.EmailAddress

    ,t1.FirstName

    ,t1.LastName

    FROM #test_table_1 AS t1

    WHERE EmailAddress IN

    (SELECT t2.EmailAddress

    FROM #test_table_2 AS t2)

    OR EmailAddress IN

    (SELECT t2.Email1

    FROM #test_table_2 AS t2)

    OR EmailAddress IN

    (SELECT t2.Email2

    FROM #test_table_2 AS t2);

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply