Joining 2 Tables with Or condition

Question

Post reply

Joining 2 Tables with Or condition

VegasL

Hall of Fame

Points: 3006
More actions
January 9, 2016 at 6:32 pm

#303621

Hello,
I want to join 2 tables, #test_table_1 with #test_table_2. Then I want to compare EmailAddress field on #test_table_1 with the following fields on #test_table_2, EmailAddress, Email1, and Email2.
The join condition from #test_table_1 should compare against these 3 fields, and if any of the fields match, return only the matching row from #test_table_1
So the result should be:
EmailAddress FirstName LastName
c@b.com Jack Quick
Below is some sample data to assist. Naturally there will be many rows in each table.
create table #test_table_1
(EmailAddress varchar(100),FirstName varchar(100),LastName varchar(100))
create table #test_table_2
(EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100))
insert into #test_table_1
select 'c@b.com' as EmailAddress, 'Jack' as FirstName, 'Quick' as LastName
insert into #test_table_2
select 'a@b.com' as EmailAddress, 'b@b.com' as Email1, 'c@b.com' as Email2
HTH. Thanks in advance..

Viewing 15 posts - 1 through 15 (of 21 total)

You must be logged in to reply to this topic. Login to reply

pietlinden SSC Guru Points: 63421 More actions · Answer 1

create this query:

SELECT EmailAddress, EMail1 AS Mail1

FROM #Test_table_2

UNION ALL

SELECT EmailAddress, EMail2

FROM #Test_table_2

then join that to whatever other table you need.

Normalization is a beautiful thing.

VegasL Hall of Fame Points: 3006 More actions · Answer 2

Sorry that doesn't really help me. Can you put the complete query please?

Data won't be identical in both tables, that's why I'm trying to compare and identify when a match occurs.

pietlinden SSC Guru Points: 63421 More actions · Answer 3

Scratch all that.

You have a basic normalization problem. If you had a single column/field in your table for e-mail address, then it would probably be a trivial question. So you have two options: either fix the table structure, or create a UNION query to return something that looks like a normal table structure. Then you can join your two tables on a single column and you're home free.

Querying tables that aren't normalized is a serious nightmare. I would fix that first.

VegasL Hall of Fame Points: 3006 More actions · Answer 4

Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.

Eirikur Eiriksson SSC Guru Points: 182881 More actions · Answer 5

Quick suggestion, unpivot the second table and join the results to the first table.

😎

create table #test_table_2

(EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100))

INSERT INTO #test_table_1

SELECT 'c@b.com' as EmailAddress, 'Jack' as FirstName, 'Quick' as LastName

insert into #test_table_2

select 'a@b.com' as EmailAddress, 'b@b.com' as Email1, 'c@b.com' as Email2

SELECT

X.CID

,X.EMAIL

,T1.FirstName

,T1.LastName

FROM #test_table_2 T2

CROSS APPLY

(

SELECT 1, T2.EmailAddress UNION ALL

SELECT 2, T2.Email1 UNION ALL

SELECT 3, T2.Email2

)AS X(CID,EMAIL)

INNER JOIN #test_table_1 T1

ON X.EMAIL = T1.EmailAddress;

Output

CID EMAIL FirstName LastName

----- --------- ----------- ---------

3 c@b.com Jack Quick

Hugo Kornelis SSC Guru Points: 64780 More actions · Answer 6

VegasL (1/9/2016)
Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.

If this is coming from some impoprted source, then you can normalize it. Just load the data as is in a staging table, then validate and massage until it is in the right shape. (Preferably load SSIS for this, that tool is much better equipped for the job).

If the database design is fixed and you cannot influence it, then I pity you. For this specific problem you can probably use something liek the below (though I would not expect too much in the performance department);

SELECT t1.EmailAdress, t1.FirstName, t1.LastName

FROM #test_table1 AS t1

WHERE EXISTS

(SELECT *

FROM #test_table2 AS t2

WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

But with tables designed this way, you can expect to keep running into problems that require clumsy workarounds in query, and that wreck performance.

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Eirikur Eiriksson SSC Guru Points: 182881 More actions · Answer 7

Hugo Kornelis (1/10/2016)
VegasL (1/9/2016)
Data comes that way. I can't normalize. I think this is analogous to an index match function from xls, which comepares one column against the other 3 columns then return matching result. Need to find way to do in sql.
If this is coming from some impoprted source, then you can normalize it. Just load the data as is in a staging table, then validate and massage until it is in the right shape. (Preferably load SSIS for this, that tool is much better equipped for the job).
If the database design is fixed and you cannot influence it, then I pity you. For this specific problem you can probably use something liek the below (though I would not expect too much in the performance department);
SELECT t1.EmailAdress, t1.FirstName, t1.LastName
FROM #test_table1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table2 AS t2
WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);
But with tables designed this way, you can expect to keep running into problems that require clumsy workarounds in query, and that wreck performance.

Be very careful here as this will perform very badly, in fact it will scan the second table for every row in the first table. As it works fine with very small number of rows this has the potentials of being a database time bomb!

😎

To demonstrate the difference, here is a simple test harness

USE tempdb;

GO

SET NOCOUNT ON;

IF OBJECT_ID(N'tempdb..#test_table_1') IS NOT NULL DROP TABLE #test_table_1;

create table #test_table_1

(X_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,EmailAddress varchar(100),FirstName varchar(100),LastName varchar(100));

IF OBJECT_ID(N'tempdb..#test_table_2') IS NOT NULL DROP TABLE #test_table_2;

create table #test_table_2

(X_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,EmailAddress varchar(100),Email1 varchar(100),Email2 varchar(100));

DECLARE @SAMPLE_SIZE INT = 10000;

DECLARE @VARIANCE INT = 5000;

;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6)

INSERT INTO #test_table_1 (EmailAddress,FirstName,LastName)

SELECT

CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0)

FROM NUMS NM

;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

, NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS N

FROM T T1,T T2,T T3,T T4,T T5,T T6)

insert into #test_table_2 (EmailAddress,Email1,Email2)

SELECT

CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

,CONVERT(VARCHAR(10),ABS(CHECKSUM(NEWID())) % @VARIANCE,0) + '.com'

FROM NUMS NM;

CREATE NONCLUSTERED INDEX NCLIDX_TMP_TEST_1_EMAIL_INCL_FIRST_LAST ON #test_table_1 (EmailAddress ASC) INCLUDE (FirstName,LastName);

CREATE NONCLUSTERED INDEX NCLIDX_TMP_TEST_1_EMAIL_1 ON #test_table_2 (EmailAddress ASC) INCLUDE (Email1,Email2);

DECLARE @timer TABLE (T_TEXT VARCHAR(20) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));

DECLARE @VARCHAR_BUCKET_01 VARCHAR(100) = '';

DECLARE @VARCHAR_BUCKET_02 VARCHAR(100) = '';

DECLARE @VARCHAR_BUCKET_03 VARCHAR(100) = '';

INSERT INTO @timer(T_TEXT) VALUES('UNPIVOT');

SELECT

@VARCHAR_BUCKET_01 = X.EMAIL

,@VARCHAR_BUCKET_02 = T1.FirstName

,@VARCHAR_BUCKET_03 = T1.LastName

FROM #test_table_2 T2

CROSS APPLY

(

SELECT T2.EmailAddress UNION ALL

SELECT T2.Email1 UNION ALL

SELECT T2.Email2

)AS X(EMAIL)

INNER JOIN #test_table_1 T1

ON X.EMAIL = T1.EmailAddress;

INSERT INTO @timer(T_TEXT) VALUES('UNPIVOT');

INSERT INTO @timer(T_TEXT) VALUES('WHERE EXISTS');

SELECT

@VARCHAR_BUCKET_01 = t1.EmailAddress

,@VARCHAR_BUCKET_02 = t1.FirstName

,@VARCHAR_BUCKET_03 = t1.LastName

FROM #test_table_1 AS t1

WHERE EXISTS

(SELECT *

FROM #test_table_2 AS t2

WHERE t1.EmailAddress IN (t2.EmailAddress, t2.Email1, t2.Email2));

INSERT INTO @timer(T_TEXT) VALUES('WHERE EXISTS');

SELECT

T.T_TEXT

,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

FROM @timer T

GROUP BY T.T_TEXT

ORDER BY DURATION;

Results (2nd gen i5 laptop)

T_TEXT DURATION

-------------------- -----------

UNPIVOT 53003

WHERE EXISTS 4352242

Hugo Kornelis SSC Guru Points: 64780 More actions · Answer 8

I'm pretty sure I already mentioned that the performance will be dramatic.

Also, if the table grows significantly I would index all three columns. I expect a plan with three seeks per row of the first table in that case. Still not well-performing, but that is almost always impossible to achieve when you have to work with unnormalized data.

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

VegasL Hall of Fame Points: 3006 More actions · Answer 9

Thank Eirikur Eiriksson & Hugo & everyone else..appreciate you're help. Performance at this stage is a non issue.

Hugo, I got Incorrect syntax near ';' when I tried:

SELECT t1.EmailAdress, t1.FirstName, t1.LastName

FROM #test_table1 AS t1

WHERE EXISTS

(SELECT *

FROM #test_table2 AS t2

WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

Eirikur Eiriksson SSC Guru Points: 182881 More actions · Answer 10

Hugo Kornelis (1/10/2016)
I'm pretty sure I already mentioned that the performance will be dramatic.
Also, if the table grows significantly I would index all three columns. I expect a plan with three seeks per row of the first table in that case. Still not well-performing, but that is almost always impossible to achieve when you have to work with unnormalized data.

You did mention the poor performance but I felt the warning wasn't strong enough.;-)

😎

The unpivot method on a set of 1,000,000 rows with a match ratio of 1:2 returned in less than 5 seconds, I killed the WHERE EXISTS after it had been running for 15 minutes.

Adding a separate index on all three columns is unlikely to help as the server will most likely use a full scan on the second table and a lazy spool into the outer part of a nested loop join to the first table, as single compound index of all three columns is more likely to be used but doesn't help the performance though.

Eirikur Eiriksson SSC Guru Points: 182881 More actions · Answer 11

VegasL (1/10/2016)
Thank Hugo & everyone else..appreciate you're help. Performance at this stage is a non issue.
Hugo, I got Incorrect syntax near ';' when I tried:
SELECT t1.EmailAdress, t1.FirstName, t1.LastName
FROM #test_table1 AS t1
WHERE EXISTS
(SELECT *
FROM #test_table2 AS t2
WHERE t1.EmailAdress IN (t2.EmailAddress, t2.Email1, t2.Email2);

DON'T USE THIS METHOD!

😎

Why use code that can and potentially will cause problems when you have the opportunity to do things properly in the first place? Just does not make sense.

Hugo Kornelis SSC Guru Points: 64780 More actions · Answer 12

VegasL (1/10/2016)
Hugo, I got Incorrect syntax near ';' when I tried:

The syntax error is a missing closing parenthesis.

However, Eirikur is right: his method performs better (or probably I should say less bad) then the one I posted. I just did a few tests based on his feedback and test code and even adding the indexes I hoped would help did not help. At all.

I posted my version because I think that that code is clearer and easier to understand and maintain, but I must now ask you to please only use it if you are very sure that the amount of data will never be more than a few handful of rows. If there is even the remotest chance that the tables will grow to a significant size, then please use the code Eirikur posted.

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

Eirikur Eiriksson SSC Guru Points: 182881 More actions · Answer 13

Piling on, if for some reasons you don't want to use the unpivoting method then using three separate WHERE IN clauses is a much better option, it will use indices on the three columns and it will only scan them once. Although it's slower than the unpivot by the factor of 2, it's still much much faster than the WHERE EXISTS method.

😎

SELECT

t1.EmailAddress

,t1.FirstName

,t1.LastName

FROM #test_table_1 AS t1

WHERE EmailAddress IN

(SELECT t2.EmailAddress

FROM #test_table_2 AS t2)

OR EmailAddress IN

(SELECT t2.Email1

FROM #test_table_2 AS t2)

OR EmailAddress IN

(SELECT t2.Email2

FROM #test_table_2 AS t2);

VegasL Hall of Fame Points: 3006 More actions · Answer 14

Thanks Hugo & Eirikur Eiriksson again. The query below seems a bit easier to understand. Is there anyway to join tables T1 with T2, where EmailAddress from T1 is joined to either t2.emailaddress,t2.email1, or t2.email3 ? I want to be able to at same time put values in initial select query, what has values from t2, using the where in /or subquery logic you have below.

SELECT
t1.EmailAddress
,t1.FirstName
,t1.LastName
FROM #test_table_1 AS t1
WHERE EmailAddress IN
(SELECT t2.EmailAddress
FROM #test_table_2 AS t2)
OR EmailAddress IN
(SELECT t2.Email1
FROM #test_table_2 AS t2)
OR EmailAddress IN
(SELECT t2.Email2
FROM #test_table_2 AS t2);