Where is it best to place "where conditions" in the join or inthe final where statement

Question

Post reply

Where is it best to place "where conditions" in the join or inthe final where statement

firthr

SSCommitted

Points: 1606
More actions
April 30, 2004 at 6:46 am

#161469

Is it best practice to ensure filtering is done at the time of an inner join or at the end with a filter in the where clause. Which would mean less data needs to be checked, and filtered and therefor is more efficient would you say. e..g
Select * from x
inner join y on x.id=y.id
and y.age > 20
OR
Select * from x
inner join y on x.id=y.id
where y.age > 20.
I'm thinking the inner join clause is best, but then my initial stats suggest the where is quicker.

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

SteveB-74825 Hall of Fame Points: 3290 More actions · Answer 1

i guess it might depend on your data somewhat

But. The where clause is where sql server would expect to find it.

I've always done joins on the values of one column and restriced the results elsewhere but that is a personal thing - don't know it's any slower or faster -

cheers

dbgeezer

zach_john SSCertifiable Points: 5922 More actions · Answer 2

Try this in Northwind and then look at the execuion plan.

Select * from "Order details" od

inner join Orders o

on od.orderid = o.orderid

and o.orderdate > '5-01-98'

go

Select * from "order Details" od

inner join orders o

on od.orderid=o.orderid where o.orderdate > '5-01-98'

John Zacharkan

SteveB-74825 Hall of Fame Points: 3290 More actions · Answer 3

i'd still stick with the where clause for readability

cheers

dbgeezer

firthr SSCommitted Points: 1606 More actions · Answer 4

aaah, see what you mean by the execution plans.. they're the same, and the same for my own variations in code too.

I sort of like a bit of both, keeping the conditions with the inner joins so that it clearly identifies the table restrictions and in the where section when it can change for each run.

thanks very much

rob.

Peter P SSCrazy Points: 2777 More actions · Answer 5

I think the optimizer in SQL Server resolves a lot of these types of issues too.

gljjr SSCrazy Eights Points: 9970 More actions · Answer 6

For inner joins I don't think it really matters as noted by the optimizer doing the same thing. However if you go to an outer join many times it HAS to be part of the join clause! Consider the following...

SELECT *

FROM [Order Details] od

LEFT JOIN Orders o

ON od.OrderID = o.OrderID

AND o.OrderDate > '5-01-98'

AND o.ShippedDate IS NOT NULL

SELECT *

FROM [Order Details] od

LEFT JOIN Orders o

ON od.OrderID = o.OrderID

AND o.OrderDate > '5-01-98'

WHERE o.ShippedDate IS NOT NULL

Gary Johnson
Microsoft Natural Language Group
DBA, Sr. DB Engineer

This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.

Jeff Moden SSC Guru Points: 1003875 More actions · Answer 7

If you are using an outer join as a replacement for a "WHERE NOT IN", then the filter must be in the WHERE clause as in Gary's second example above.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003875 More actions · Answer 8

Thought I'd clarify my statement about using an Outer Join as a replacement for a "Where Not In"...

About outer joins... I'm sure you've heard all sorts of horror stories about how outer joins return everything and are evil, etc. Those stories are true when told by those that don't know how to use them.

One of the best uses for outer joins is as a replacement for "NOT IN". Hint: anytime you use NOT IN or <>, you have just defeated the optimizer in SQL Server. So, the general idea is to keep everything in a positive manner as in "IN" (still a bit slow) or "=" (nasty fast!).

To demo, I created the following 2 tables:

CREATE TABLE [dbo].[TestJoin1] (

[ID] [int] IDENTITY (1, 1) NOT NULL ,

[AColumn] [varchar] (10) NULL )

CREATE TABLE [dbo].[TestJoin2] (

[ID] [int] IDENTITY (1, 1) NOT NULL ,

[AnotherCol] [varchar] (10) NULL )

TestJoin1 contains the following information:

ID AColumn

----------- ----------

1 A

2 B

3 C

4 D

5 E

6 F

(6 row(s) affected)

TestJoin2 contains the following information:

ID AnotherCol

----------- ----------

2 B

4 D

5 E

(3 row(s) affected)

The problem, for discussion purposes is to select all IDs from TestJoin1 that are NOT IN TestJoin2. Most people would write the SQL literally and kill themselves on speed when using big tables like this:

SELECT *

FROM TestJoin1

WHERE ID NOT IN (SELECT ID FROM TestJoin2)

and the results would be:

ID AColumn

----------- ----------

1 A

3 C

6 F

(3 row(s) affected)

Again, the problem is to NOT USE "NOT IN" because "NOT IN" cannot be optimized.

First, let's just try a simple LEFT OUTER JOIN where TestJoin1 is the "LEFT" table:

SELECT j1.*,j2.*

FROM TestJoin1 j1

LEFT OUTER JOIN TestJoin2 j2

ON j1.ID = j2.ID

and the results would not be anything like we wanted but do notice that, even though there wasn't a match in 3 of the records, ALL of the records from the LEFT table (TestJoin1 in this case) were returned (see below). That's what an outer join does. To identify which table to do this with, LEFT or RIGHT are added to the OUTER JOIN statement. Which table is the LEFT table? The one to the LEFT (or above, in this case) of the LEFT OUTER JOIN clause. The other table to the right of the outer join is the RIGHT table.

ID AColumn ID AnotherCol

----------- ---------- ----------- ----------

1 A NULL NULL

2 B 2 B

3 C NULL NULL

4 D 4 D

5 E 5 E

6 F NULL NULL

(6 row(s) affected)

Notice the pattern? Again, the problem is to select everything from TestJoin1 where the ID is "NOT IN" TestJoin2. Notice that where that is true, there are NULLs in the return set (right 2 columns are from TestJoin2) above? So, all we have to do is add a filter in a WHERE clause and we have exactly what (with extra fields for demo) we want using only "positive" logic in the form of "=" and "IS" like this:

SELECT j1.*,j2.*

FROM TestJoin1 j1

LEFT OUTER JOIN TestJoin2 j2

ON j1.ID = j2.ID

WHERE j2.ID IS NULL

and the results look like this...

ID AColumn ID AnotherCol

----------- ---------- ----------- ----------

1 A NULL NULL

3 C NULL NULL

6 F NULL NULL

(3 row(s) affected)

...and, if we remove all reference to TestJoin2 from the SELECT clause field list like this:

SELECT j1.*

FROM TestJoin1 j1

LEFT OUTER JOIN TestJoin2 j2

ON j1.ID = j2.ID

WHERE j2.ID IS NULL

...we get exactly what we wanted... All of the fields for all of the records in TestJoin1 whose ID is NOT IN TestJoin2...

ID AColumn

----------- ----------

1 A

3 C

6 F

(3 row(s) affected)

...And will run nasty fast even on big tables!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SteveB-74825 Hall of Fame Points: 3290 More actions · Answer 9

is a left outer join faster than NOT EXISTS then ?

cheers

dbgeezer

David Burrows SSC Guru Points: 65138 More actions · Answer 10

is a left outer join faster than NOT EXISTS then ?

Generally yes but not always. I depends on the table definitions, PK's indexes and so forth.

I have done tests before and found with some tables the plan is the same and the time the same but others NOT EXISTS is 3 times faster.

What I do is try different queries and see which has the best plan / execution time (and make sure you do the tests with different data volume, the number of rows can affect query speed drastically).

Far away is close at hand in the images of elsewhere.
Anon.

gljjr SSCrazy Eights Points: 9970 More actions · Answer 11

I've found the same as David. It really depends on the data and how much of it you have!

BTW: My query above was NOT about a Not Exists query but to show that you need to structure your outer joins carefully to get the correct data. In the query I posted above the ShippedDate allows Null values. The same thing would have happened if I had used an exact date.

SELECT *

FROM [Order Details] od

LEFT JOIN Orders o

ON od.OrderID = o.OrderID

AND o.OrderDate > '5-01-98'

AND o.ShippedDate = CONVERT(datetime,'1998-05-06')

SELECT *

FROM [Order Details] od

LEFT JOIN Orders o

ON od.OrderID = o.OrderID

AND o.OrderDate > '5-01-98'

WHERE o.ShippedDate = CONVERT(datetime,'1998-05-06')

Gary Johnson
Microsoft Natural Language Group
DBA, Sr. DB Engineer

This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.

scottlsimpkins Old Hand Points: 338 More actions · Answer 12

I have the same question, but am using a "NOT EXIST" functionality. We are experience tremondous slowness running this query. Do you have any ideas on how I can increase query time?

select *

from rebate rb

where unit > ' '

and file_dte = '2004-05-02'

and sys_nam like '%odsbeta%'

and incentiverevcode = 1

and not exists (select 'x'

from rebate rb2

where rb.[group] = rb2.[group]

and rb.unit = rb2.unit

and rb.enterprisecode = rb2.enterprisecode

and rb.programcode = rb2.programcode

and rb2.incentiverevcode = 0

and rb2.sys_nam not like '%odsbeta%')

Jeff Moden SSC Guru Points: 1003875 More actions · Answer 13

It, of course, depends on the SELECT statement in the NOT EXISTS but, I'd have to say "Yes", in most cases (was going to say ALWAYS but, sure-as-shootin', somebody would find the exception )

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Kenneth Wilhelmsson SSC-Dedicated Points: 30043 More actions · Answer 14

You have two major performance killers in this query.

in the outer query:

and sys_nam like '%odsbeta%'

By using a leading wildcard like this, you are effectively forcing a table scan on table rebate

in the subselect:

and rb2.sys_nam not like '%odsbeta%'

Here we have both a leading wildcard and a negative (not like), both of these constructs also forces a scan of rebate.

To make matters even worse, the subselect is part of a NOT EXISTS clause.

This means that table rebate will be scanned from top to bottom once for each row returned by the outer query (select * from rebate...)

As you can imagine, this will probably quite quickly lead to degrading performance.

To fix, try to avoid leading wildcards at any cost.

Also avoid negatives - ie NOT IN and such.

NOT EXISTS is ok - it's not a negative predicate in the same sense as NOT IN, it's a boolean expression resolving to either true or false - the trick for NOT EXISTS is to make it resolve as quickly as possible. This means that the subselect has to be as efficient as possible.

To sum up, your main problem is not the NOT EXISTS clause per se, it's in the outer query and in the resolver of NOT EXISTS..

..makes any sense?

=;o)

/Kenneth