WHERE...IN vs JOIN

  • Hi,

    I'm creating a stored procedure that accepts a comma-delimited string of record ids (e.g. '1,2,3') as a parameter.

    I'm considering two approaches: 1. using a WHERE...IN clause, and 2. using a JOIN. I've illustrated both approaches below. @ProductTypeIDs is the input parameter of type nvarchar containing the ids (i.e. '1,2,3'), and fn_IntListToTbl is a function that unpacks the string into a table.

    Example 1. WHERE...IN

    SELECT
        ProductID,
        ProductName,
        ProductTypeID,
        ProductDescription
    FROM Products
    WHERE ProductTypeID IN (SELECT number FROM fn_IntListToTbl(@ProductTypeIDs))

    Example 2. JOIN

    SELECT
        ProductID,
        ProductName,
        ProductTypeID,
        ProductDescription
    FROM Products
    INNER JOIN fn_IntListToTbl(@ProductTypeIDs) i ON ProductTypeID = i.number

    I'm seeking advice on which of the two approaches is likely to be best (in terms of performance and being least error prone). I know that may be a bit subjective, but I'm wondering if there are hard and fast rules or gotchas I should be aware of (e.g. is the WHERE...IN likely to degrade as the size of the string grows?). I did some tests on which performed best in terms of execution time (caches etc. were cleared before each iteration). In my tests the WHERE...IN was faster by an average of 6% over the JOIN.
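    For completeness, a third formulation often benchmarked alongside IN and JOIN is EXISTS. This is a sketch only, using the same tables and splitter function as Examples 1 and 2:

    ```sql
    -- Example 1b. WHERE EXISTS - semantically equivalent to the IN version,
    -- and often compiled to the same plan (a semi-join) by the optimizer.
    SELECT ProductID,
           ProductName,
           ProductTypeID,
           ProductDescription
    FROM   Products p
    WHERE  EXISTS (SELECT 1
                   FROM   fn_IntListToTbl(@ProductTypeIDs) i
                   WHERE  i.number = p.ProductTypeID)
    ```

    One practical difference worth noting: IN and EXISTS can never duplicate Products rows, whereas the JOIN version would return duplicates if the splitter ever emitted the same id twice.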

    In the final version of my stored procedure I'm going to have to return data from more than one table. I did a couple more tests using the faster of the approaches tested above (i.e. WHERE...IN), which I've illustrated below. This time the Products table is joined to the ProductTypes table using the key ProductTypeID. In Example 3 the WHERE...IN clause is executed against the larger Products table (i.e. the table containing the ProductTypeID FK), whereas in Example 4 it's executed against the smaller ProductTypes table (i.e. the table containing the ProductTypeID PK).

    Example 3.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    WHERE Products.ProductTypeID IN (SELECT number FROM fn_IDS_IntListToTbl(@ProductTypeIDs))

    Example 4.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    WHERE ProductTypes.ProductTypeID IN (SELECT number FROM fn_IDS_IntListToTbl(@ProductTypeIDs))

    In my tests Example 3 was faster by an average of 50% over Example 4 - a big difference. Is there an obvious explanation why this is the case?

    Finally I compared the situation using JOINs. In Example 5 the JOIN is to the FK field in the larger Products table whereas in Example 6 it's to the PK field in the smaller ProductTypes table.

    Example 5.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    INNER JOIN fn_IDS_IntListToTbl(@ProductTypeIDs) i ON Products.ProductTypeID = i.number

    Example 6.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    INNER JOIN fn_IDS_IntListToTbl(@ProductTypeIDs) i ON ProductTypes.ProductTypeID = i.number

    In these tests Example 6 was faster by an average of 12% over Example 5. However, both methods were considerably slower than Example 3 (>38%). The final standings in terms of speed between the 4 (equivalent?) methods were: Example 3 (fastest), Example 6, Example 5, Example 4 (slowest).

    I'm a bit confused. I started off thinking that the WHERE...IN method was quickest, with Example 3 being the fastest, but then Example 4 was also the slowest. Example 6 suggests it's better to join against the smaller PK table (when compared to the larger FK one in Example 5), but in my mind this conflicts with the previous result, where executing the WHERE...IN clause against the smaller table (Example 4) was the slowest.

    Just wondering if anybody could provide any insight/tips?

    Thanks

  • Just on the IN and the Join

    http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/

    As for the rest, look at the exec plan and see what's happening.
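    A concrete way to make that comparison (a sketch; SET STATISTICS IO and SET STATISTICS TIME are standard SQL Server session options, and the query shown is Example 1 from the original post):

    ```sql
    -- Emit per-table I/O counts and timing for each query that follows,
    -- alongside the actual execution plan (Ctrl+M in Management Studio).
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT ProductID, ProductName, ProductTypeID, ProductDescription
    FROM   Products
    WHERE  ProductTypeID IN (SELECT number FROM fn_IntListToTbl(@ProductTypeIDs));

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    ```

    Comparing logical reads between candidate queries is usually more reliable than comparing elapsed times, since reads are unaffected by cache warmth.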

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I tend to avoid comma-delimited lists for parameters these days, but when I do use them, I often find it useful to insert them into a temp table as the PK, then join to that.
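    That pattern might look something like this inside the procedure (a sketch, assuming a splitter function like the OP's fn_IntListToTbl):

    ```sql
    -- Materialize the unpacked ids into a temp table with a primary key,
    -- so the optimizer gets a real, indexed table with statistics to join against.
    CREATE TABLE #Ids (number INT NOT NULL PRIMARY KEY);

    INSERT INTO #Ids (number)
    SELECT DISTINCT number
    FROM   dbo.fn_IntListToTbl(@ProductTypeIDs);

    SELECT p.ProductID, p.ProductName, p.ProductTypeID, p.ProductDescription
    FROM   Products p
    INNER JOIN #Ids i ON p.ProductTypeID = i.number;
    ```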

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi Gail,

    Thanks for the reply (much appreciated and sorry for the slight delay in getting back to you - I've only just now had a chance to return to this).

    Thanks for the pointer to the article - it looks like a more comprehensive examination than mine so I'll certainly look at it in more detail.

    Regarding gleaning stuff from the execution plan I'll have to check out what I should be looking for.

    Thanks again

  • Hi GSquared,

    Again, thanks for the reply.

    I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I have used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).

    Thanks

  • Makes sense.

    - Gus "GSquared"

  • landingnormand (12/14/2010)


    Hi GSquared,

    Again, thanks for the reply.

    I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I have used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).

    Thanks

    Just for clarification, is the function inserting into a table declared inside the function? If so, that is different from populating a table (variable or temp) from the results of the function, and thus will probably not perform very well.
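    To illustrate the distinction, here are hypothetical minimal versions of each style (the function names are invented for illustration; the inline version assumes a numbers table such as the dbo.Tally built later in this thread):

    ```sql
    -- Multi-statement TVF: populates a table variable declared inside the
    -- function. The optimizer treats it as a black box with no statistics,
    -- so joins against it often perform poorly.
    CREATE FUNCTION dbo.fn_OneToN_Multi (@n INT)
    RETURNS @out TABLE (number INT NOT NULL) AS
    BEGIN
        DECLARE @i INT;
        SET @i = 1;
        WHILE @i <= @n
        BEGIN
            INSERT INTO @out (number) VALUES (@i);
            SET @i = @i + 1;
        END;
        RETURN;
    END
    GO

    -- Inline TVF: a single SELECT that the optimizer can expand into the
    -- calling query, much like a parameterized view.
    CREATE FUNCTION dbo.fn_OneToN_Inline (@n INT)
    RETURNS TABLE AS
    RETURN (SELECT N AS number FROM dbo.Tally WHERE N <= @n);
    GO
    ```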

  • Lamprey13 (12/15/2010)


    landingnormand (12/14/2010)


    Hi GSquared,

    Again, thanks for the reply.

    I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I have used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).

    Thanks

    Just for clarification, is the function inserting into a table declared inside the function? If so, that is different from populating a table (variable or temp) from the results of the function, and thus will probably not perform very well.

    I'm pretty sure you will find the performance issue is caused by the function in the WHERE clause. The function is evaluated for EVERY record created in the JOIN, not the SELECT.

    In real terms, any query is internally processed in this order:

    1. The FROM clause

    2. The WHERE clause - and this includes your funky function 🙂

    3. GROUP BY

    4. HAVING

    5. The SELECT clause - finally! Heaven forbid you should have the same function in here, because it will be evaluated all over again.

    6. And lastly, the ORDER BY clause.

    Source: Inside Microsoft SQL Server 2005: T-SQL Querying by Itzik Ben-Gan

    Chapter 1 will open your eyes to the 'real' way queries are processed. In SQL 2005 they introduced two excellent constructs that reduce the need for, and the performance overhead of, embedded functions: the APPLY keyword and Common Table Expressions (CTEs). I would suggest you read up on these and then review the code you posted.

    Come into the light and enjoy the joys of set-based SQL...
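    As a rough sketch of what those two constructs look like (the CTE reuses the OP's tables and splitter; the TVF in the APPLY example is hypothetical):

    ```sql
    -- CTE: name the unpacked id list once, then reference it readably.
    WITH Ids AS
    (
        SELECT number FROM dbo.fn_IntListToTbl(@ProductTypeIDs)
    )
    SELECT p.ProductID, p.ProductName
    FROM   Products p
    INNER JOIN Ids i ON p.ProductTypeID = i.number;

    -- CROSS APPLY: invoke a table-valued function once per outer row.
    SELECT pt.ProductTypeID, x.ProductID
    FROM   ProductTypes pt
    CROSS APPLY dbo.fn_TopProductsForType(pt.ProductTypeID) x; -- hypothetical TVF
    ```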

    CodeOn

    😛

    PEBCAK!

  • I know this is an older post but could you post the fn_IntListToTbl, please? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I read through this, and the first thing I thought of was using fn_Split to break up the list; I wonder if his fn_IntListToTbl is some derivative of that.

    I also appreciated seeing the processing order of a query. I was working on a project a few months ago where one of the JOINs in the view was tied to a constant. I had concluded that it only acted as a WHERE clause; now I can understand the thinking behind it.

    Director of Transmogrification Services
  • Mad Myche (12/20/2010)


    I read through this, and the first thing I thought of was using fn_Split to break up the list; I wonder if his fn_IntListToTbl is some derivative of that.

    I also appreciated seeing the processing order of a query. I was working on a project a few months ago where one of the JOINs in the view was tied to a constant. I had concluded that it only acted as a WHERE clause; now I can understand the thinking behind it.

    Just to be sure, fn_Split isn't a SQL Server provided function. If it has either a recursive CTE or a WHILE loop in it, you should post it so we can show how to get around the performance problems associated with such things in a split function.

    --Jeff Moden

    I believe the function we are using is the one described in the MSDN article SQL 2000: Building the fn_Split() Function, but due to the rules here I can't paste it. I have seen the code listed identically on other DB sites; I just don't know what the linking rules are here. It is WHILE-loop based, obviously, as we are still a 2000 shop.

    I revisited the JOIN vs WHERE question and did see a performance increase using the JOIN to limit the records in the second query here:

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId
    INNER JOIN ProductCats r ON c.ParentId = r.CatId
    WHERE r.CatId = 4
    ORDER BY c.Cat, p.CartDesc

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId AND 4 = c.ParentID
    ORDER BY c.Cat, p.CartDesc

    Director of Transmogrification Services
  • Mad Myche (12/20/2010)


    I believe the function we are using is the one described in the MSDN article SQL 2000: Building the fn_Split() Function, but due to the rules here I can't paste it. I have seen the code listed identically on other DB sites; I just don't know what the linking rules are here. It is WHILE-loop based, obviously, as we are still a 2000 shop.

    I revisited the JOIN vs WHERE question and did see a performance increase using the JOIN to limit the records in the second query here:

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId
    INNER JOIN ProductCats r ON c.ParentId = r.CatId
    WHERE r.CatId = 4
    ORDER BY c.Cat, p.CartDesc

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId AND 4 = c.ParentID
    ORDER BY c.Cat, p.CartDesc

    Thanks... no need for a WHILE loop, though, even in SQL Server 2000. Take a look at the following article. When I first learned about this technique it opened a whole new world for me, and I tried to "pass it forward" with that article...

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    Several optimizations have been made to the splitter code in that article (which also means the article needs an update... someday). I'll be back with a function that uses the optimizations I speak of, but you really should read the article.

    --Jeff Moden

    Ok... since I don't subscribe to SQL Server Magazine, I can't get to the actual code listings, but I cobbled together a non-RBAR function that uses the Tally table instead of a loop, based on the "hints" left in the link you provided.

    If you haven't read the Tally table article I provided yet, you should. But, to get you "out of the woods" for testing, here's how to build a permanent Tally table in SQL Server 2000. It's very quick...

    --===================================================================
    -- Create a Tally table from 1 to 11000
    --===================================================================
    --===== Create and populate the Tally table on the fly.
         -- The IDENTITY column is created NOT NULL,
         -- so we can put a Primary Key on it
    SELECT TOP 11000
           IDENTITY(INT,1,1) AS N
      INTO dbo.Tally
      FROM Master.dbo.SysColumns sc1
     CROSS JOIN Master.dbo.SysColumns sc2
    ;
    --===== Add a CLUSTERED Primary Key to maximize performance
    ALTER TABLE dbo.Tally
      ADD CONSTRAINT PK_Tally_N
          PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
    ;
    --===== Allow the general public to use it
    GRANT SELECT ON dbo.Tally TO PUBLIC
    ;
    GO

    I don't care for the "Hungarian notation", the column names, or the zero-based IDX used in the article, but just to give you what I think is a direct replacement (so you don't have to change any of your other code), I followed that example. Here's the code for an iTVF like the one in the article that splits up to 7999 characters without a loop, plus a small test so you can verify that it works.

    CREATE FUNCTION dbo.fn_Split
    (
        @sText  VARCHAR(7999), --Missing one because we need to add 1 delimiter
        @sDelim VARCHAR(1)     --Just to keep the same datatype
    )
    RETURNS TABLE AS
    RETURN
    (
        SELECT IDX   = t.N - LEN(REPLACE(LEFT(@sText, t.N), @sDelim, '')),
               Value = SUBSTRING(@sText, t.N, CHARINDEX(@sDelim, @sText + @sDelim, t.N) - t.N)
          FROM dbo.Tally t
         WHERE t.N BETWEEN 1 AND DATALENGTH(@sText) + 1 --DATALENGTH allows trailing spaces to be delimiters
           AND SUBSTRING(@sDelim + @sText, t.N, 1) = @sDelim
    )
    GO

    SELECT * FROM dbo.fn_Split('1,10,100,1000,10000,100000,1000000,A,B,C,D,E,Jeff Moden',',')

    --Jeff Moden

    Thank you for the great information. I created the Tally table (mine took 93,516 ms) and created that function. The performance gain was on the order of ~350%. I then trimmed the Tally table down to 8,000 rows, and that gave me an extra 100% performance.

    Running the two versions back to back placed 82% of the query cost on the loop method and the remaining 18% on the Tally method. This was consistent across parameters varying from 25 characters out to 7,950.

    Director of Transmogrification Services
