WHERE...IN vs JOIN

  • Hi,

    I'm creating a stored procedure that accepts a comma-delimited string of record ids (e.g. '1,2,3') as a parameter.

    I'm considering two approaches: 1. using a WHERE...IN clause, and 2. using a JOIN. I've illustrated both approaches below. @ProductTypeIDs is the input parameter of type nvarchar containing the ids (i.e. '1,2,3'), and fn_IntListToTbl is a function that unpacks the string into a table.

    Example 1. WHERE...IN

    SELECT
        ProductID,
        ProductName,
        ProductTypeID,
        ProductDescription
    FROM Products
    WHERE ProductTypeID IN (SELECT number FROM fn_IntListToTbl(@ProductTypeIDs))

    Example 2. JOIN

    SELECT
        ProductID,
        ProductName,
        ProductTypeID,
        ProductDescription
    FROM Products
    INNER JOIN fn_IntListToTbl(@ProductTypeIDs) i ON ProductTypeID = i.number

    I'm seeking advice on which of the two approaches is likely to be best (in terms of performance and being least error prone). I know that may be a bit subjective, but I'm wondering if there are hard and fast rules or gotchas I should be aware of (e.g. is the WHERE...IN likely to degrade as the size of the string grows?). I did some tests on which performed best in terms of execution time (caches etc. were cleared before each iteration). In my tests the WHERE...IN was faster by an average of 6% over the JOIN.
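    For completeness, a third formulation often benchmarked alongside IN and JOIN is EXISTS. This is a sketch only, using the same tables and splitter function as Examples 1 and 2:

    ```sql
    -- Example 1b. WHERE EXISTS - semantically equivalent to the IN version,
    -- and often compiled to the same plan (a semi-join) by the optimizer.
    SELECT ProductID,
           ProductName,
           ProductTypeID,
           ProductDescription
    FROM   Products p
    WHERE  EXISTS (SELECT 1
                   FROM   fn_IntListToTbl(@ProductTypeIDs) i
                   WHERE  i.number = p.ProductTypeID)
    ```

    One practical difference worth noting: IN and EXISTS can never duplicate Products rows, whereas the JOIN version would return duplicates if the splitter ever emitted the same id twice.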

    In the final version of my stored procedure I'm going to have to return data from more than one table. I did a couple more tests using the faster of the approaches tested above (i.e. WHERE...IN), which I've illustrated below. This time the Products table is joined to the ProductTypes table using the key ProductTypeID. In Example 3 the WHERE...IN clause is executed against the larger Products table (i.e. the table containing the ProductTypeID FK), whereas in Example 4 it's executed against the smaller ProductTypes table (i.e. the table containing the ProductTypeID PK).

    Example 3.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    WHERE Products.ProductTypeID IN (SELECT number FROM fn_IDS_IntListToTbl(@ProductTypeIDs))

    Example 4.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    WHERE ProductTypes.ProductTypeID IN (SELECT number FROM fn_IDS_IntListToTbl(@ProductTypeIDs))

    In my tests Example 3 was faster by an average of 50% over Example 4 - a big difference. Is there an obvious explanation why this is the case?

    Finally I compared the situation using JOINs. In Example 5 the JOIN is to the FK field in the larger Products table whereas in Example 6 it's to the PK field in the smaller ProductTypes table.

    Example 5.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    INNER JOIN fn_IDS_IntListToTbl(@ProductTypeIDs) i ON Products.ProductTypeID = i.number

    Example 6.

    SELECT
        Products.ProductID,
        Products.ProductName,
        Products.ProductTypeID,
        Products.ProductDescription
    FROM ProductTypes
    INNER JOIN Products ON ProductTypes.ProductTypeID = Products.ProductTypeID
    INNER JOIN fn_IDS_IntListToTbl(@ProductTypeIDs) i ON ProductTypes.ProductTypeID = i.number

    In these tests Example 6 was faster by an average of 12% over Example 5. However, both methods were considerably slower than Example 3 (>38%). The final standings in terms of speed between the 4 (equivalent?) methods were: Example 3 (fastest), Example 6, Example 5, Example 4 (slowest).

    I'm a bit confused. I started off thinking that the WHERE...IN method was quickest, with Example 3 being the fastest, but then Example 4 was also the slowest. Example 6 suggests it's better to join against the smaller PK table (when compared to the larger FK one in Example 5), but in my mind this conflicts with the previous result, where executing the WHERE...IN clause against the smaller table (Example 4) was the slowest.

    Just wondering if anybody could provide any insight/tips?

    Thanks

  • Just on the IN and the Join

    http://sqlinthewild.co.za/index.php/2010/04/27/in-exists-and-join-a-roundup/

    As for the rest, look at the exec plan and see what's happening.
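    A concrete way to make that comparison (a sketch; SET STATISTICS IO and SET STATISTICS TIME are standard SQL Server session options, and the query shown is Example 1 from the original post):

    ```sql
    -- Emit per-table I/O counts and timing for each query that follows,
    -- alongside the actual execution plan (Ctrl+M in Management Studio).
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    SELECT ProductID, ProductName, ProductTypeID, ProductDescription
    FROM   Products
    WHERE  ProductTypeID IN (SELECT number FROM fn_IntListToTbl(@ProductTypeIDs));

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;
    ```

    Comparing logical reads between candidate queries is usually more reliable than comparing elapsed times, since reads are unaffected by cache warmth.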

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I tend to avoid comma-delimited lists for parameters these days, but when I do use them, I often find it useful to insert them into a temp table as the PK, then join to that.
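    That pattern might look something like this inside the procedure (a sketch, assuming a splitter function like the OP's fn_IntListToTbl):

    ```sql
    -- Materialize the unpacked ids into a temp table with a primary key,
    -- so the optimizer gets a real, indexed table with statistics to join against.
    CREATE TABLE #Ids (number INT NOT NULL PRIMARY KEY);

    INSERT INTO #Ids (number)
    SELECT DISTINCT number
    FROM   dbo.fn_IntListToTbl(@ProductTypeIDs);

    SELECT p.ProductID, p.ProductName, p.ProductTypeID, p.ProductDescription
    FROM   Products p
    INNER JOIN #Ids i ON p.ProductTypeID = i.number;
    ```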

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Hi Gail,

    Thanks for the reply (much appreciated and sorry for the slight delay in getting back to you - I've only just now had a chance to return to this).

    Thanks for the pointer to the article - it looks like a more comprehensive examination than mine so I'll certainly look at it in more detail.

    Regarding gleaning stuff from the execution plan I'll have to check out what I should be looking for.

    Thanks again

  • Hi GSquared,

    Again, thanks for the reply.

    I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I have used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).

    Thanks

  • Makes sense.

    - Gus "GSquared"

  • landingnormand (12/14/2010)


    Hi GSquared,

    Again, thanks for the reply.

    I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I have used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).

    Thanks

    Just for clarification, is the function inserting into a table declared inside the function? If so, that is different from populating a table (variable or temp) from the results of the function, and thus will probably not perform very well.
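    To illustrate the distinction, here are hypothetical minimal versions of each style (the function names are invented for illustration; the inline version assumes a numbers table such as the dbo.Tally built later in this thread):

    ```sql
    -- Multi-statement TVF: populates a table variable declared inside the
    -- function. The optimizer treats it as a black box with no statistics,
    -- so joins against it often perform poorly.
    CREATE FUNCTION dbo.fn_OneToN_Multi (@n INT)
    RETURNS @out TABLE (number INT NOT NULL) AS
    BEGIN
        DECLARE @i INT;
        SET @i = 1;
        WHILE @i <= @n
        BEGIN
            INSERT INTO @out (number) VALUES (@i);
            SET @i = @i + 1;
        END;
        RETURN;
    END
    GO

    -- Inline TVF: a single SELECT that the optimizer can expand into the
    -- calling query, much like a parameterized view.
    CREATE FUNCTION dbo.fn_OneToN_Inline (@n INT)
    RETURNS TABLE AS
    RETURN (SELECT N AS number FROM dbo.Tally WHERE N <= @n);
    GO
    ```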

  • Lamprey13 (12/15/2010)


    landingnormand (12/14/2010)


    Hi GSquared,

    Again, thanks for the reply.

    I know comma-delimited lists for parameters aren't ideal, but I don't know of an alternative (in 2005). The function I have used in my example does as you suggest (i.e. it inserts each value into a table with a single column called number, against which I can perform joins).

    Thanks

    Just for clarification, is the function inserting into a table declared inside the function? If so, that is different from populating a table (variable or temp) from the results of the function, and thus will probably not perform very well.

    I'm pretty sure you will find the performance issue is caused by the function in the WHERE clause. The function is evaluated for EVERY record created in the JOIN, not the SELECT.

    In real terms, any query is internally processed in this order:

    1. The FROM clause

    2. The WHERE clause - and this includes your funky function 🙂

    3. GROUP BY

    4. HAVING

    5. The SELECT clause - finally! Heaven forbid you should have the same function in here, because it will be evaluated all over again.

    6. And lastly, the ORDER BY clause.

    Source: Inside Microsoft SQL Server 2005: T-SQL Querying by Itzik Ben-Gan

    Chapter 1 will open your eyes to the 'real' way queries are processed. In SQL 2005 they introduced two excellent constructs that reduce the need for, and the performance overhead of, embedded functions: the APPLY keyword and Common Table Expressions (CTEs). I would suggest you read up on these and then review the code you posted.

    Come into the light and enjoy the joys of set-based SQL...
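    As a rough sketch of what those two constructs look like (the CTE reuses the OP's tables and splitter; the TVF in the APPLY example is hypothetical):

    ```sql
    -- CTE: name the unpacked id list once, then reference it readably.
    WITH Ids AS
    (
        SELECT number FROM dbo.fn_IntListToTbl(@ProductTypeIDs)
    )
    SELECT p.ProductID, p.ProductName
    FROM   Products p
    INNER JOIN Ids i ON p.ProductTypeID = i.number;

    -- CROSS APPLY: invoke a table-valued function once per outer row.
    SELECT pt.ProductTypeID, x.ProductID
    FROM   ProductTypes pt
    CROSS APPLY dbo.fn_TopProductsForType(pt.ProductTypeID) x; -- hypothetical TVF
    ```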

    CodeOn

    😛

    PEBCAK!

  • I know this is an older post but could you post the fn_IntListToTbl, please? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I read through this, and the first thing I thought of was using fn_Split to break up the list; I wonder if his fn_IntListToTbl is some derivative of that.

    I also appreciated seeing the processing order of a query. I was working on a project a few months ago where one of the JOINs in the view was tied to a constant. I had concluded that it only acted as a WHERE clause; now I can understand the thinking behind it.

    Director of Transmogrification Services
  • Mad Myche (12/20/2010)


    I read through this, and the first thing I thought of was using fn_Split to break up the list; I wonder if his fn_IntListToTbl is some derivative of that.

    I also appreciated seeing the processing order of a query. I was working on a project a few months ago where one of the JOINs in the view was tied to a constant. I had concluded that it only acted as a WHERE clause; now I can understand the thinking behind it.

    Just to be sure, fn_Split isn't a SQL Server provided function. If it has either a recursive CTE or a WHILE loop in it, you should post it so we can show how to get around the performance problems associated with such things in a split function.

    --Jeff Moden

    I believe the function we are using is the one described in the MSDN article SQL 2000: Building the fn_Split() Function, but due to the rules here I can't paste it. I have seen the code listed identically on other DB sites; I just don't know what the linking rules are here. It is WHILE-loop based, obviously, as we are still a 2000 shop.

    I revisited the JOIN vs WHERE question and did see a performance increase using the JOIN to limit the records in the second query here:

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId
    INNER JOIN ProductCats r ON c.ParentId = r.CatId
    WHERE r.CatId = 4
    ORDER BY c.Cat, p.CartDesc

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId AND 4 = c.ParentID
    ORDER BY c.Cat, p.CartDesc

    Director of Transmogrification Services
  • Mad Myche (12/20/2010)


    I believe the function we are using is the one described in the MSDN article SQL 2000: Building the fn_Split() Function, but due to the rules here I can't paste it. I have seen the code listed identically on other DB sites; I just don't know what the linking rules are here. It is WHILE-loop based, obviously, as we are still a 2000 shop.

    I revisited the JOIN vs WHERE question and did see a performance increase using the JOIN to limit the records in the second query here:

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId
    INNER JOIN ProductCats r ON c.ParentId = r.CatId
    WHERE r.CatId = 4
    ORDER BY c.Cat, p.CartDesc

    SELECT c.Cat, p.CartDesc
    FROM Products p
    INNER JOIN Products2Categories x ON p.ProductId = x.ptcProdId
    INNER JOIN ProductCats c ON x.ptcCatId = c.CatId AND 4 = c.ParentID
    ORDER BY c.Cat, p.CartDesc

    Thanks... no need for a WHILE loop, though, even in SQL Server 2000. Take a look at the following article. When I first learned about this technique it opened a whole new world for me, and I tried to "pass it forward" with that article...

    http://www.sqlservercentral.com/articles/T-SQL/62867/

    Several optimizations have been made to the splitter code in that article (which also means the article needs an update... someday). I'll be back with a function that uses the optimizations I speak of, but you really should read the article.

    --Jeff Moden

    Ok... since I don't subscribe to SQL Server Magazine, I can't get to the actual code listings, but I cobbled together a non-RBAR function that uses the Tally table instead of a loop, based on the "hints" left in the link you provided.

    If you haven't read the Tally table article I provided yet, you should. But, to get you "out of the woods" for testing, here's how to build a permanent Tally table in SQL Server 2000. It's very quick...

    --===================================================================
    -- Create a Tally table from 1 to 11000
    --===================================================================
    --===== Create and populate the Tally table on the fly.
         -- The IDENTITY column is created NOT NULL,
         -- so we can put a Primary Key on it
    SELECT TOP 11000
           IDENTITY(INT,1,1) AS N
      INTO dbo.Tally
      FROM Master.dbo.SysColumns sc1
     CROSS JOIN Master.dbo.SysColumns sc2
    ;
    --===== Add a CLUSTERED Primary Key to maximize performance
    ALTER TABLE dbo.Tally
      ADD CONSTRAINT PK_Tally_N
          PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
    ;
    --===== Allow the general public to use it
    GRANT SELECT ON dbo.Tally TO PUBLIC
    ;
    GO

    I don't care for the "Hungarian notation", the column names, or the zero-based IDX used in the article, but just to give you what I think is a direct replacement (so you don't have to change any of your other code), I followed that example. Here's the code for an iTVF like the one in the article that splits up to 7999 characters without a loop, plus a small test so you can verify that it works.

    CREATE FUNCTION dbo.fn_Split
    (
        @sText  VARCHAR(7999), --Missing one because we need to add 1 delimiter
        @sDelim VARCHAR(1)     --Just to keep the same datatype
    )
    RETURNS TABLE AS
    RETURN
    (
        SELECT IDX   = t.N - LEN(REPLACE(LEFT(@sText, t.N), @sDelim, '')),
               Value = SUBSTRING(@sText, t.N, CHARINDEX(@sDelim, @sText + @sDelim, t.N) - t.N)
          FROM dbo.Tally t
         WHERE t.N BETWEEN 1 AND DATALENGTH(@sText) + 1 --DATALENGTH allows trailing spaces to be delimiters
           AND SUBSTRING(@sDelim + @sText, t.N, 1) = @sDelim
    )
    GO

    SELECT * FROM dbo.fn_Split('1,10,100,1000,10000,100000,1000000,A,B,C,D,E,Jeff Moden',',')

    --Jeff Moden

    Thank you for the great information. I created the Tally table (mine took 93,516 ms) and created that function. The performance gain was on the order of ~350%. I then trimmed the Tally table down to 8,000 rows, and that gave me an extra 100% performance.

    Running the two versions back to back placed 82% of the query cost on the loop method and the remaining 18% on the Tally method. This was consistent across parameters varying from 25 characters out to 7,950.

    Director of Transmogrification Services
