Sub-queries vs correlated sub-queries

  • Hi.

    Recently I was asked at interview what is the difference in terms of performace between sub-queries, like ...where in (sub-query), and a correlated subquery like ...where exists (select....) ? I answered that the second option is more efficient, but could not excetly explain why.

    Can someone explain performance difference between them ? Thanks.

  • This article may be of use. That, and this article[/url]. Gail delves into comparing the differences of an inner join v. in and in v. exists. Though they do not discuss the subquery, I think it would prove useful reading for you.

    Then peruse this article[/url] to see what Dave found with his investigations into his comparisons.

    And then finally, we come to the conclusion, IT DEPENDS. There may be a case that the subquery is better for the situation. Other times that the existence check would be better. Other times, performing a join is better.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Despite what the interviewers want to hear, the correct answer IS "It Depends" because it does. Let's see what I mean...

    First, a little test data... a million rows in the main table and 1000 rows in the "side" table should do, eh? Don't be afraid... this only takes about a half minute to build those tables on a 8 year old desktop box.

    --DROP TABLE dbo.JBMTest,dbo.MyInTable

    --===== Do this test in a nice safe place that everyone has...

    USE TempDB

    GO

    --===== Create and populate a 1,000,000 row test table.

    -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

    -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    -- Jeff Moden

    SELECT TOP 1000000

    RowNum = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1

    CROSS JOIN Master.dbo.SysColumns t2

    --===== A table is not properly formed unless a Primary Key has been assigned

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Create a table with some "WHERE IN" values

    SELECT TOP 1000

    IDENTITY(INT,1,1) AS RowNum,

    ABS(CHECKSUM(NEWID()))%50000+1 AS SomeInt

    INTO dbo.MyInTable

    FROM Master.dbo.SysColumns t1

    --===== A table is not properly formed unless a Primary Key has been assigned

    ALTER TABLE dbo.MyInTable

    ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Add an index to it

    CREATE INDEX IX_MyInTable_SomeInt

    ON dbo.MyInTable (SomeInt)

    Ok... let's do a test between a normal INNER JOIN, a WHERE IN, and a WHERE EXISTS...

    --===== THE TEST CODE

    SET STATISTICS TIME ON

    SELECT DISTINCT t.*

    FROM dbo.JBMTest t

    INNER JOIN dbo.MyInTable i

    ON t.SomeInt = i.SomeInt

    SELECT t.*

    FROM dbo.JBMTest t

    WHERE t.SomeInt IN (SELECT SomeInt FROM dbo.MyInTable)

    SELECT t.*

    FROM dbo.JBMTest t

    WHERE EXISTS (SELECT 1 FROM dbo.MyInTable i WHERE t.SomeInt = i.SomeInt)

    SET STATISTICS TIME OFF

    Here are two run results from the test code above... note that no one has access to my machine but me and the system is very quiet...

    RUN #1

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 843 ms, elapsed time = 1866 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 844 ms, elapsed time = 1896 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 844 ms, elapsed time = 1775 ms.


    RUN # 2

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 906 ms, elapsed time = 1947 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 688 ms, elapsed time = 1816 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 922 ms, elapsed time = 1683 ms.

    Notice that all 3 methods are virtual ties in the first run and the WHERE EXISTS looses quite badly (CPU wise) to the WHERE IN in the second run. If you run the test code several dozen times, you see that all 3 takes turns tying and winning the differences being caused simply by the "heart beat" of the operating system and SQL Server.

    Let's add an index to the big table and see what happens...

    ----===== Add an index to the larger table

    CREATE INDEX IX_JBMTest_SomeInt

    ON dbo.JBMTest (SomeInt)

    The index makes a relatively huge difference and now only the WHERE IN and WHERE EXISTS take turns winning and tying... Run #1

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 390 ms, elapsed time = 1383 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 219 ms, elapsed time = 1196 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 281 ms, elapsed time = 996 ms.


    Run #2

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 344 ms, elapsed time = 1305 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 281 ms, elapsed time = 1302 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 235 ms, elapsed time = 1075 ms.

    Of course, even a rookie knows that "SELECT *" is a "Bozo-no-no" so lets just select the columns that are covered by indexes.

    --===== THE TEST CODE

    SET STATISTICS TIME ON

    SELECT DISTINCT t.RowNum, t.SomeInt

    FROM dbo.JBMTest t

    INNER JOIN dbo.MyInTable i

    ON t.SomeInt = i.SomeInt

    SELECT t.RowNum, t.SomeInt

    FROM dbo.JBMTest t

    WHERE t.SomeInt IN (SELECT SomeInt FROM dbo.MyInTable)

    SELECT t.RowNum, t.SomeInt

    FROM dbo.JBMTest t

    WHERE EXISTS (SELECT 1 FROM dbo.MyInTable i WHERE t.SomeInt = i.SomeInt)

    SET STATISTICS TIME OFF

    Here are two run results for that. Remember, two runs don't necessarily identify a clear winner. They just show that one or the other has the potential for losing...

    Run #1

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 46 ms, elapsed time = 468 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 16 ms, elapsed time = 547 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 47 ms, elapsed time = 366 ms.


    Run #2

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 78 ms, elapsed time = 556 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 476 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 32 ms, elapsed time = 408 ms.

    Notice that both WHERE IN and WHERE EXISTS both beat the INNER JOIN for this very straight forward code. Some other skilled denizen of this forum could easily write and example the could make the INNER JOIN look the best.

    The bottom line is that, at least in SQL Server 2005, it's an absolute myth that WHERE EXISTS is faster than WHERE IN. It's not a myth that "It Depends". It depends on indexing, it depends on what's in the SELECT list and whether it's covered by an index or not and, as you find out on your own, it also depends on how many joins you have, etc, etc, etc. At least for CPU time.

    Now... what about duration? It usually looks like WHERE EXISTS is the quicker of the two for duration. However... it turns out THAT's just because the system didn't have the room to display the output on the screen for the WHERE EXISTS code... duration includes the time it takes to display the output of a SELECT on the screen (usually... it depends ;-)). Check it out...

    --DROP TABLE #MyHead1, #MyHead2, #MyHead3

    --===== THE TEST CODE

    SET STATISTICS TIME ON

    SELECT DISTINCT t.RowNum, t.SomeInt

    INTO #MyHead1

    FROM dbo.JBMTest t

    INNER JOIN dbo.MyInTable i

    ON t.SomeInt = i.SomeInt

    SELECT t.RowNum, t.SomeInt

    INTO #MyHead2

    FROM dbo.JBMTest t

    WHERE t.SomeInt IN (SELECT SomeInt FROM dbo.MyInTable)

    SELECT t.RowNum, t.SomeInt

    INTO #MyHead3

    FROM dbo.JBMTest t

    WHERE EXISTS (SELECT 1 FROM dbo.MyInTable i WHERE t.SomeInt = i.SomeInt)

    SET STATISTICS TIME OFF

    That code takes the output from each and dumps it into a table. Here are the first two run results from that...

    Run #1

    SQL Server Execution Times:

    CPU time = 125 ms, elapsed time = 120 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 94 ms, elapsed time = 101 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 93 ms, elapsed time = 164 ms.


    Run #2

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:

    CPU time = 125 ms, elapsed time = 130 ms.

    (19868 row(s) affected)

    SQL Server Execution Times:

    CPU time = 110 ms, elapsed time = 106 ms.

    Again, WHERE IN and WHERE EXISTS take turns winning.

    So, we have to continue to add to the proverbial "depends" diaper and say that it also depends on what you're going to do with the output, whether parallelism kicks in, and whether it's Tuesday or not :-D...

    ... be prepared to back it up because there's a very good chance that ALL the interviewers believe in the myth of WHERE EXISTS.

    WHERE NOT EXISTS is a different story but I'm thinking I've just given you enough test code where you can probably write you own test and bring yourself one step closer to being a confident (notice I didn't say "arrogant" or "cocky") "SQL NINJA". Remember what my friend Sergiy says... "A Developer must NOT guess... a Developer must KNOW!" 😉

    Take the time to look at the execution plans of the tests while you're at it. They have a story to tell, as well.:w00t:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • CirquedeSQLeil (1/15/2010)


    This article may be of use. That, and this article[/url]. Gail delves into comparing the differences of an inner join v. in and in v. exists. Though they do not discuss the subquery, I think it would prove useful reading for you.

    Then peruse this article[/url] to see what Dave found with his investigations into his comparisons.

    And then finally, we come to the conclusion, IT DEPENDS. There may be a case that the subquery is better for the situation. Other times that the existence check would be better. Other times, performing a join is better.

    Heh... it took me a while to write my longas response and you've beat me to it with the only correct answer... IT DEPENDS. Hmmmm.... wonder if a loaded "depends" diaper will fit in the pork chop cannon... might be fun to take to an interview. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/15/2010)


    CirquedeSQLeil (1/15/2010)


    This article may be of use. That, and this article[/url]. Gail delves into comparing the differences of an inner join v. in and in v. exists. Though they do not discuss the subquery, I think it would prove useful reading for you.

    Then peruse this article[/url] to see what Dave found with his investigations into his comparisons.

    And then finally, we come to the conclusion, IT DEPENDS. There may be a case that the subquery is better for the situation. Other times that the existence check would be better. Other times, performing a join is better.

    Heh... it took me a while to write my longas response and you've beat me to it with the only correct answer... IT DEPENDS. Hmmmm.... wonder if a loaded "depends" diaper will fit in the pork chop cannon... might be fun to take to an interview. 😛

    LOL. I was thinking I was reading a new article her at SSC when reading your post. That response rocked. Nice setup and examples.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Heh... maybe I should turn it into an article... the question does come up often enough. That's how my very first article was born... the one on hidden RBAR (triangular joins) is almost word for word from a post I wrote way back when including the graphics.

    Thanks for the nice feedback, Jason.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/16/2010)


    Heh... maybe I should turn it into an article... the question does come up often enough. That's how my very first article was born... the one on hidden RBAR (triangular joins) is almost word for word from a post I wrote way back when including the graphics.

    Thanks for the nice feedback, Jason.

    I think it's a good topic for an article. You've already done the legwork - should be a quick turnaround.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Jeff Moden (1/16/2010)


    Heh... maybe I should turn it into an article... the question does come up often enough. That's how my very first article was born... the one on hidden RBAR (triangular joins) is almost word for word from a post I wrote way back when including the graphics.

    Thanks for the nice feedback, Jason.

    Agreed, Jeff. I get so tired of having this argument with folks on-line, who are well meaning but several years out of date. It would be great to just point them to an article from you.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • SQL Guy-482125 (1/15/2010)


    Recently I was asked at interview what is the difference in terms of performace between sub-queries, like ...where in (sub-query), and a correlated subquery like ...where exists (select....) ? I answered that the second option is more efficient, but could not exactly explain why.

    The answer is that often there is no difference at all.

    Whether you write the query using IN or EXISTS is frequently just a semantic choice: the queries are identical, from a logical perspective. The SQL Server Query Optimizer contains extensive rules that are able to transform equivalently-written textual statements to the same, optimal, executable form.

    Take Jeff's tests. The IN and EXISTS forms both compile to exactly the same query plan - a plan which we might expect to be produced from a quite different written form:

    WITH DistinctSomeInt

    AS (

    -- Group the lookup values in the 'IN' or 'EXISTS'

    -- target table. No unique index or primary key

    -- exists to enforce unique values, so based

    -- on statistics, the optimizer may decide to do

    -- a GROUP BY to remove any duplicates. Removing

    -- duplicates from the lookup list won't affect

    -- the final result, but will reduce the number

    -- of joins...

    -- Returning the rows in index order allows

    -- an efficient Stream Aggregate to remove

    -- duplicates (any duplicates are guaranteed to be

    -- adjacent in the data stream)

    SELECT MIT.SomeInt

    FROM dbo.MyInTable MIT

    WITH (INDEX(IX_MyInTable_SomeInt))

    GROUP BY

    MIT.SomeInt

    ),

    OrderedTestTable

    AS (

    -- Access the columns we need for the output

    -- from the base table, using the covering

    -- index, returning rows in index order

    -- (Ordered: True)

    SELECT JBMT.RowNum,

    JBMT.SomeInt

    FROM dbo.JBMTest JBMT

    WITH (INDEX(IX_JBMTest_SomeInt))

    )

    SELECT OrderedTestTable.RowNum,

    OrderedTestTable.SomeInt

    FROM DistinctSomeInt

    -- Enforce a loop join, and force the join order

    INNER

    LOOP

    JOIN OrderedTestTable

    ON OrderedTestTable.SomeInt = DistinctSomeInt.SomeInt

    OPTION (FORCE ORDER, ORDER GROUP); -- ORDER GROUP to enforce the Stream Aggregate

    It's quite amazing how much work the optimizer does on your behalf sometimes 🙂

    So, the correct answer to the original question is possibly: "Mu!"

    See Craig Freedman's excellent blog for further examples of transformations and other plan optimizations available in SQL Server. The link there will take you to a quite appropriate example of transforming a semi join (EXISTS) to an INNER JOIN.

    Paul

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply