Sub-queries vs correlated sub-queries

Question

Sub-queries vs correlated sub-queries

SQL Guy 1

SSCoach

Points: 15817
More actions
January 15, 2010 at 1:58 pm

#215705

Hi.
Recently I was asked at interview what is the difference in terms of performace between sub-queries, like ...where in (sub-query), and a correlated subquery like ...where exists (select....) ? I answered that the second option is more efficient, but could not excetly explain why.
Can someone explain performance difference between them ? Thanks.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

SQLRNNR SSC Guru Points: 281334 More actions · Answer 1

This article may be of use. That, and this article[/url]. Gail delves into comparing the differences of an inner join v. in and in v. exists. Though they do not discuss the subquery, I think it would prove useful reading for you.

Then peruse this article[/url] to see what Dave found with his investigations into his comparisons.

And then finally, we come to the conclusion, IT DEPENDS. There may be a case that the subquery is better for the situation. Other times that the existence check would be better. Other times, performing a join is better.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 2

Despite what the interviewers want to hear, the correct answer IS "It Depends" because it does. Let's see what I mean...

First, a little test data... a million rows in the main table and 1000 rows in the "side" table should do, eh? Don't be afraid... this only takes about a half minute to build those tables on a 8 year old desktop box.

--DROP TABLE dbo.JBMTest,dbo.MyInTable

--===== Do this test in a nice safe place that everyone has...

USE TempDB

GO

--===== Create and populate a 1,000,000 row test table.

-- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

-- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

-- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

-- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

-- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

-- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

-- for all rows.

-- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

-- Jeff Moden

SELECT TOP 1000000

RowNum = IDENTITY(INT,1,1),

SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),

SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

SomeHex12 = RIGHT(NEWID(),12)

INTO dbo.JBMTest

FROM Master.dbo.SysColumns t1

CROSS JOIN Master.dbo.SysColumns t2

--===== A table is not properly formed unless a Primary Key has been assigned

ALTER TABLE dbo.JBMTest

ADD PRIMARY KEY CLUSTERED (RowNum)

--===== Create a table with some "WHERE IN" values

SELECT TOP 1000

IDENTITY(INT,1,1) AS RowNum,

ABS(CHECKSUM(NEWID()))%50000+1 AS SomeInt

INTO dbo.MyInTable

FROM Master.dbo.SysColumns t1

--===== A table is not properly formed unless a Primary Key has been assigned

ALTER TABLE dbo.MyInTable

ADD PRIMARY KEY CLUSTERED (RowNum)

--===== Add an index to it

CREATE INDEX IX_MyInTable_SomeInt

ON dbo.MyInTable (SomeInt)

Ok... let's do a test between a normal INNER JOIN, a WHERE IN, and a WHERE EXISTS...

--===== THE TEST CODE

SET STATISTICS TIME ON

SELECT DISTINCT t.*

FROM dbo.JBMTest t

INNER JOIN dbo.MyInTable i

ON t.SomeInt = i.SomeInt

SELECT t.*

FROM dbo.JBMTest t

WHERE t.SomeInt IN (SELECT SomeInt FROM dbo.MyInTable)

SELECT t.*

FROM dbo.JBMTest t

WHERE EXISTS (SELECT 1 FROM dbo.MyInTable i WHERE t.SomeInt = i.SomeInt)

SET STATISTICS TIME OFF

Here are two run results from the test code above... note that no one has access to my machine but me and the system is very quiet...

RUN #1

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 843 ms, elapsed time = 1866 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 844 ms, elapsed time = 1896 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 844 ms, elapsed time = 1775 ms.

RUN # 2
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 906 ms, elapsed time = 1947 ms.
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 688 ms, elapsed time = 1816 ms.
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 922 ms, elapsed time = 1683 ms.

Notice that all 3 methods are virtual ties in the first run and the WHERE EXISTS looses quite badly (CPU wise) to the WHERE IN in the second run. If you run the test code several dozen times, you see that all 3 takes turns tying and winning the differences being caused simply by the "heart beat" of the operating system and SQL Server.

Let's add an index to the big table and see what happens...

----===== Add an index to the larger table

CREATE INDEX IX_JBMTest_SomeInt

ON dbo.JBMTest (SomeInt)

The index makes a relatively huge difference and now only the WHERE IN and WHERE EXISTS take turns winning and tying... Run #1

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 390 ms, elapsed time = 1383 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 219 ms, elapsed time = 1196 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 281 ms, elapsed time = 996 ms.

Run #2
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 344 ms, elapsed time = 1305 ms.
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 281 ms, elapsed time = 1302 ms.
(19868 row(s) affected)
SQL Server Execution Times:

CPU time = 235 ms, elapsed time = 1075 ms.

Of course, even a rookie knows that "SELECT *" is a "Bozo-no-no" so lets just select the columns that are covered by indexes.

--===== THE TEST CODE

SET STATISTICS TIME ON

SELECT DISTINCT t.RowNum, t.SomeInt

FROM dbo.JBMTest t

INNER JOIN dbo.MyInTable i

ON t.SomeInt = i.SomeInt

SELECT t.RowNum, t.SomeInt

FROM dbo.JBMTest t

WHERE t.SomeInt IN (SELECT SomeInt FROM dbo.MyInTable)

SELECT t.RowNum, t.SomeInt

FROM dbo.JBMTest t

WHERE EXISTS (SELECT 1 FROM dbo.MyInTable i WHERE t.SomeInt = i.SomeInt)

SET STATISTICS TIME OFF

Here are two run results for that. Remember, two runs don't necessarily identify a clear winner. They just show that one or the other has the potential for losing...

Run #1

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 46 ms, elapsed time = 468 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 16 ms, elapsed time = 547 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 47 ms, elapsed time = 366 ms.

Run #2
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 78 ms, elapsed time = 556 ms.
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 0 ms, elapsed time = 476 ms.
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 32 ms, elapsed time = 408 ms.

Notice that both WHERE IN and WHERE EXISTS both beat the INNER JOIN for this very straight forward code. Some other skilled denizen of this forum could easily write and example the could make the INNER JOIN look the best.

The bottom line is that, at least in SQL Server 2005, it's an absolute myth that WHERE EXISTS is faster than WHERE IN. It's not a myth that "It Depends". It depends on indexing, it depends on what's in the SELECT list and whether it's covered by an index or not and, as you find out on your own, it also depends on how many joins you have, etc, etc, etc. At least for CPU time.

Now... what about duration? It usually looks like WHERE EXISTS is the quicker of the two for duration. However... it turns out THAT's just because the system didn't have the room to display the output on the screen for the WHERE EXISTS code... duration includes the time it takes to display the output of a SELECT on the screen (usually... it depends ;-)). Check it out...

--DROP TABLE #MyHead1, #MyHead2, #MyHead3

--===== THE TEST CODE

SET STATISTICS TIME ON

SELECT DISTINCT t.RowNum, t.SomeInt

INTO #MyHead1

FROM dbo.JBMTest t

INNER JOIN dbo.MyInTable i

ON t.SomeInt = i.SomeInt

SELECT t.RowNum, t.SomeInt

INTO #MyHead2

FROM dbo.JBMTest t

WHERE t.SomeInt IN (SELECT SomeInt FROM dbo.MyInTable)

SELECT t.RowNum, t.SomeInt

INTO #MyHead3

FROM dbo.JBMTest t

WHERE EXISTS (SELECT 1 FROM dbo.MyInTable i WHERE t.SomeInt = i.SomeInt)

SET STATISTICS TIME OFF

That code takes the output from each and dumps it into a table. Here are the first two run results from that...

Run #1

SQL Server Execution Times:

CPU time = 125 ms, elapsed time = 120 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 94 ms, elapsed time = 101 ms.

(19868 row(s) affected)

SQL Server Execution Times:

CPU time = 93 ms, elapsed time = 164 ms.

Run #2
SQL Server parse and compile time:
 CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
 CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
 CPU time = 125 ms, elapsed time = 130 ms.
(19868 row(s) affected)
SQL Server Execution Times:
 CPU time = 110 ms, elapsed time = 106 ms.

Again, WHERE IN and WHERE EXISTS take turns winning.

So, we have to continue to add to the proverbial "depends" diaper and say that it also depends on what you're going to do with the output, whether parallelism kicks in, and whether it's Tuesday or not :-D...

... be prepared to back it up because there's a very good chance that ALL the interviewers believe in the myth of WHERE EXISTS.

WHERE NOT EXISTS is a different story but I'm thinking I've just given you enough test code where you can probably write you own test and bring yourself one step closer to being a confident (notice I didn't say "arrogant" or "cocky") "SQL NINJA". Remember what my friend Sergiy says... "A Developer must NOT guess... a Developer must KNOW!" 😉

Take the time to look at the execution plans of the tests while you're at it. They have a story to tell, as well.:w00t:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 3

CirquedeSQLeil (1/15/2010)
This article may be of use. That, and this article[/url]. Gail delves into comparing the differences of an inner join v. in and in v. exists. Though they do not discuss the subquery, I think it would prove useful reading for you.
Then peruse this article[/url] to see what Dave found with his investigations into his comparisons.
And then finally, we come to the conclusion, IT DEPENDS. There may be a case that the subquery is better for the situation. Other times that the existence check would be better. Other times, performing a join is better.

Heh... it took me a while to write my longas response and you've beat me to it with the only correct answer... IT DEPENDS. Hmmmm.... wonder if a loaded "depends" diaper will fit in the pork chop cannon... might be fun to take to an interview. 😛

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SQLRNNR SSC Guru Points: 281334 More actions · Answer 4

Jeff Moden (1/15/2010)
CirquedeSQLeil (1/15/2010)
This article may be of use. That, and this article[/url]. Gail delves into comparing the differences of an inner join v. in and in v. exists. Though they do not discuss the subquery, I think it would prove useful reading for you.
Then peruse this article[/url] to see what Dave found with his investigations into his comparisons.
And then finally, we come to the conclusion, IT DEPENDS. There may be a case that the subquery is better for the situation. Other times that the existence check would be better. Other times, performing a join is better.
Heh... it took me a while to write my longas response and you've beat me to it with the only correct answer... IT DEPENDS. Hmmmm.... wonder if a loaded "depends" diaper will fit in the pork chop cannon... might be fun to take to an interview. 😛

LOL. I was thinking I was reading a new article her at SSC when reading your post. That response rocked. Nice setup and examples.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 5

Heh... maybe I should turn it into an article... the question does come up often enough. That's how my very first article was born... the one on hidden RBAR (triangular joins) is almost word for word from a post I wrote way back when including the graphics.

Thanks for the nice feedback, Jason.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SQLRNNR SSC Guru Points: 281334 More actions · Answer 6

Jeff Moden (1/16/2010)
Heh... maybe I should turn it into an article... the question does come up often enough. That's how my very first article was born... the one on hidden RBAR (triangular joins) is almost word for word from a post I wrote way back when including the graphics.
Thanks for the nice feedback, Jason.

I think it's a good topic for an article. You've already done the legwork - should be a quick turnaround.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

RBarryYoung SSC Guru Points: 143329 More actions · Answer 7

Jeff Moden (1/16/2010)
Heh... maybe I should turn it into an article... the question does come up often enough. That's how my very first article was born... the one on hidden RBAR (triangular joins) is almost word for word from a post I wrote way back when including the graphics.
Thanks for the nice feedback, Jason.

Agreed, Jeff. I get so tired of having this argument with folks on-line, who are well meaning but several years out of date. It would be great to just point them to an article from you.

[font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung [font="Arial Black"]
Proactive Performance Solutions, Inc. [/font][font="Verdana"] "Performance is our middle name."[/font]

Paul White SSC Guru Points: 150467 More actions · Answer 8

SQL Guy-482125 (1/15/2010)
Recently I was asked at interview what is the difference in terms of performace between sub-queries, like ...where in (sub-query), and a correlated subquery like ...where exists (select....) ? I answered that the second option is more efficient, but could not exactly explain why.

The answer is that often there is no difference at all.

Whether you write the query using IN or EXISTS is frequently just a semantic choice: the queries are identical, from a logical perspective. The SQL Server Query Optimizer contains extensive rules that are able to transform equivalently-written textual statements to the same, optimal, executable form.

Take Jeff's tests. The IN and EXISTS forms both compile to exactly the same query plan - a plan which we might expect to be produced from a quite different written form:

WITH DistinctSomeInt

AS (

-- Group the lookup values in the 'IN' or 'EXISTS'

-- target table. No unique index or primary key

-- exists to enforce unique values, so based

-- on statistics, the optimizer may decide to do

-- a GROUP BY to remove any duplicates. Removing

-- duplicates from the lookup list won't affect

-- the final result, but will reduce the number

-- of joins...

-- Returning the rows in index order allows

-- an efficient Stream Aggregate to remove

-- duplicates (any duplicates are guaranteed to be

-- adjacent in the data stream)

SELECT MIT.SomeInt

FROM dbo.MyInTable MIT

WITH (INDEX(IX_MyInTable_SomeInt))

GROUP BY

MIT.SomeInt

),

OrderedTestTable

AS (

-- Access the columns we need for the output

-- from the base table, using the covering

-- index, returning rows in index order

-- (Ordered: True)

SELECT JBMT.RowNum,

JBMT.SomeInt

FROM dbo.JBMTest JBMT

WITH (INDEX(IX_JBMTest_SomeInt))

)

SELECT OrderedTestTable.RowNum,

OrderedTestTable.SomeInt

FROM DistinctSomeInt

-- Enforce a loop join, and force the join order

INNER

LOOP

JOIN OrderedTestTable

ON OrderedTestTable.SomeInt = DistinctSomeInt.SomeInt

OPTION (FORCE ORDER, ORDER GROUP); -- ORDER GROUP to enforce the Stream Aggregate

It's quite amazing how much work the optimizer does on your behalf sometimes 🙂

So, the correct answer to the original question is possibly: "Mu!"

See Craig Freedman's excellent blog for further examples of transformations and other plan optimizations available in SQL Server. The link there will take you to a quite appropriate example of transforming a semi join (EXISTS) to an INNER JOIN.

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi