EXISTS queries that can't be written any other way

  • Hello. All examples of the EXIST operator I know of use correlated subqueries that could be also written as a join with no performance gain or loss. Can anyone give me an example of a subquery that can only be written using EXISTS?

    Thank you!

  • The simple example is if the subquery returns more than one row for the joining condition.

    If you use a join, the results contain extra rows.

    EXISTS will not create the extra rows.

    You can certainly eliminate the extra rows with a GROUP BY or DISTINCT, but that almost certainly will cause a performance hit.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thanks for the reply. Sounds perfectly reasonable, except that when I compare the 2 queries, one written with the EXISTS operator and a correlated subquery and the other as a join, the execution plans and costs are identical.

  • http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/

    http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I think that it's always possible to rewrite an EXISTS query with an INNER JOIN or a CROSS APPLY.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Can you post some sample data and your queries?

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thanks, Gail and Drew. Gail, the links don't directly discuss EXISTS vs join, but it looks like one can conclude what Drew is saying that there is really no difference in performance, is that what you would say, too? Unless, perhaps the data set I'm using is not big enough to produce a noticeable difference in performance?

  • Michael, I'm using the AdventureWorks database and the following:

    USE AdventureWorks

    SELECT Name

    FROM Production.Product p

    WHERE EXISTS

    (SELECT *

    FROM Production.ProductSubcategory

    WHERE ProductSubcategoryID = p.ProductSubcategoryID

    AND Name = 'Wheels')

    SELECT p.Name

    FROM Production.Product p

    JOIN Production.ProductSubcategory psc

    ON psc.ProductSubcategoryID = p.ProductSubcategoryID

    WHERE psc.Name = 'Wheels'

  • peter 82125 (10/21/2015)


    Thanks, Gail and Drew. Gail, the links don't directly discuss EXISTS vs join, but it looks like one can conclude what Drew is saying that there is really no difference in performance, is that what you would say, too? Unless, perhaps the data set I'm using is not big enough to produce a noticeable difference in performance?

    Note, I did not say that there would be no difference in performance. I just said that they could be rewritten without using EXISTS. I've made assumptions about relative performance in the past and have learned to always test if I think that there might be a better performing alternative. I make no claims about the performance of tenuous queries.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Duly noted.

  • It may be rather academic, but there is at least one case where EXISTS is nicer than a JOIN.

    The potential (emphasis on potential) performance advantage that EXISTS has is the use of a semi join, which allows the subquery to effectively short-circuit when a match is found (on that note, the query plans you're seeing are likely not actually identical; while the physical join type might well be the same in both cases, the logical join type for the EXISTS query probably shows as a left semi join instead of an inner join).

    If there are very few matches, then you may see very little difference in performance, since there won't be much use of short-circuiting when a match is found. If there are many, many matches, then you'll start to see a bigger difference.

    Also, if the criterion for matching rows is an equality condition, then once you have enough rows, both the JOIN and the EXISTS will likely end up using hash joins and just scan each table once, again minimizing the difference.

    If the criterion for matching rows is not an equality condition and there are many matches, then you can start to see a big difference between EXISTS and the JOIN. Going back to what I said about this being somewhat academic, it's true that matching rows on a condition other than equality is rather uncommon compared to matching rows on equality. Still, this is a case where EXISTS can significantly outperform a JOIN.

    Below is some code to illustrate that (takes about a minute to run on my machine; if you have a beastly machine, it might finish much more quickly:-)):

    --Of course don't run the FREEPROCCACHE on your production server.

    --I'm just running it so my query stats come back more cleanly

    --on my lab server

    DBCC FREEPROCCACHE

    CREATE TABLE #BlackList (Pattern varchar(30))

    CREATE TABLE #Phrases (Phrase char(36))

    --Populate a bunch of "phrases"

    INSERT INTO #Phrases

    SELECT TOP 5000 CAST(NEWID() AS char(36))

    --Construct a bunch of rows

    FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t1(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t2(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t3(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t4(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t5(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t6(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t7(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t8(n)

    CROSS JOIN (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t9(n)

    --Populate a bunch of blacklisted patterns based on the phrases

    --we just created

    INSERT INTO #BlackList

    SELECT TOP 1000 '%'+SUBSTRING(Phrase,4,5)+'%'

    FROM #Phrases

    INSERT INTO #BlackList

    SELECT TOP 1000 '%'+SUBSTRING(Phrase,15,6)+'%'

    FROM #Phrases

    INSERT INTO #BlackList

    SELECT TOP 1000 '%'+SUBSTRING(Phrase,23,3)+'%'

    FROM #Phrases

    --I like to use EXEC just to make sure the query text doesn't change

    --just because I include a little extra white space or some such

    --nonsense. That makes sure I keep the stats under the same entry

    --in sys.dm_exec_query_stats.

    --I also like to prefix with "--QueryTest:" so I can find the query

    --stats easily and don't have to run SET STATISTICS or other

    --observer-effect-laden things

    --I'm also using MAXDOP 1 just to avoid any quirks from parallelism

    --Ok, let's find out how many of our phrases run afoul of the blacklist

    EXEC('--QueryTest: INNER JOIN

    SELECT COUNT(Phrase)

    FROM #Phrases Phr

    INNER JOIN #Blacklist Pat

    ON Phr.Phrase LIKE Pat.Pattern

    OPTION (MAXDOP 1)')

    EXEC('--QueryTest: EXISTS

    SELECT COUNT(*) FROM #Phrases Phr

    WHERE EXISTS (SELECT NULL FROM #Blacklist Pat WHERE Phr.Phrase LIKE Pat.Pattern)

    OPTION (MAXDOP 1)')

    --The INNER JOIN above gives us the wrong counts because some phrases

    --match multiple patterns, so we have to add a DISTINCT to remedy that

    EXEC('--QueryTest: DISTINCT/INNER JOIN

    SELECT COUNT(DISTINCT Phrase)

    FROM #Phrases Phr

    INNER JOIN #Blacklist Pat

    ON Phr.Phrase LIKE Pat.Pattern

    OPTION (MAXDOP 1)')

    --CROSS APPLY with TOP 1 is another option that can perform reasonably

    --under the right circumstances. Here it performs similarly to EXISTS.

    EXEC('--QueryTest: CROSS APPLY

    SELECT COUNT(Phrase)

    FROM #Phrases Phr

    CROSS APPLY (SELECT TOP 1 Pattern FROM #Blacklist Pat WHERE Phr.Phrase LIKE Pat.Pattern) X

    OPTION (MAXDOP 1)')

    SELECT

    qs.last_logical_reads,

    qs.last_worker_time/1000 AS CPU_ms,

    qs.last_elapsed_time/1000 AS Duration_ms,

    qs.last_execution_time, qp.query_plan, qt.text

    FROM sys.dm_exec_query_stats qs

    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) qp

    CROSS APPLY sys.dm_exec_sql_text(qs.plan_handle) qt

    WHERE qt.text LIKE '--QueryTest%'

    ORDER BY last_execution_time desc

    DROP TABLE #Blacklist, #Phrases

    You can play around with the ratio between Phrases and matching Blacklist patterns to see how that affects the performance difference.

    Hopefully this helps! (Also, full disclosure, it's getting late and I'm severely under-caffeinated, so if something I've said seems ridiculous, it probably is, and I apologize in advance for that :-))

    Cheers!

  • Here is a related blog by Aaron Bertrand: sqlperformance.com/2012/12/t-sql-queries/left-anti-semi-join


    The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url]

    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs[/url]

  • peter 82125 (10/21/2015)


    Thanks, Gail and Drew. Gail, the links don't directly discuss EXISTS vs join, but it looks like one can conclude what Drew is saying that there is really no difference in performance, is that what you would say, too? Unless, perhaps the data set I'm using is not big enough to produce a noticeable difference in performance?

    There's a roundup that I didn't post the link for which compares all three. Still, EXISTS compared to IN and IN to join, you can extrapolate the result easily enough

    There is a performance difference, but it's minimal. It's not enough that I'd convert a join to EXISTS for performance reasons, but I will use EXISTS when I'm looking for whether there are rows or not and joins when I need to actually join.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • EXISTS is also safer than IN when you're dealing with a nullable column.

  • Thank you all so much for your informative replies! Jacob, you rock, caffeine or none! Really appreciate it 🙂

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply