Semi joins

  • I wrote a blog post (for my own reference) about the different types of joins that you can have in SQL.

    However, I'm interested in a particular aspect of Semi joins.

    Say you have a semi-join that uses the following syntax:

    [font="Courier New"]SELECT o.OrgUnitID

    FROM OrgUnit AS o

    WHERE EXISTS

    (

    SELECT p.OrgUnitID FROM Person AS p

    WHERE p.OrgUnitID = o.OrgUnitID

    )[/font]

    OK, so this will find all the org units that have a person in it. But let's say that you have 100,000 people in an Org unit. Would the correlated sub query bring back 100,000 rows? If so, would it be better to phrase the query as:

    [font="Courier New"]SELECT o.OrgUnitID

    FROM OrgUnit AS o

    WHERE EXISTS

    (

    SELECT top 1 p.OrgUnitID FROM Person AS p

    WHERE p.OrgUnitID = o.OrgUnitID

    )[/font]

    My thinking is that the correlated subquery would only bring back the first matching row, then stop scanning. If that's the case, then this would be more efficient.

    What do people think? Am I missing something?

    Random Technical Stuff[/url]

  • ta.bu.shi.da.yu (12/27/2008)


    I wrote a blog post (for my own reference) about the different types of joins that you can have in SQL.

    However, I'm interested in a particular aspect of Semi joins.

    Say you have a semi-join that uses the following syntax:

    [font="Courier New"]SELECT o.OrgUnitID

    FROM OrgUnit AS o

    WHERE EXISTS

    (

    SELECT p.OrgUnitID FROM Person AS p

    WHERE p.OrgUnitID = o.OrgUnitID

    )[/font]

    OK, so this will find all the org units that have a person in it. But let's say that you have 100,000 people in an Org unit. Would the correlated sub query bring back 100,000 rows? If so, would it be better to phrase the query as:

    [font="Courier New"]SELECT o.OrgUnitID

    FROM OrgUnit AS o

    WHERE EXISTS

    (

    SELECT top 1 p.OrgUnitID FROM Person AS p

    WHERE p.OrgUnitID = o.OrgUnitID

    )[/font]

    My thinking is that the correlated subquery would only bring back the first matching row, then stop scanning. If that's the case, then this would be more efficient.

    What do people think? Am I missing something?

    Based on my understanding of EXISTS, I think both queries are equivalent. Best thing you could do is test the queries against the large dataset you propose and compare the executions plans and performance between them.

  • WHERE IN and WHERE EXISTS naturally do the equivalent of a DISTINCT and both have identical execution plans and both take the same amount of time and resources to execute whether indexes exist or not. Of course, the correct indexes allow for better performance of both.

    If a DISTINCT isn't necessary, INNER JOIN usually ties WHERE IN and WHERE EXISTS for performance and resource usage. Only when you add DISTINCT does Inner Join lose, and then it's not by much.

    Of course, you shouldn't just up an take my word for it... test it yourself...

    Here's the test data...

    --DROP TABLE dbo.JBMTest,#MyInTable

    GO

    --===== Create and populate a 1,000,000 row test table.

    -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

    -- Column "SomeInt" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    -- Jeff Moden

    SELECT TOP 1000000

    RowNum = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1

    CROSS JOIN Master.dbo.SysColumns t2

    --===== A table is not properly formed unless a Primary Key has been assigned

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Create a table with some "WHERE IN" values

    SELECT TOP 1000

    IDENTITY(INT,1,1) AS RowNum,

    ABS(CHECKSUM(NEWID()))%50000+1 AS SomeInt

    INTO #MyInTable

    FROM Master.dbo.SysColumns t1

    --===== A table is not properly formed unless a Primary Key has been assigned

    ALTER TABLE #MyInTable

    ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Add an index to it

    CREATE INDEX IX_MyInTable_SomeInt

    ON #MyInTable (SomeInt)

    Here's the test code...

    --===== THE TEST CODE

    DECLARE @Bitbucket INT

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SELECT DISTINCT

    @Bitbucket = t.RowNum

    FROM dbo.JBMTest t

    INNER JOIN #MyInTable i

    ON t.SomeInt = i.SomeInt

    SELECT @Bitbucket = t.RowNum

    FROM dbo.JBMTest t

    WHERE t.SomeInt IN (SELECT SomeInt FROM #MyInTable)

    SELECT @Bitbucket = t.RowNum

    FROM dbo.JBMTest t

    WHERE EXISTS (SELECT 1 FROM #MyInTable i WHERE t.SomeInt = i.SomeInt)

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    ... and here's an additional index that you can test with...

    ----===== Add an index to the larger table

    CREATE INDEX IX_JBMTest_SomeInt

    ON dbo.JBMTest (SomeInt)

    My personal preferance is to never use WHERE EXISTS simply because it contains a correlated subquery. In this case, it doesn't hurt performance but I've seen where it has in the past. Further, correlated subqueries cannot be tested in a standalone fashion where a WHERE IN can be.

    WHERE IN is very tempting to use because it does an inherent DISTINCT (as does WHERE EXISTS) and because it does allow for separate testing where even an INNER JOIN might not. The problem is, many people abuse WHERE IN on lengthy code and it sometimes becomes a bit difficult to read on large queries. It also doesn't allow for multi-column joins like WHERE EXISTS and INNER JOIN do.

    I've also been known to make separate derived tables just for the purpose of testing and ease of trouble shooting and then inner join the derived tables. The code becomes VERY easy to read and VERY easy to troubleshoot or modify. And, there's no change in performance in doing so except it's occasionally better (or at least, that's my perception).

    So, bottom line for me is that I'll use WHERE IN only when trying to do a comparison with constants and I use INNER JOIN for everything else even if a DISTINCT is involved. Yes, the code does get a little longer... but the readability and ease of modification make it all worth while on complex code.

    Heh... at least for me, it does. This is all very subjective to what boils down to being a huge personal preference. No flaming on this one, folks... 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just to be clear - EXISTS does not return any rows at all. Exists stops processing as soon as a match is found. The EXISTS query may be slightly more efficient because it does not have to return any data or perform any matching.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • But, it's not... run the test code I made.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • No need to run the test code - I did say *may*...;)

    Now, if you look at what the OP posted - you'll see that the WHERE IN is also a correlated sub-query. In this case, the correlated sub-query and the TOP 1 essentially does the same thing as the EXISTS. This may not be as efficient - then again, it may.

    I also prefer using WHERE IN when there are constant values to check. I get a bit more concerned when I see:

    WHERE column IN (SELECT somecolumn FROM sometable)

    If the sub-query in the above can return a null value at all - the IN will fail. At this point, you have to modify the query to eliminate NULLs, which is still usually better than moving to a correlated sub-query using EXISTS - but is more often missed when troubleshooting.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Heh... that's why I wrote the test code... I don't have to say *may* that way. 😛

    I just didn't want people to walk away from this post thinking that the myth of performance using WHERE EXISTS had any validity to it, whatsoever.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeffrey Williams (12/28/2008)


    Just to be clear - EXISTS does not return any rows at all. Exists stops processing as soon as a match is found. The EXISTS query may be slightly more efficient because it does not have to return any data or perform any matching.

    Jeff Moden (12/28/2008)But, it's not... run the test code I made.

    Uh, I did, and the EXISTS blows the others away. I even tried it on SQL server 2008 with the same results.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Jeff Moden (12/28/2008)


    I just didn't want people to walk away from this post thinking that the myth of performance using WHERE EXISTS had any validity to it, whatsoever.

    I am not sure what myth you mean here, Jeff, but what I do know is that I have never seen a case where an EXISTS was consistently slower than the equivalent JOIN in SQL 2005 or 2008. And I have seen many cases (like this one) where the JOIN is slower.

    SQL Server 2000, is a different story of course. Back when I was using it, I saw many such cases which IMHO were a performance bug in SQL 2000. And that, I think, was the cause of the myth that correlated subqueries were necessarily a bad thing, when really, there is no logical reason why they should be any slower than an equivalent correlated join.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Huh... on my machine, WHERE EXISTS ties the INNER JOIN. Guess it goes to show... "It Depends".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hmmmm... I think I'm going to definitely test this one out myself!

    When I do I'll post my results.

    Random Technical Stuff[/url]

  • ta.bu.shi.da.yu (12/29/2008)


    Hmmmm... I think I'm going to definitely test this one out myself!

    When I do I'll post my results.

    Now that is something that I think Jeff & I will both agree on. 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • I think the "It depends" plays on the level of correlation between the two data sources. That is - if there is a HIGH level of correlation, then the join starts to edge out the EXISTS. If on the other hand - the number of items in common is low, then EXISTS tends to get faster.

    That being said - I do recall it being a LITTLE faster for the joins, versus a LOT faster for the EXISTS. Meaning - on the whole, when in doubt, EXISTS was a better choice since it tended to perform just a little bit better (or rather didn't seem to have really bad perf areas, which was not true for the join).

    I did some tests on this a while back. I will see if I can dig out the code for it.....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • It always depends.

    I would tend to use the EXISTS clause if I could. I think there are shortcuts at times where it bails early and is more efficient. It really depends on the logic you're trying to implement.

  • Heh... I guess I don't use WHERE EXISTS for the same reason I don't use Try/Catch... I know the database and I know the DRI... when I join on something, it's like Ragu sauce... "It's In There!" 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 24 total)

You must be logged in to reply to this topic. Login to reply