Semi joins

  • Well here is something interesting. Unfortunately, I'm still learning about the underlying database engine (I'm only half way through Karen Delaney's excellent Query Tuning and Optimization book for SQL Server 2005...) but this is what I'm seeing when I use Jeff's test data tables (btw, thanks Jeff!):

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SELECT o.SomeInt

    FROM #MyInTable AS o

    WHERE o.SomeInt in

    (

    SELECT top 1 p.SomeInt FROM JBMTest AS p

    WHERE p.SomeInt = o.SomeInt

    )

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    (1000 row(s) affected)

    Table 'Worktable'. Scan count 1000, logical reads 2920331, physical reads 0, read-ahead reads 0,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'JBMTest'. Scan count 1, logical reads 15186, physical reads 0, read-ahead reads 0, lob logical reads 0,

    lob physical reads 0, lob read-ahead reads 0.

    Table '#MyInTable ___ 000000000081'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 4469 ms, elapsed time = 4478 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    SELECT o.SomeInt

    FROM #MyInTable AS o

    WHERE exists --o.SomeInt in

    (

    SELECT top 1 p.SomeInt FROM JBMTest AS p

    WHERE p.SomeInt = o.SomeInt

    )

    SET STATISTICS IO OFF

    SET STATISTICS TIME OFF

    (1000 row(s) affected)

    Table '#MyInTable___000000000081'. Scan count 9, logical reads 7, physical reads 0, read-ahead reads 0,

    lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'JBMTest'. Scan count 9, logical reads 15296, physical reads 0, read-ahead reads 0, lob logical reads 0,

    lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0,

    lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 406 ms, elapsed time = 59 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 1 ms.

    Look at the logical reads on the first query as compared to the second query!!!!

    Random Technical Stuff[/url]

  • Ummmm... apologies, it seems I've accidently widened the page 🙁

    Anyone know how to fix the post?

    Random Technical Stuff[/url]

  • You'd have to shorten the lines in the code box you made... but you're ok.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Fixed... sorry about that.

    I'll post some observations tomorrow.

    Random Technical Stuff[/url]

  • One thing that I noticed is that you have reversed the roles of the tables in Jeff's example. Originally, Jeff had #MyInTable as the subordinate table, but you are using it as the primary table.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (12/30/2008)


    One thing that I noticed is that you have reversed the roles of the tables in Jeff's example. Originally, Jeff had #MyInTable as the subordinate table, but you are using it as the primary table.

    I did this at work... I'll go back to my workstation tomorrow and have a look to see I did everything correctly... but I thought that's what I said 🙁 100,000 people in the OrgUnit... where #MyInTable is 1000 records (thus it is OrgUnit) and 100,000 records for JBMTest (Person table).

    I may have been a tad unclear.

    Random Technical Stuff[/url]

  • Peculiar... on our 2008 development box (which is pretty quiet this week), I ran Jeff's original test and all three ran almost neck and neck. EXISTS squeaked out a win over Join and IN came in third.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (12/30/2008)


    Peculiar... on our 2008 development box (which is pretty quiet this week), I ran Jeff's original test and all three ran almost neck and neck. EXISTS squeaked out a win over Join and IN came in third.

    Whew! Glad I'm not the only one!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • It's interesting looking at the execution plans for the two queries:

    SELECT o.SomeInt

    FROM #MyInTable AS o

    WHERE o.SomeInt in

    (

    SELECT top 1 p.SomeInt FROM JBMTest AS p

    WHERE p.SomeInt = o.SomeInt

    )

    |--Nested Loops(Left Semi Join, OUTER REFERENCES: ([o].[SomeInt]))

    |--Index Scan(OBJECT: ([tempdb].[dbo].[#MyInTable] AS [o]))

    |--Filter(WHERE: ([ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]=

    [tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

    |--Top(TOP EXPRESSION: ((1)))

    |--Index Spool(SEEK: ([p].[SomeInt]=

    [tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

    |--Clustered Index Scan

    (OBJECT: ([ISS_V8_CS].[dbo].[JBMTest].[PK__JBMTest__7EAD8B99] AS [p]))

    SELECT o.SomeInt

    FROM #MyInTable AS o

    WHERE EXISTS

    (

    SELECT top 1 p.SomeInt FROM JBMTest AS p

    WHERE p.SomeInt = o.SomeInt

    )

    |--Parallelism(Gather Streams)

    |--Hash Match(Left Semi Join, HASH: ([o].[SomeInt])=([p].[SomeInt]),

    RESIDUAL: ([ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]

    =[tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

    |--Bitmap(HASH: ([o].[SomeInt]), DEFINE: ([Bitmap1004]))

    | |--Parallelism(Repartition Streams, Hash Partitioning,

    PARTITION COLUMNS: ([o].[SomeInt]))

    | |--Index Scan(OBJECT: ([tempdb].[dbo].[#MyInTable] AS [o]))

    |--Parallelism(Repartition Streams, Hash Partitioning,

    PARTITION COLUMNS: ([p].[SomeInt]), WHERE: (PROBE([Bitmap1004])=TRUE))

    |--Clustered Index Scan

    (OBJECT: ([ISS_V8_CS].[dbo].[JBMTest].[PK__JBMTest__7EAD8B99] AS [p]))

    So what I'm seeing is that a hash match is being run on the WHERE EXISTS version, and a nested loop is being used for the IN version. Because I've not set MAXDOP hint to 0 in the WHERE EXISTS version, this is trying to setup parallelism. I'm not really clear what the Bitmap operator is doing, though.

    My understanding of things is that high degrees of parallelism is in general a good thing for data warehouses, but not necessarily something you want to have occur in a heavy use OLTP situation. However, I also thought that a hash match is probably a better idea for large amounts of data, but then again the nested join doesn't block where the hash match does.

    So I guess it's all a massive trade off, and depends on your environment. For a heavy use OLTP system, I think I'd use the IN statement for the semi join, or alternatively if it was a large amount of data (as in the case of 100,000 records in the outer query) I'd use the WHERE EXISTS version but with a restricted MAXDOP hint.

    What do people thinK?

    Random Technical Stuff[/url]

  • OK, now I'm slightly confused. It actually looks like the TOP 1 in the IN statement does make a difference!

    Check out the execution plan:

    SELECT o.SomeInt

    FROM #MyInTable AS o

    WHERE o.SomeInt in

    (

    SELECT p.SomeInt FROM JBMTest AS p

    WHERE p.SomeInt = o.SomeInt

    )

    |--Parallelism(Gather Streams)

    |--Hash Match(Left Semi Join, HASH: ([o].[SomeInt])=([p].[SomeInt]),

    RESIDUAL: ([ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]=

    [tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]

    AND [ISS_V8_CS].[dbo].[JBMTest].[SomeInt] as [p].[SomeInt]=

    [tempdb].[dbo].[#MyInTable].[SomeInt] as [o].[SomeInt]))

    |--Bitmap(HASH: ([o].[SomeInt]), DEFINE: ([Bitmap1004]))

    | |--Parallelism(Repartition Streams, Hash Partitioning,

    PARTITION COLUMNS: ([o].[SomeInt]))

    | |--Index Scan(OBJECT: ([tempdb].[dbo].[#MyInTable] AS [o]))

    |--Parallelism(Repartition Streams, Hash Partitioning,

    PARTITION COLUMNS: ([p].[SomeInt]), WHERE: (PROBE([Bitmap1004])=TRUE))

    |--Clustered Index Scan(

    OBJECT: ([ISS_V8_CS].[dbo].[JBMTest].[PK__JBMTest__7EAD8B99] AS [p]))

    This exhibits pretty much the same behaviour now as the WHERE EXISTS query!

    I would love to know why this occurs.

    Random Technical Stuff[/url]

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply