Subquery and Join (Speed)

  • most time, if i want to retrieve a value from other i always use subquery, but if i want many value(more than two column) from other column i use join. but i don't know deeply about how different between both method, especially the speed (while the table become larger and larger).

    what is the correct way should i use?

    i know both type have their own advantage, what?

    thank for supporting idea and documents,

  • There's virtually no difference between a JOIN and an IN (subquery)... they will both create the same execution plan and execute within milliseconds of each other... unless there's a correlation in the sub-query... then, it depends on whether you've used an equality in the correlation or not. If you've used an inequality based on something temporal (IDENTITY or DATETIME column, etc), then you could end up with a "Triangular Join"... a form of RBAR on steroids.

    Please see the following article...

    http://www.sqlservercentral.com/articles/T-SQL/61539/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I read somewhere (i think it was AskTom) but depending on how you wanted your data or the business logic (the way you write your query if using a subquery or join) your results may vary...

    I'm not the expert, but I just remember reading it and seeing the example, so I thought i would share that, not hoping to confuese...lol

    John

  • jsteinbeck (10/31/2008)


    I read somewhere (i think it was AskTom) but depending on how you wanted your data or the business logic (the way you write your query if using a subquery or join) your results may vary...

    I'm not the expert, but I just remember reading it and seeing the example, so I thought i would share that, not hoping to confuese...lol

    John

    First, "Ask Tom" is about Oracle, not SQL Server... two different engines... two different worlds.

    INNER JOIN and WHERE IN are exactly the same in SQL Server... there's a ton of posts on these forums that prove it. Some have shown that certain uses (not many) of WHERE EXISTS can be a bit faster, but normally the rest of the code is such crap, folks should concentrate more on writing the other parts of the query correctly. Correllated SUBqueries that are based on equalities will also normally resolve to an INNER JOIN... sometimes they resolve to something as bad or far worse than a cursor.

    Of course, having the wrong criteria in an INNER JOIN can also be far worse than a cursor...

    ... and if you use a cursor to process RBAR instead of using it to control the processing of sets of information, then you're doing something wrong. Chances are, if you're using a CLR (with very few exceptions like RegExReplace), you're also doing something wrong.

    Ok... duck! Here come the zealots that will insist you can use a cursor or CLR if you're pressed for time or don't know T-SQL well enough to do it the right way. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • if i use correlated subquery, and join to two table.

    what way should i choose for optimal performance when there large number of rows.

    For Example

    --************Retrieve a value from any one-table *****************************

    --JOIN

    select col.* , obj.Name

    from ( syscolumns as col inner join sysobjects as obj

    on col.id=obj.id)

    order by obj.Name

    --Correlated SubQuery

    select col.* , (Select obj.Name from sysobjects as obj where obj.id=col.id) as XXName

    from syscolumns as col

    Order by xxName

    --************Retrieve a value from any one-table *****************************

    --JOIN

    select col.*

    , obj.Name

    ,type.Name as TypeName

    from ( syscolumns as col inner join sysobjects as obj on col.id=obj.id)

    inner join systypes as type on col.xtype=type.xtype and type.xtype <>321

    order by obj.Name

    --SubQuery

    select

    col.*

    ,(Select obj.Name from sysobjects as obj where obj.id=col.id) as XXName

    ,(select type.Name from systypes as type where type.xtype=col.xtype and type.xtype<>231) as TypeName

    from syscolumns as col

    Order by xxName

    thank for your reply

  • ry.rith (10/31/2008)


    if i use correlated subquery, and join to two table.

    what way should i choose for optimal performance when there large number of rows.

    For Example ...

    For the type of queries you wrote, there's no question that the INNER JOIN method would be the one that I would use. But, just to demonstrate, let's build a larger test table than the 4000-5000 rows normally available in the Master database of SQL Server 2000 and see... 100,000 rows should do it...

    --=============================================================================

    -- Setup to test on more data than available in SysColumns

    --=============================================================================

    --===== Identify the database to use (NOTHING will be written here)

    USE Master

    --===== Create a larger test table than SysColumns -- 22 Seconds

    SELECT TOP 100000 sc1.*

    INTO #BigTable

    FROM dbo.SysColumns sc1

    CROSS JOIN dbo.SysColumns sc2

    --===== Give it the best index we can

    CREATE CLUSTERED INDEX CX_BigTable_ID ON #BigTable (ID) -- 4 seconds

    ... and, now... your first set of queries... run them (in the same window as above) with and without the ORDER BY and study both the Estimated and Actual Execution Plans... IGNORE THE % OF BATCH because it lies, really bad sometimes...

    --=============================================================================

    -- These are the 1st set of tests like what you had in your example code

    --=============================================================================

    --===== Begin measuring durations

    SET STATISTICS TIME ON

    --************Retrieve a value from any one-table *****************************

    --===== JOIN

    SELECT col.*,

    obj.Name

    FROM #BigTable col

    INNER JOIN dbo.SysObjects obj ON col.ID = obj.ID

    ORDER BY obj.Name

    PRINT REPLICATE('=',80)

    --===== Correlated SubQuery

    SELECT col.*,

    (SELECT obj.Name FROM dbo.SysObjects obj WHERE obj.ID = col.ID) AS XXName

    FROM #BigTable col

    ORDER BY xxName

    --===== Turn off duration measurements

    SET STATISTICS TIME OFF

    Here's the performance output I get on my box (SQL Server 2000 Developer's Edition). Clearly, the INNER JOIN method wins especially where precious CPU time is involved...

    [font="Courier New"]SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (100000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 891 ms, elapsed time = 5684 ms.

    ================================================================================

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 13 ms.

    (100000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 2000 ms, elapsed time = 6250 ms.[/font]

    Now... let's do something similar with your second set of queries... are you in for a surprise!

    --=============================================================================

    -- These are the 2nd set of tests like what you had in your example code

    --=============================================================================

    --===== Begin measuring durations

    SET STATISTICS TIME ON

    --************Retrieve a value from any one-table *****************************

    --===== JOIN

    SELECT col.*,

    obj.Name,

    type.Name AS TypeName

    FROM #BigTable col

    INNER JOIN dbo.SysObjects obj ON col.ID = obj.ID

    INNER JOIN dbo.SysTypes type ON col.XType = type.XType AND type.XType <> 231

    ORDER BY obj.Name

    PRINT REPLICATE('=',80)

    --===== JOIN with proper use of criteria

    SELECT col.*,

    obj.Name,

    type.Name AS TypeName

    FROM #BigTable col

    INNER JOIN dbo.SysObjects obj ON col.ID = obj.ID

    INNER JOIN dbo.SysTypes type ON col.XType = type.XType

    WHERE type.XType <> 231

    ORDER BY obj.Name

    PRINT REPLICATE('=',80)

    --===== Correlated SubQuery (returns a different number of rows than either of the above)

    SELECT col.*,

    (SELECT obj.Name FROM dbo.SysObjects obj WHERE obj.ID = col.ID) AS XXName,

    (SELECT type.Name FROM dbo.SysTypes type WHERE type.XType = col.XType AND type.XType <> 231) AS TypeName

    FROM #BigTable col

    ORDER BY xxName

    --===== Turn off duration measurements

    SET STATISTICS TIME OFF

    First, notice that I added an extra code snippet to put the filter criteria where it belongs. As you can see below, it DOES make a difference in performance...

    [font="Courier New"]

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (95044 row(s) affected)

    SQL Server Execution Times:

    CPU time = 2828 ms, elapsed time = 7660 ms.

    ================================================================================

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (95044 row(s) affected)

    SQL Server Execution Times:

    CPU time = 2672 ms, elapsed time = 7343 ms.

    ================================================================================

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (100000 row(s) affected)

    SQL Server Execution Times:

    CPU time = 2094 ms, elapsed time = 6986 ms.

    [/font]

    Hmmm.... also notice that you're query returns ALL the rows instead of being filtered... that's a common mistake when folks use sub-queries... the criteria in a correlated subquery in the Select list has NO effect on the number of rows returned.

    Now, for the biggest "surprise" of them all... look at the output of the 3 queries above... see anything wrong? Anything horribly wrong? Heh... [font="Arial Black"]the 3rd output is NOT sorted correctly even though there is a very obvious ORDER BY in the code![/font] That's because the optimizer processes by COLUMN... not by ROW and it resolves the RBAR join between #BigTable and SysObjects first and sorts it and then finally resovles the RBAR join between that result and SysTypes which, of course, results in the wrong sort order. That's also why it appears that the correlated sub-query is faster... the sort isn't working quite as hard.

    The bottom line here is that having correlated sub-queries in the Select list are a form of RBAR and databases just weren't meant to use RBAR. Because of that, correlated sub-queries can sometimes create unexpected results and, in other cases, can be a lot slower than good set-based joins.

    My recommendation would be this... with maybe 1 or 2 very special case exceptions, don't use correlated sub-queries in the Select list... and since a correlated sub-query with equalities as criteria in the WHERE clause resolve to an INNER JOIN, there almost never a reason to use them in the WHERE clause either. The exception to that might be that WHERE NOT EXISTS and WHERE NOT IN resolve a bit faster than a LEFT OUTER JOIN with an ISNULL criteria to do the same thing.

    And, don't try to justify the use of correlated sub-queries based on row counts. Estimated usage can change drastically and, even if they don't, some poor slob will copy your code and use it on something with much larger row counts.

    Correlated sub-queries ARE a form a RBAR and should be avoided at all costs with only 1 or 2 exceptions that I've ever done. The exception that I've used is finding "gaps" in serial numbers (heh... don't ask ;)).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, are you putting your articles in thread posts again? 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Heh... it would appear so... maybe this one would make a good article. Thanks, Barry.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    Why can't you publish it as an article ? So that it will be useful to lot of folks like me.

    karthik

  • Hi Jeff, thanks for the excellent explanation, much appreciated. I prefer replies like these compared to the - ya perhaps, maybe, try this, depends.

    Regards

    Kevin

    sonyt65@yahoo.com

  • clive (11/3/2008)


    Hi Jeff, thanks for the excellent explanation, much appreciated. I prefer replies like these compared to the - ya perhaps, maybe, try this, depends.

    Regards

    Kevin

    sonyt65@yahoo.com

    When was the last time when yoursel gave such explanation to somebody else?

    😉

    _____________
    Code for TallyGenerator

  • I appreciate the comments guys. Thank you very much. Perhaps I will write it up as an article with a couple of more "gotcha" examples... Thanks again.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • In case of sub query sql server affects in speed so as less as possible use this best use is join

    Ranjan

    Developer in http://www.Hamarashehar.com

  • Please note: 3 year old thread.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply