Which is better

  • HI All,

    Which would you consider to be a better query:

    I would go with query 2 cause it uses a join rather than like in query one it uses a correlated subquery

    What are you thoughts.

    --QUERY 1

    Select Count(*) as ClientsWithoutOrder

    from Client A

    Where not exists (Select *

    From Order B

    Where B.ClientId = A.Id

    )

    --QUERY 2

    Select Count(*) as ClientsWithoutOrder

    from Client A

    LEFT JOIN Order B

    ON B.ClientId = A.Id

    WHERE B.ClientId IS NULL

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • In this case (and I tested out in adventureworks) the exists is better because by using that you're telling SQL that it doesn't actually need to get the rows from the Order, all you're interested in is the presence or absence of a match. SQL does what's called a semi-join (or, in this particular case, an anti-semi-join)

    Whereas, in the join, it has to fetch the data from orders and join it in and then filter out the unwanted ros afterwards.

    As always, running it is the proof. I can attach exec plans if anyone wants.

    Query 1 shows a scan on customers, a scan and stream aggregate on order details, then a merge join (left anti-semi-join) and another stream aggregate. QO puts that at 48% cost

    Query 2 shows scans of customers and sales order header, then a merge join (left outer join), a filter and finally a stream aggregate. QO puts that at 52% of the batch cost.

    USE Adventureworks

    GO

    Query 1

    Select Count(*) as ClientsWithoutOrder

    from Sales.Customer A

    Where not exists (Select *

    From Sales.SalesOrderHeader B

    Where B.CustomerID = A.CustomerID

    )

    --QUERY 2

    Select Count(*) as ClientsWithoutOrder

    from Sales.Customer A

    LEFT JOIN Sales.SalesOrderHeader B

    ON B.CustomerID = A.CustomerID

    WHERE B.CustomerID IS NULL

    Query 1:

    Table 'SalesOrderHeader'. Scan count 1, logical reads 45

    Table 'Customer'. Scan count 1, logical reads 123

    SQL Server Execution Times:

    CPU time = 16 ms, elapsed time = 14 ms.

    Query 2:

    Table 'SalesOrderHeader'. Scan count 1, logical reads 45

    Table 'Customer'. Scan count 1, logical reads 123

    SQL Server Execution Times:

    CPU time = 78 ms, elapsed time = 71 ms.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Gail,

    Now that is what I call taking the time to explain something with proof πŸ™‚

    I could see from the plan it was better but just wasn't sure why..

    thanks again that's an awesome reply πŸ™‚

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • Gail, did you notice in your testing if there was significant benefit in using

    Where not exists (Select 1

    instead of Where not exists (Select *

    in Query 1? I'm sure I've come across this recently, with SELECT 1 claiming top performance.

    A significant difference between the two statements is that Query 1 will always preserve cardinality of the Sales table, whereas Query 2 would typically require a GROUP BY CustomerID.

    Cheers

    ChrisM

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Glad Gail answered this one. I have always personally preferred option 2 because I find the LEFT JOIN syntax easier to read, but I have learned that the EXISTS/NOT EXISTS performs better for the reasons Gail explains (better than I ever could have).

    I believe as used in an EXISTS/NOT EXISTS that there is no performance difference between Select * and Select 1. I read it somewhere, but can't remember where. I believe this is because of what Gail said:

    the exists is better because by using that you're telling SQL that it doesn't actually need to get the rows from the Order, all you're interested in is the presence or absence of a match. SQL does what's called a semi-join (or, in this particular case, an anti-semi-join)

  • I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.

  • Chris Morris (12/1/2008)


    Gail, did you notice in your testing if there was significant benefit in using

    Where not exists (Select 1

    instead of Where not exists (Select *

    in Query 1? I'm sure I've come across this recently, with SELECT 1 claiming top performance.

    There's no performance difference. 1

    Exists doesn't return any of the columns and what column you specify is immaterial. I use 1 just to make it very clear to me and to anyone reading that there's no column access and the column (if any) mentioned is immaterial. It's a readability thing.

    (1). During query execution, the expansion of select * to the list of columns comes before the removal of unnecessary columns, like in the exists clause. Hence there may be a minuscule metadata overhead. Certainly not enough to worry about though.

    Ref: http://sqlskills.com/BLOGS/CONOR/post/EXISTS-Subqueries-SELECT-1-vs-SELECT-*.aspx

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for the explanation, Gail.

    I use 1 just to make it very clear to me and to anyone reading that there's no column access and the column (if any) mentioned is immaterial. It's a readability thing.

    I like this too. The little guy perches on your shoulder and nags you to make the code more readable as you go along. It doesn't take up any more time but boy does it make a difference.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I learned about using conventional joins by reading these articles.

    Joins Clarified By Sanket Naik http://www.sqlservercentral.com/articles/Miscellaneous/joinsclarified/1333/[/url]

    A Refresher on Joins By Jambu Krishnamurthy http://www.sqlservercentral.com/articles/Basic+Querying/2937/[/url]

    Hidden RBAR: Triangular Joins By Jeff Moden http://www.sqlservercentral.com/articles/T-SQL/61539/[/url]

    I found this article on how Oracle uses semi-joins and anti semi-joins.

    Speeding Up Queries with Semi-Joins and Anti-Joins by Roger Schrag http://www.dbspecialists.com/files/presentations/semijoins.html

    I am ready for the next step. Can anyone recommend SQL Server Central article(s) for learning about the differences/similarities of semi-joins, INTERSECT, EXISTS, anti-semi-joins, and EXCEPT? I do not know when to use each and don't know how only for (anti) semi-joins.

    Paul DB

  • You can't explicitly use the semi-joins. They're used by SQL when it processes an IN or an EXISTS. Both of those require that the contents be checked, but not joined, hence the term semi-join.

    INTERSECT and Except aren't joins. They're used the same way as union/union all, to operate on two result sets.

    Intersect has the rows in common between the two (it's a set intersection). Except gives all rows in one resultset that aren't in the second. (It's a set subtraction)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Steve Jones - Editor (12/1/2008)


    I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.

    While I know this is a matter of opinion, I found 1 to be easier to read. It shows that the subquery is there merely as a conditional and it makes it obvious that it returns no data. I have run into many times when modifying procedures written by others that knowing whether or not a join table returns data or is only used for a conditional (or in some cases, just got left there pointlessly when someone else modified it earlier) can matter a great deal.

    In 2, it is less obviously that the second table is just a conditional.

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

  • timothyawiseman (12/5/2008)


    Steve Jones - Editor (12/1/2008)


    I'm with Jack. #2 is easier to read, but I'd suspect that #1 works better.

    While I know this is a matter of opinion, I found 1 to be easier to read. It shows that the subquery is there merely as a conditional and it makes it obvious that it returns no data. I have run into many times when modifying procedures written by others that knowing whether or not a join table returns data or is only used for a conditional (or in some cases, just got left there pointlessly when someone else modified it earlier) can matter a great deal.

    In 2, it is less obviously that the second table is just a conditional.

    I see what you are saying, but I "grew up" using the Left Join and Where column is null so I see the exclusion immediately. I never really used EXISTS/NOT EXISTS until the last year or so, thus I am just getting used to it.

  • Christopher Stobbs (12/1/2008)


    HI All,

    Which would you consider to be a better query:

    I would go with query 2 cause it uses a join rather than like in query one it uses a correlated subquery

    What are you thoughts.

    --QUERY 1

    Select Count(*) as ClientsWithoutOrder

    from Client A

    Where not exists (Select *

    From Order B

    Where B.ClientId = A.Id

    )

    --QUERY 2

    Select Count(*) as ClientsWithoutOrder

    from Client A

    LEFT JOIN Order B

    ON B.ClientId = A.Id

    WHERE B.ClientId IS NULL

    Alright... how'd ya do it? How'd you get that right parenthesis to hang out in mid air like that without it turning into a smiley face?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Actually, I'll take door #3 πŸ˜› ... performance is the same as WHERE EXISTS and it's even easier to read especially for newbies that may be on the team...

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    --===== QUERY 1

    SELECT COUNT(*) AS ClientsWithoutOrder

    FROM #Client c

    WHERE NOT EXISTS (SELECT *

    FROM #Order o

    WHERE c.Id = o.ClientId)

    --===== QUERY 2

    SELECT COUNT(*) AS ClientsWithoutOrder

    FROM #Client c

    LEFT OUTER JOIN #Order o

    ON c.Id = o.ClientId

    WHERE o.ClientId IS NULL

    --===== QUERY 3

    SELECT COUNT(*) AS ClientsWithoutOrder

    FROM #Client c

    WHERE c.ID NOT IN (SELECT b.ClientID

    FROM #Order b)

    SET STATISTICS TIME OFF

    SET STATISTICS IO OFF

    Heh... Yes, of course I tested it. You can, too! Here're the data generators...

    --===== Create and populate a 1,000,000 row ORDER test table.

    -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

    -- Column "ClientID" has a range of 1 to 50,000 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    -- Jeff Moden

    SELECT TOP 1000000

    RowNum = IDENTITY(INT,1,1),

    ClientID = ABS(CHECKSUM(NEWID()))%50000+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO #Order

    FROM Master.dbo.SysColumns t1

    CROSS JOIN Master.dbo.SysColumns t2

    --===== A table is not properly formed unless a Primary Key has been assigned

    ALTER TABLE #Order

    ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Add an index on ClientID

    CREATE INDEX IX_#Order_ClientID

    ON #Order (ClientID)

    --===== Ensure that 1000 clients have no entries

    DELETE #Order

    WHERE ClientID % 50 = 0

    --===== Create a CLIENT test table containing every possible client

    SELECT TOP 50000

    IDENTITY(INT,1,1) AS ID

    INTO #Client

    FROM Master.dbo.SysColumns t1

    CROSS JOIN Master.dbo.SysColumns t2

    --===== Add the primary key

    ALTER TABLE #Client

    ADD PRIMARY KEY CLUSTERED (ID)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry... forgot to post the results I got... single P4 1.8Ghz, 1GB RAM, 80GB IDE hard drive, SQL Server 2005 Dev Edition SP2...

    (1 row(s) affected)

    Table '#Client_____________________________________________________________________________________________________________00000000006E'.

    Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Order______________________________________________________________________________________________________________00000000006D'.

    Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 563 ms, elapsed time = 602 ms.

    ================================================================================

    (1 row(s) affected)

    Table '#Order______________________________________________________________________________________________________________00000000006D'.

    Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Client_____________________________________________________________________________________________________________00000000006E'.

    Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 641 ms, elapsed time = 669 ms.

    ================================================================================

    (1 row(s) affected)

    Table '#Client_____________________________________________________________________________________________________________00000000006E'.

    Scan count 1, logical reads 83, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Order______________________________________________________________________________________________________________00000000006D'.

    Scan count 1, logical reads 1738, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 562 ms, elapsed time = 601 ms.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply