Distinct going slower than *

  • ......New problem query posted on page 4.....

    Link -> http://www.sqlservercentral.com/Forums/FindPost669302.aspx

    Thanks in advance for any help you can give.

  • 1. What happens when you compare like with like?

    -- Keeping dupes:

    SELECT t.EMP_ID, t.EMPLOYEE_NAME

    FROM dbo.IREP_V_DAILY_COUNTS_TERRITORY t

    WHERE t.EMP_ID IS NOT NULL

    AND t.WORK_DATE BETWEEN '01/01/2009' AND '01/31/2009'

    ORDER BY t.EMPLOYEE_NAME

    -- Using DISTINCT

    SELECT DISTINCT t.EMP_ID, t.EMPLOYEE_NAME

    FROM dbo.IREP_V_DAILY_COUNTS_TERRITORY t

    WHERE t.EMP_ID IS NOT NULL

    AND t.WORK_DATE BETWEEN '01/01/2009' AND '01/31/2009'

    ORDER BY t.EMPLOYEE_NAME

    -- Using GROUP BY

    SELECT t.EMP_ID, t.EMPLOYEE_NAME, count(*)

    FROM dbo.IREP_V_DAILY_COUNTS_TERRITORY t

    WHERE t.EMP_ID IS NOT NULL

    AND t.WORK_DATE BETWEEN '01/01/2009' AND '01/31/2009'

    GROUP BY t.EMP_ID, t.EMPLOYEE_NAME

    ORDER BY t.EMPLOYEE_NAME

    2. Is IREP_V_DAILY_COUNTS_TERRITORY a view or a table?

    Edit: removed DISTINCT from first query.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Putts (2/19/2009)


    The first one will complete in 3 seconds. The second one will take 1min 17sec.

    No surprise. Distinct means "Sort (or hash) the entire result set, eliminate duplicate rows and return the unique rows"

    I do a distinct on the EMP_ID column.

    Distinct does not apply to a column. Distinct applies to a result set. When you put that in, you're asking for the entire rows to be checked for uniqueness.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • If you look at the execution plan for the DISTINCT, you're getting a huge load because of the DISTINCT operation itself. It's joining ALL the data between the three tables using two hash joins and then doing a Hash Distinct to clear out the duplicate data. The DISTINCT clause is usually a crutch used to hide bad structure or bad code. You're almost always better off not using the DISTINCT clause. Check your data and see if you're pulling it back correctly or if you need more parameters to get the right data set.

    Also, if you look at the XML of the execution plan. You've got a missing index alert that might help a bit.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • To address the concerns on why I'm using a join .....

    The view is a source to a report. The application works in this manner....

    1. User selects report type

    2. User enters some preliminary requirements (typically date range)

    3. User then selects remaining requirements.

    Where this query is coming into play is at #3.

    I have to display back only legitimate options. So, if the person selected January for the date range then I need to show options that are pertinent to a January report. For this, I can't just use the report that stores the employees .... I need to hit against the source view for the report so that I know that the options I'm listing are going to be valid.

    What I'm not understanding is how to track down where the problem is coming from. I understand that a DISTINCT does more work but I've never had one run longer than the original query.

    I just saw the suggestion for looking at the SQL Plan in XML mode and the missing index so I'll try that right now.

  • Putts (2/20/2009)


    To address the concerns on why I'm using a join .....

    The view is a source to a report. The application works in this manner....

    1. User selects report type

    2. User enters some preliminary requirements (typically date range)

    3. User then selects remaining requirements.

    Where this query is coming into play is at #3.

    I have to display back only legitimate options. So, if the person selected January for the date range then I need to show options that are pertinent to a January report. For this, I can't just use the report that stores the employees .... I need to hit against the source view for the report so that I know that the options I'm listing are going to be valid.

    What I'm not understanding is how to track down where the problem is coming from. I understand that a DISTINCT does more work but I've never had one run longer than the original query

    Then that's good luck on your part. I've seen DISTINCT simply destroy performance. If you look at your two execution plans, you can see it at work. It's moving every single bit of data out of the three tables, combining it together and then finding the unique values through the Hash Distinct operation. So you're not getting just the rows that match your criteria. It's processing everything up front and then filtering down to the data you need.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (2/20/2009)

    Then that's good luck on your part. I've seen DISTINCT simply destroy performance. If you look at your two execution plans, you can see it at work. It's moving every single bit of data out of the three tables, combining it together and then finding the unique values through the Hash Distinct operation. So you're not getting just the rows that match your criteria. It's processing everything up front and then filtering down to the data you need.

    So, if I understand you correctly on this ......

    My view, with no WHERE clause, will return millions of records. With the WHERE clause being used it's narrowed down to a few hundred which is what the result set for the SELECT is.

    What you're saying is that the SELECT DISTINCT will first process the possible millions and then apply the WHERE? .... because it wants to find the unique values first and then see which ones are valid for the WHERE?

    That just seems crazy. I've done distincts on resultsets of millions of rows and had the time reduced immensely so why is it with this one query it's going in the total opposite direction?

    I have to believe there's a better solution for this than "don't use Distincts" :/

  • Putts (2/20/2009)


    So, if I understand you correctly on this ......

    My view, with no WHERE clause, will return millions of records. With the WHERE clause being used it's narrowed down to a few hundred which is what the result set for the SELECT is.

    What you're saying is that the SELECT DISTINCT will first process the possible millions and then apply the WHERE? .... because it wants to find the unique values first and then see which ones are valid for the WHERE?

    That just seems crazy. I've done distincts on resultsets of millions of rows and had the time reduced immensely so why is it with this one query it's going in the total opposite direction?

    I have to believe there's a better solution for this than "don't use Distincts" :/

    Really? DISTINCT has consistently improved your performance? That's incredibly unique. Search through old posts. It's got a well-deserved bad reputation.

    I'm just saying what I see the execution plan doing. That's how it's choosing to process the data.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (2/20/2009)

    Really? DISTINCT has consistently improved your performance? That's incredibly unique. Search through old posts. It's got a well-deserved bad reputation.

    I'm just saying what I see the execution plan doing. That's how it's choosing to process the data.

    Perhaps I have always just run into the perfect scenarios for using distinct and now have crawled into others where it is not as useful.

    Are there certain cases where distinct have always shown to increase performance?

    Asking around fellow developers where I work - none of them have ever seen Distincts take longer than natural SELECTs so maybe this is something that only affects more advanced T-SQL queries where you folks see it on a daily basis and the regular "joe schmoe" SQL query writer does not.

  • Putts (2/20/2009)


    I've done distincts on resultsets of millions of rows and had the time reduced immensely so why is it with this one query it's going in the total opposite direction?

    Post a reproduction please? It'll be the first case I've ever seen of that.

    Adding distinct to a query adds work. It requires that SQL sort or hash the resultset (at the last step) and eliminate duplicates.

    There is no way that adding a sort can make the query faster. If there's already a sort, then the distinct won't add any additional time.

    The only possible reason I can think of has to do with network and display. If the distinct removes a lot of rows then the resultset will be transmitted faster (because there's less rows) and displayed faster (because there's less to display)

    It cannot reduce the processing time.

    I have to believe there's a better solution for this than "don't use Distincts" :/

    Don't use distinct unless really, really necessary and, if you do have to, accept the performance hit.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Putts (2/20/2009)


    Are there certain cases where distinct have always shown to increase performance?

    None. It will either have no effect (if there's already an order by) or reduce performance (in all other cases)

    Asking around fellow developers where I work - none of them have ever seen Distincts take longer than natural SELECTs

    Do they typically performance test their queries? Do they they work with a couple hundred rows or a few million? Do they always put an ORDER BY on their queries?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (2/20/2009)

    Post a reproduction please? It'll be the first case I've ever seen of that.

    Adding distinct to a query adds work. It requires that SQL sort or hash the resultset (at the last step) and eliminate duplicates.

    There is no way that adding a sort can make the query faster. If there's already a sort, then the distinct won't add any additional time.

    The only possible reason I can think of has to do with network and display. If the distinct removes a lot of rows then the resultset will be transmitted faster (because there's less rows) and displayed faster (because there's less to display)

    It cannot reduce the processing time.

    Well, without using much imagination I just grabbed a large table I use quite a bit .... did a SELECT * of the whole thing and then a SELECT DISTINCT on one of the columns .... which I think doesn't even have an index on it.

    SELECT * completes in about a minute and returns 2.9 million rows

    SELECT DISTINCT completes in 0 seconds and returns 36 rows.

    Now, you have to understand that this query is not living within a vacuum on the server and is not dependent upon returning the results. The return of the results is the key element of the query as I have to use the results in my application.

    Attached are both Exec plans for the SELECT * vs SELECT DISTINCT.

  • Putts (2/20/2009)


    Perhaps I have always just run into the perfect scenarios for using distinct and now have crawled into others where it is not as useful.

    Are there certain cases where distinct have always shown to increase performance?

    Asking around fellow developers where I work - none of them have ever seen Distincts take longer than natural SELECTs so maybe this is something that only affects more advanced T-SQL queries where you folks see it on a daily basis and the regular "joe schmoe" SQL query writer does not.

    I think Gail might have it, it's moving fewer rows, so that's saving processing time. Let's take an example where identical number of rows are returned. You can run both these on AdventureWorks

    SELECT sod.SalesOrderID,sod.ProductID FROM

    Sales.SalesOrderDetail AS sod

    WHERE sod.SalesOrderId = 43666

    SELECT DISTINCT sod.SalesOrderID,sod.ProductID FROM

    Sales.SalesOrderDetail AS sod

    WHERE sod.SalesOrderID = 43666

    And if you look at the STATISTICS IO & TIME:

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 0 ms.

    (6 row(s) affected)

    Table 'SalesOrderDetail'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    (6 row(s) affected)

    Table 'SalesOrderDetail'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 15 ms.

    SQL Server parse and compile time:

    And I uploaded a screen shot of the execution plans (sorry, being a bit lazy). You can see that while the exact same data was returned, more work was done by the second query. The estimated costs were 18% & 82% of the over all.

    DISTINCT causes performance issues. It just does.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Putts (2/20/2009)


    GilaMonster (2/20/2009)

    Post a reproduction please? It'll be the first case I've ever seen of that.

    Adding distinct to a query adds work. It requires that SQL sort or hash the resultset (at the last step) and eliminate duplicates.

    There is no way that adding a sort can make the query faster. If there's already a sort, then the distinct won't add any additional time.

    The only possible reason I can think of has to do with network and display. If the distinct removes a lot of rows then the resultset will be transmitted faster (because there's less rows) and displayed faster (because there's less to display)

    It cannot reduce the processing time.

    Well, without using much imagination I just grabbed a large table I use quite a bit .... did a SELECT * of the whole thing and then a SELECT DISTINCT on one of the columns .... which I think doesn't even have an index on it.

    SELECT * completes in about a minute and returns 2.9 million rows

    SELECT DISTINCT completes in 0 seconds and returns 36 rows.

    Now, you have to understand that this query is not living within a vacuum on the server and is not dependent upon returning the results. The return of the results is the key element of the query as I have to use the results in my application.

    Attached are both Exec plans for the SELECT * vs SELECT DISTINCT.

    And how much of that "minute" is the data coming back from the server? Plus, in the first you are asking for all the data. Just for s & g's try doing a select distinct col1, col2,... from yourtable and compare that to the select * from yourtable, then you will be comparing similiar queries.

  • But that example compares one thing you shouldn't do, SELECT *, with another thing that has serious enough costs that it almost needs to be in the shouldn't do category, SELECT DISTINCT. Hardly a good comparison. Check out mine, where the number of rows are the same and a good index is available for either query to use.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

Viewing 15 posts - 1 through 15 (of 57 total)

You must be logged in to reply to this topic. Login to reply