Distinct query with all columns

  • Hi Guys

I am looking at the query below, which is hitting our database quite a lot. It takes about 25 seconds to return around 1,500 records.

    SELECT DISTINCT [ExecutionId]
        ,[ServerId]
        ,UPPER([ApiResult]) AS ApiResult
        ,[StartTimeId]
        ,[EndTimeId]
        ,UPPER([FlowName]) AS FlowName
        ,UPPER([Initiator]) AS Initiator
        ,UPPER([ApiName]) AS ApiName
        ,UPPER([MethodName]) AS MethodName
        ,[Duration]
        ,[OrganisationalUnitId]
    FROM [TempAutomatorApiPerformance]

    (1462 row(s) affected)

    Table 'TempAutomatorApiPerformance'. Scan count 5, logical reads 72751, physical reads 0, read-ahead reads 188, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    (1 row(s) affected)

    SQL Server Execution Times:

    CPU time = 49518 ms, elapsed time = 25497 ms.

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 0 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    If I remove the MethodName column from the SELECT, it returns in 4 seconds. I have already spoken to the app guys about not retrieving all columns and about adding a WHERE clause... the answer was no.
    Is there anything we can do to speed this up? I appreciate your help.

    Thanks.

    CREATE TABLE [dbo].[TempAutomatorApiPerformance](
        [ExecutionId] [uniqueidentifier] NOT NULL,
        [ServerId] [smallint] NULL,
        [ApiResult] [nvarchar](50) NULL,
        [StartTimeId] [datetime2](0) NOT NULL,
        [EndTimeId] [datetime2](0) NOT NULL,
        [FlowName] [nvarchar](300) NOT NULL,
        [Initiator] [nvarchar](500) NOT NULL,
        [ApiName] [nvarchar](300) NOT NULL,
        [MethodName] [nvarchar](300) NOT NULL,
        [Duration] [decimal](11, 3) NULL,
        [OrganisationalUnitId] [int] NULL
    ) ON [PRIMARY]

  • As-is, there's little to nothing you can do. The query has no WHERE clause, so no index will help. The DISTINCT forces a sort (or hash table), and that will be slow.

    72k reads is a little odd for 1462 rows. If you remove the DISTINCT, how many rows are returned?
    If it's a lot more than 2000, is it possible to remove the duplicates from the table so that you don't have to query them and throw them away?
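
    A quick way to answer that, sketched against the table from the original post, is to compare the raw row count with the distinct row count:

    -- Sketch: how many rows are there, and how many survive the DISTINCT?
    SELECT COUNT(*) AS TotalRows
    FROM dbo.TempAutomatorApiPerformance;

    SELECT COUNT(*) AS DistinctRows
    FROM (SELECT DISTINCT ExecutionId, ServerId, ApiResult, StartTimeId, EndTimeId,
                 FlowName, Initiator, ApiName, MethodName, Duration, OrganisationalUnitId
          FROM dbo.TempAutomatorApiPerformance) AS d;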

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Dedupe the table.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • GilaMonster - Wednesday, July 19, 2017 10:22 AM

    As-is, there's little to nothing you can do. The query has no WHERE clause, so no index will help. The DISTINCT forces a sort (or hash table), and that will be slow.

    72k reads is a little odd for 1462 rows. If you remove the DISTINCT, how many rows are returned?
    If it's a lot more than 2000, is it possible to remove the duplicates from the table so that you don't have to query them and throw them away?

    Gail, it's 1462 rows duped out to 3,179,754 rows - and it's a heap, too.


  • Looking at the execution plan, the slowness appears to be caused by the sort, which is due to the DISTINCT.

    It looks like you have a LOT of duplicate data in that system. You have 3 million rows, which the DISTINCT collapses down to just 1462.
    Do you require the duplicate rows? If not, removing them will improve performance, because you can then drop the DISTINCT (which is the slowest part of the operation).
    If you do require the duplicate rows in the table, you should be able to get a performance increase by adding an index to the table.

    But having 3 million rows in the table and only 1462 distinct values is a lot of wasted space.

    EDIT - never mind on the index. Gail indicated that an index won't help, and she knows this stuff better than I do.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • If you're using a case-insensitive collation, you can try taking out the UPPER() calls and telling the app guy to deal with formatting on his end.
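
    A quick way to check (a sketch; adjust the table name if yours differs) is to look at the database default collation and the per-column collations - a collation name containing "CI" is case-insensitive:

    -- Database default collation
    SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;

    -- Per-column collations for the string columns
    SELECT name, collation_name
    FROM sys.columns
    WHERE object_id = OBJECT_ID('dbo.TempAutomatorApiPerformance')
      AND collation_name IS NOT NULL;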

  • Thanks guys. We have already raised the concern about the duplicate values with the developers and the app team, along with the lack of a WHERE clause and the selecting of all columns.

    If there's nothing else we can do, they will just have to think about this whole thing again.

  • Yup, de-dupe the table.

    Gail Shaw
  • Taking the UPPER out only improves it by a second or two.

  • Is there an additional column (or columns) that might make the duplicate rows unique?
    If so, you could normalize the table to reduce the duplication of data.
    I hope the 'Temp' in the table name doesn't mean the table was meant to be temporary.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Not to be mean, but if you can't adjust the query and you can't de-duplicate the data, what do the developers think you can do?

    The only other option I can think of would be to move the data to a data-warehouse-style copy, even if it lives in the same DB.  Just have a table that contains the distinct values from that table.  Put an insert trigger on the table and have it insert into the de-duplicated version with each insert, if the row doesn't already exist there.
    This will slow down your inserts a bit, as each insert may need to happen twice, and you will end up with duplicate data because you will have two tables holding it.  But you get the performance benefit of not needing the DISTINCT on your select, as you would just select from the de-duplicated table.
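
    A minimal sketch of that trigger, assuming a de-duplicated copy named dbo.ApiPerformance_Distinct (a hypothetical name) and assuming, for brevity, that ExecutionId alone identifies a duplicate - in this thread the duplicate key is really the whole row, so widen the match accordingly:

    -- Sketch: keep a de-duplicated shadow copy in sync on every insert.
    -- dbo.ApiPerformance_Distinct and the ExecutionId-only match are assumptions.
    CREATE TRIGGER trg_TempAutomatorApiPerformance_Shadow
    ON dbo.TempAutomatorApiPerformance
    AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.ApiPerformance_Distinct
            (ExecutionId, ServerId, ApiResult, StartTimeId, EndTimeId,
             FlowName, Initiator, ApiName, MethodName, Duration, OrganisationalUnitId)
        SELECT DISTINCT i.ExecutionId, i.ServerId, i.ApiResult, i.StartTimeId, i.EndTimeId,
               i.FlowName, i.Initiator, i.ApiName, i.MethodName, i.Duration, i.OrganisationalUnitId
        FROM inserted i
        WHERE NOT EXISTS (SELECT 1
                            FROM dbo.ApiPerformance_Distinct d
                           WHERE d.ExecutionId = i.ExecutionId);
    END;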

    I would go to your developers and tell them what their options are:
    1 - use the application to handle de-duplication
    2 - remove duplicate data from the database
    3 - filter the results (i.e., add a WHERE clause)
    4 - create a new de-duplicated table based on the duplicated data

    I think option 4 is the worst of the above options as it results in even more data duplication, but it is an option.
    There are only so many things you can do to make things faster on the SQL side, and if the developers don't like any of your solutions, they may just have to live with the slowness.
    Depending on the code language, the way the application consumes the data, and how you can identify duplicates, you may even get a performance boost by de-duplicating it on the application side instead of on the SQL side.

    How long does the query take to run without the DISTINCT?

    If you are allowed to change the table structure, you could add a "duplicate count" column and then put an INSTEAD OF trigger on the table.  If it detects a duplicate value, it would increment the duplicate count instead of inserting.  I can't think of any system where rows that are exact duplicates of each other are beneficial, but if yours genuinely needs them for something, a "duplicate count" column would serve the same purpose and save you a LOT of disk space and data-retrieval time.
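
    Here's a rough sketch of that idea, again assuming ExecutionId alone defines a duplicate (an assumption - adjust the key to whatever the data owner agrees on):

    -- Sketch: collapse duplicates into a counter instead of storing them.
    -- The DuplicateCount column and the ExecutionId-only key are assumptions.
    ALTER TABLE dbo.TempAutomatorApiPerformance
        ADD DuplicateCount int NOT NULL DEFAULT 1;
    GO
    CREATE TRIGGER trg_TempAutomatorApiPerformance_Count
    ON dbo.TempAutomatorApiPerformance
    INSTEAD OF INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Keys that already exist: bump their counters by the number of incoming copies
        UPDATE t
           SET DuplicateCount = t.DuplicateCount + i.cnt
          FROM dbo.TempAutomatorApiPerformance t
          JOIN (SELECT ExecutionId, COUNT(*) AS cnt
                  FROM inserted
                 GROUP BY ExecutionId) i
            ON i.ExecutionId = t.ExecutionId;

        -- Genuinely new keys: insert one row each, seeding the counter
        INSERT INTO dbo.TempAutomatorApiPerformance
            (ExecutionId, ServerId, ApiResult, StartTimeId, EndTimeId,
             FlowName, Initiator, ApiName, MethodName, Duration,
             OrganisationalUnitId, DuplicateCount)
        SELECT ExecutionId, ServerId, ApiResult, StartTimeId, EndTimeId,
               FlowName, Initiator, ApiName, MethodName, Duration,
               OrganisationalUnitId, cnt
          FROM (SELECT *,
                       ROW_NUMBER() OVER (PARTITION BY ExecutionId ORDER BY (SELECT NULL)) AS rn,
                       COUNT(*)     OVER (PARTITION BY ExecutionId) AS cnt
                  FROM inserted) i
         WHERE i.rn = 1
           AND NOT EXISTS (SELECT 1
                             FROM dbo.TempAutomatorApiPerformance t
                            WHERE t.ExecutionId = i.ExecutionId);
    END;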


  • Like Chris and Gail have said, de-dupe the table.  Are you able to do so?  Are you interested in a performant way to do it?

    That's a pretty wide table to do a DISTINCT on - that's why your reads are going to be high.
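
    As a quick sanity check on those reads, a sketch using sys.dm_db_partition_stats will show how many pages the heap actually occupies - with 3 million wide rows, the ~72k logical reads line up with a full scan:

    -- Sketch: page and row counts for the heap
    SELECT row_count, in_row_data_page_count, used_page_count, reserved_page_count
    FROM sys.dm_db_partition_stats
    WHERE object_id = OBJECT_ID('dbo.TempAutomatorApiPerformance');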

  • Thanks, guys, for the advice and suggestions. I have raised it with the dev and app teams.

    bmg - without the DISTINCT, it takes 1 min and 32 secs.

    Ed - it would be good to know the performant way to do it.
    I normally just follow this method:
    https://support.microsoft.com/en-us/help/139444/how-to-remove-duplicate-rows-from-a-table-in-sql-server

  • With that level of duplication, and no foreign keys, the easiest approach is probably to SELECT DISTINCT * INTO <new table> FROM ..., then drop (or rename) the original table and rename the new one to take its place.
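
    Something like this sketch (the _Deduped and _Old names are illustrative):

    -- Sketch: rebuild the table from its distinct rows, then swap names
    SELECT DISTINCT *
    INTO dbo.TempAutomatorApiPerformance_Deduped
    FROM dbo.TempAutomatorApiPerformance;

    -- Keep the original until the copy is verified
    EXEC sp_rename 'dbo.TempAutomatorApiPerformance', 'TempAutomatorApiPerformance_Old';
    EXEC sp_rename 'dbo.TempAutomatorApiPerformance_Deduped', 'TempAutomatorApiPerformance';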

    Gail Shaw
  • Gail and bmg's option 4 is exactly right - you have so many duplicates that it would be much faster to do what she said.

    For next time, if you can't prevent the duplicates from getting into the table in the first place, here's an approach to de-dupe it before it gets this bad.  First, using your table definition, I'll create a test table to work with.  You don't want to do this with your real table.

    --Create the table for testing
    IF OBJECT_ID('dbo.TestDelete', 'U') IS NOT NULL DROP TABLE dbo.TestDelete;
    CREATE TABLE [dbo].[TestDelete](
      [ExecutionId] [uniqueidentifier] NOT NULL,
      [ServerId] [smallint] NULL,
      [ApiResult] [nvarchar](50) NULL,
      [StartTimeId] [datetime2](0) NOT NULL,
      [EndTimeId] [datetime2](0) NOT NULL,
      [FlowName] [nvarchar](300) NOT NULL,
      [Initiator] [nvarchar](500) NOT NULL,
      [ApiName] [nvarchar](300) NOT NULL,
      [MethodName] [nvarchar](300) NOT NULL,
      [Duration] [decimal](11, 3) NULL,
      [OrganisationalUnitId] [int] NULL);

    --Create some test data to use for testing - 100K rows ought to do it
    WITH cte AS (
    SELECT N
      FROM (VALUES(1), (2), (3), (4), (5), (6), (7), (8), (9), (10)) x (N)
    ),
    cteTally AS (
    SELECT N = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
      FROM cte t
      CROSS JOIN cte t2
      CROSS JOIN cte t3
      CROSS JOIN cte t4
      CROSS JOIN cte t5
    )
    INSERT INTO dbo.TestDelete(ExecutionID, ServerID, ApiResult, StartTimeID, EndTimeID,
      FlowName, Initiator, ApiName, MethodName, Duration, OrganisationalUnitID)
    SELECT ExecutionID, ServerID, ApiResult, StartTimeID, EndTimeID,
      FlowName, Initiator, ApiName, MethodName, Duration, OrganisationalUnitID
      FROM (VALUES(NEWID(), 1, 'Result', getdate(), getdate(),
          'Flow', 'Initiator', 'API Name', 'Method', 0, 14)
       ) x (ExecutionID, ServerID, ApiResult, StartTimeID, EndTimeID,
         FlowName, Initiator, ApiName, MethodName, Duration, OrganisationalUnitID)
      CROSS APPLY cteTally;
      
    --Update the table so that 10% of the ExecutionId values are the same
    WITH cteRows AS (
    SELECT RN = ROW_NUMBER() OVER(ORDER BY ExecutionId), ExecutionId
      FROM dbo.TestDelete
    )
    UPDATE cteRows
    SET ExecutionId = 'A2AB8B9C-E7FE-4FF1-B46E-4D60971993B7'
    WHERE RN % 10 = 0;

    --Now confirm that we actually have 10K duplicate values
    SELECT ExecutionId, COUNT(*)
    FROM dbo.TestDelete
    GROUP BY ExecutionId
    HAVING COUNT(ExecutionId) > 1;

    You've now confirmed that there are duplicates in the table.  Suppose that after you meet with the owner of the data, they decide they only want to de-dupe on ExecutionId, keeping the one row with the most recent StartTimeId.  You can use the ROW_NUMBER windowing function to assign a row number within each group of identical ExecutionId values; each group is called a partition.  To de-dupe, simply delete the rows where the row number is greater than 1.

    --De-dupe the table based on ExecutionId only, keeping the most recent StartTimeId
    WITH cte AS (
    SELECT ExecutionId, RN = ROW_NUMBER() OVER(PARTITION BY ExecutionID ORDER BY StartTimeId DESC)
      FROM dbo.TestDelete
    )
    DELETE FROM cte
    WHERE RN > 1;

    To see which ones would be deleted without deleting them, simply change the DELETE FROM to a SELECT * to query them.
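
    For example, widening the CTE to all columns so the candidate rows are visible (a sketch):

    --Preview the rows the DELETE above would remove
    WITH cte AS (
    SELECT *, RN = ROW_NUMBER() OVER(PARTITION BY ExecutionID ORDER BY StartTimeId DESC)
      FROM dbo.TestDelete
    )
    SELECT *
    FROM cte
    WHERE RN > 1;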

    This approach is also very flexible and doesn't rely on concatenation of values with different data types.  If, for example, they decide they want to de-dupe by ExecutionId and ServerId and ApiResult and keep the one with the oldest EndTimeId, simply change the delete statement to this:

    WITH cte AS (
    SELECT ExecutionId, RN = ROW_NUMBER() OVER(PARTITION BY ExecutionID, ServerId, ApiResult ORDER BY EndTimeId)
      FROM dbo.TestDelete
    )
    DELETE FROM cte
    WHERE RN > 1;

    The key to identifying the duplicates in the ROW_NUMBER function is to PARTITION BY all the columns that identify a row as a dupe, and ORDER BY the column that sorts the row you want to keep to the top of each partition. I hope this helps.
