Why is this query so slow?

  • I have one "big" table with 300 million rows... 
    CREATE TABLE rstore.OutputDataset(
                 OutputDatasetId        INT      NOT NULL IDENTITY(1,1)
                ,OutputDatasetHeaderKey INT      NOT NULL
                ,DateKey                INT      NOT NULL
                ,SafeRegionKey          SMALLINT NOT NULL
                ,SafeCountryKey         SMALLINT NOT NULL
                ,Attribute1ValueKey     INT NULL
                ,Attribute2ValueKey     INT NULL
                ,Attribute3ValueKey     INT NULL
                ,Attribute4ValueKey     INT NULL
                ,Attribute5ValueKey     INT NULL
                ,Attribute6ValueKey     INT NULL
                ,Attribute7ValueKey     INT NULL
                ,Attribute8ValueKey     INT NULL
                ,Attribute9ValueKey     INT NULL
                ,Attribute10ValueKey    INT NULL
                ,SalesUnits             DECIMAL(19, 4) NOT NULL
                ,SalesValue             DECIMAL(19, 4) NOT NULL
                ,PriorSalesUnits        DECIMAL(19, 4) NOT NULL DEFAULT 0
                ,PriorSalesValue        DECIMAL(19, 4) NOT NULL DEFAULT 0
                ,DataStatusTypeKey      TINYINT NOT NULL
                ,CONSTRAINT PK_OutputDataset PRIMARY KEY NONCLUSTERED(OutputDatasetId)
                ,CONSTRAINT FK_OutputDataset_... quite a few foreign keys
                );

    BUT there is a beautiful clustered index that reduces the number of rows fetched to ~300,000 (there is no non-clustered index... since my clustered index is so big anyway)
    CREATE UNIQUE CLUSTERED INDEX UCIX_OutputDataSet ON rstore.OutputDataset(
                 OutputDatasetHeaderKey
                ,DateKey, SafeRegionKey, SafeCountryKey
                ,Attribute1ValueKey, Attribute2ValueKey, Attribute3ValueKey, Attribute4ValueKey, Attribute5ValueKey
                ,Attribute6ValueKey, Attribute7ValueKey, Attribute8ValueKey, Attribute9ValueKey, Attribute10ValueKey
                );

    ALL our queries specify the column OutputDatasetHeaderKey (which shrinks the number of rows by a lot), usually followed by a DateKey range, and most of the time by the SafeRegionKey, etc.

    The attached query is disappointingly slow (~11 seconds on my test server).
    The two temp tables get loaded with about 10,000 rows each, half of them useful.
    The main table retrieves about 300,000 rows through the clustered index.
    In my particular case, the columns Attribute1ValueKey to Attribute5ValueKey have values, so the first nine columns of the clustered index are not null.
    What I don't understand is that the optimiser does not seem to be able to use the fact that all three clustered indexes (the main table and both temp tables) have similar columns in the same order.
    On the temp tables, the clustered PK is (DateKey, SafeRegionKey, Attribute3Key), all of which appear in my main table's clustered index.
    Although I don't filter my main table on SafeRegionKey, this column is still NOT NULL and part of the clustered index...

    Not quite sure how to improve this query...
    As a side note, if I specify a SafeRegionKey to filter on, then everything seems fine (see the sketch below)... (the last statement is actually dynamic SQL I generate and execute through sp_executesql)

    I can see that after the merge between OutputDataset and #GrossupUnit, ~16 million rows have been read... but no idea how I can stop it :crazy:
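
    To illustrate the "fast" case mentioned above, this is roughly the shape of the generated statement once a region filter is added. It is a sketch only: @SafeRegionKey is just a placeholder for whatever the dynamic SQL actually passes to sp_executesql, and the temp-table column names are as discussed in this thread.

    SELECT DS.DateKey
          ,SalesUnits      = SUM(DS.SalesUnits      * ISNULL(GU.GrossupUnit,  1.0))
          ,SalesValue      = SUM(DS.SalesValue      * ISNULL(GV.GrossupValue, 1.0))
          ,PriorSalesUnits = SUM(DS.PriorSalesUnits * ISNULL(GU.GrossupUnit,  1.0))
          ,PriorSalesValue = SUM(DS.PriorSalesValue * ISNULL(GV.GrossupValue, 1.0))
    FROM rstore.OutputDataset DS
    LEFT JOIN #GrossupUnit GU
           ON GU.DateKey       = DS.DateKey
          AND GU.SafeRegionKey = DS.SafeRegionKey
          AND GU.Attribute3Key = DS.Attribute3ValueKey
    LEFT JOIN #GrossupValue GV
           ON GV.DateKey       = DS.DateKey
          AND GV.SafeRegionKey = DS.SafeRegionKey
          AND GV.Attribute3Key = DS.Attribute3ValueKey
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
      AND DS.SafeRegionKey = @SafeRegionKey   -- the extra filter that makes the seek cover the third key column
    GROUP BY DS.DateKey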

    Any suggestions?

    Thanks

    Eric

  • I very much doubt you need all of those columns in your Clustered Index. Have a look at this article from Redgate: https://www.red-gate.com/simple-talk/sql/learn-sql-server/effective-clustered-indexes/.

    Notice, the description says a Clustered Index should be Unique, Static, Ever-Increasing and (first and foremost) Narrow. 14 columns (if I counted correctly) is far too many for a clustered index, and I doubt that those values are static either.

    The first step is reviewing your Clustered Index, and the next will be what other indexes will be beneficial for your table.

    Also, the fact that you have columns Attribute1ValueKey to Attribute10ValueKey implies you have de-normalised data. This isn't going to help performance either; I strongly suggest reconsidering your data structure there and using a normalised format.
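
    As a rough sketch of the direction I mean (untested, and the key/INCLUDE column choices are only a guess based on the queries shown in this thread):

    -- Cluster on the narrow IDENTITY column instead of the 14-column key...
    ALTER TABLE rstore.OutputDataset DROP CONSTRAINT PK_OutputDataset;
    DROP INDEX UCIX_OutputDataSet ON rstore.OutputDataset;

    ALTER TABLE rstore.OutputDataset
        ADD CONSTRAINT PK_OutputDataset PRIMARY KEY CLUSTERED (OutputDatasetId);

    -- ...and support the common "header + date range (+ region)" search pattern.
    CREATE NONCLUSTERED INDEX IX_OutputDataset_Header_Date
        ON rstore.OutputDataset (OutputDatasetHeaderKey, DateKey, SafeRegionKey)
        INCLUDE (Attribute3ValueKey, SalesUnits, SalesValue, PriorSalesUnits, PriorSalesValue);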

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Why not combine those two lookups into one? Other than that, what Thom said. The clustered index is far too wide. Don't forget that any ordinary index on the same table will contain the keys of the clustered index.

    SELECT DateKey,
           SafeRegionKey,
           SafeFormFactorKey,
           GrossupUnit  = MAX(CASE WHEN T.MetadataTypeName = 'Grossup Unit'  THEN MetadataValue ELSE NULL END),
           GrossupValue = MAX(CASE WHEN T.MetadataTypeName = 'Grossup Value' THEN MetadataValue ELSE NULL END)
    INTO #GrossupUnitValue
    FROM rstore.DPAPMetadata M
    INNER JOIN rstore.MetadataType T
            ON T.MetadataTypeId = M.MetadataTypeKey
           AND T.MetadataTypeName IN ('Grossup Unit','Grossup Value')
    WHERE RReleaseKey = @RReleaseKey
    GROUP BY DateKey, SafeRegionKey, SafeFormFactorKey;

    CREATE UNIQUE CLUSTERED INDEX ucx_Stuff ON #GrossupUnitValue (DateKey, SafeRegionKey, SafeFormFactorKey);

    SELECT DS.DateKey
          ,SalesUnits      = SUM(SalesUnits      * ISNULL(GUV.GrossupUnit,  1.0))
          ,SalesValue      = SUM(SalesValue      * ISNULL(GUV.GrossupValue, 1.0))
          ,PriorSalesUnits = SUM(PriorSalesUnits * ISNULL(GUV.GrossupUnit,  1.0))
          ,PriorSalesValue = SUM(PriorSalesValue * ISNULL(GUV.GrossupValue, 1.0))
    FROM rstore.OutputDataset DS
    LEFT JOIN #GrossupUnitValue GUV
           ON GUV.DateKey           = DS.DateKey
          AND GUV.SafeRegionKey     = DS.SafeRegionKey
          AND GUV.SafeFormFactorKey = DS.Attribute3ValueKey -- SafeFormFactorKey is the column actually created in #GrossupUnitValue
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
    GROUP BY DS.DateKey

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • The idea to join these two lookups is interesting indeed.
    I don't think it will help in this immediate problem but it's certainly something I should do.

  • Another way of writing the query:
    ;WITH DS AS
    (
      SELECT DISTINCT
             DS.DateKey,
             DS.SafeRegionKey,
             DS.Attribute3ValueKey
      FROM rstore.OutputDataset DS
      WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
        AND DS.DateKey >= @DateIdFrom
        AND DS.DateKey <= @DateIdTo
    ),
    GU AS
    (
      SELECT GU.DateKey,
             SUM(SalesUnits * ISNULL(GU.GrossupUnit, 1.0)) SalesUnits,
             SUM(PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0)) PriorSalesUnits
      FROM DS
      INNER JOIN #GrossupUnit GU
              ON GU.DateKey = DS.DateKey
             AND GU.SafeRegionKey = DS.SafeRegionKey
             AND GU.Attribute3ValueKey = DS.Attribute3ValueKey
      GROUP BY GU.DateKey
    ),
    GV AS
    (
      SELECT GV.DateKey,
             SUM(SalesValue * ISNULL(GV.GrossupValue, 1.0)) SalesValue,
             SUM(PriorSalesValue * ISNULL(GV.GrossupValue, 1.0)) PriorSalesValue
      FROM DS
      INNER JOIN #GrossupValue GV
              ON GV.DateKey = DS.DateKey
             AND GV.SafeRegionKey = DS.SafeRegionKey
             AND GV.Attribute3ValueKey = DS.Attribute3ValueKey
      GROUP BY GV.DateKey
    )
    SELECT DS.DateKey,
           ISNULL(GU.SalesUnits, 0) SalesUnits,
           ISNULL(GV.SalesValue, 0) SalesValue,
           ISNULL(GU.PriorSalesUnits, 0) PriorSalesUnits,
           ISNULL(GV.PriorSalesValue, 0) PriorSalesValue
    FROM DS
    LEFT JOIN GU
           ON GU.DateKey = DS.DateKey
    LEFT JOIN GV
           ON GV.DateKey = DS.DateKey
    You might get even better performance if you create a new temporary table from the query:
    SELECT DISTINCT
        DS.DateKey,
        DS.SafeRegionKey,
        DS.Attribute3ValueKey
      INTO #DistinctOutputDataset
      FROM rstore.OutputDataset DS
      WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
       AND DS.DateKey >= @DateIdFrom
       AND DS.DateKey <= @DateIdTo

    (with appropriate indexes)
    and use it to replace the "DS" CTE
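
    For instance, the "appropriate index" could be something like this (just a sketch; the DISTINCT guarantees those three columns are unique):

    -- Index the new temp table the same way as the grossup temp tables,
    -- so the joins on (DateKey, SafeRegionKey, Attribute3ValueKey) can be satisfied by seeks/merge joins.
    CREATE UNIQUE CLUSTERED INDEX ucx_DistinctOutputDataset
        ON #DistinctOutputDataset (DateKey, SafeRegionKey, Attribute3ValueKey);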

  • Eric Mamet - Thursday, May 17, 2018 6:22 AM

    The idea to join these two lookups is interesting indeed.
    I don't think it will help in this immediate problem but it's certainly something I should do.

    The individual queries both use DISTINCT. Aggregating the extra column in my version won't add much to the total cost of one of the original queries. I'd expect the overall cost to be around half that of the original batch, and the third query is reduced in complexity. Generally this helps to explore tuning possibilities.


  • Thom A - Thursday, May 17, 2018 5:45 AM

    Hi Thom,

    Thanks for your answer and the interesting link.

    I think I am quite familiar with all this but, as often, there are exceptions to general rules.
    In fact, my clustered index is not far from the recommendations in that article.

    • Narrow: OK, I fail on this one... BUT I do not want any non-clustered indexes, so it does not matter so much
    • Unique: Yes
    • Static: Yes, actually, we don't ever update these rows
    • Ever-increasing: Yes, we "could" delete an entire batch of rows (identified by the first column, OutputDatasetHeaderKey) but that remains relatively exceptional
    The reason for these 10 generic attributes is that the content of the rows will depend on the type of data we store in there.
    The type of data stored is indirectly indicated by the value of the first column (OutputDatasetHeaderKey).

    It all works pretty well because the first column in the clustered index is very selective (I don't have more than 600,000 rows for any value).
    The query I refer to is the unfortunate exception...
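
    (That selectivity is easy enough to check, for what it's worth - a quick sketch only:)

    -- Sketch: rows per OutputDatasetHeaderKey, to confirm no single value exceeds ~600,000 rows.
    SELECT OutputDatasetHeaderKey, COUNT_BIG(*) AS RowCnt
    FROM rstore.OutputDataset
    GROUP BY OutputDatasetHeaderKey
    ORDER BY RowCnt DESC;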

  • Jonathan AC Roberts - Thursday, May 17, 2018 6:23 AM


    Very interesting idea! 
    I'll have to give that a shot as well

    Many Thanks

  • ChrisM@Work - Thursday, May 17, 2018 6:37 AM


    I get your point.
    According to the query plan, the expensive part is the first merge join between one of the temp tables and the big table, so I would not expect combining the lookups to cut the cost in half.
    However, it's more elegant and simpler, so definitely worth a shot.

    Thank you very much folks, I did not expect such quick and good answers.

    VERY appreciated

    Eric

  • One thing I noticed in the execution plan is the residual I/O. The actual number of rows is 321,374 but the number of rows read is 15,747,409. I'm seeing this with four of the operators in the execution plan. This could be slowing it down. But then again I could be wrong.

  • Just curious, but looking at the execution plans for the queries populating the temp tables, is the DISTINCT actually needed? The number of rows entering and leaving the sort operator are the same.

  • Lynn Pettis - Thursday, May 17, 2018 7:49 AM

    One thing I noticed in the execution plan is the residual I/O. The actual number of rows is 321,374 but the number of rows read is 15,747,409. I'm seeing this with four of the operators in the execution plan. This could be slowing it down. But then again I could be wrong.

    I think you are absolutely right, and it is precisely the problem I am chasing.
    I suspect the various ideas I was given above might help.

    Unfortunately, I vastly simplified the initial query context because some of that code is generated on the fly (hence the use of sp_executesql), so I'll have to be a bit careful about how I change things.
  • Your clustered index is fine.
    You really don't need the temp tables for this specific query, but if building them is not taking much time, then that's OK too.

    But in the final query, for this specific query, since you're only grouping by DateKey, you can pre-summarize the temp tables and avoid a lot of JOIN overhead. Since this may be a generic process, I'm not sure how well you'll be able to adjust the code that generates the query to do this.


    SELECT DS.DateKey
        ,SalesUnits = SUM(SalesUnits * ISNULL(GU.GrossupUnit, 1.0))
        ,SalesValue = SUM(SalesValue * ISNULL(GV.GrossupValue, 1.0))
        ,PriorSalesUnits = SUM(PriorSalesUnits * ISNULL(GU.GrossupUnit, 1.0))
        ,PriorSalesValue = SUM(PriorSalesValue * ISNULL(GV.GrossupValue, 1.0))
     FROM rstore.OutputDataset DS
     LEFT JOIN (
         SELECT DateKey, SUM(GrossupUnit) AS GrossupUnit
         FROM #GrossupUnit
         GROUP BY DateKey
      ) AS GU ON GU.DateKey = DS.DateKey
     LEFT JOIN (
         SELECT DateKey, SUM(GrossupValue) AS GrossupValue
         FROM #GrossupValue
         GROUP BY DateKey
    ) AS GV ON GV.DateKey = DS.DateKey
    WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
      AND DS.DateKey >= @DateIdFrom
      AND DS.DateKey <= @DateIdTo
    GROUP BY DS.DateKey

    SQL DBA, SQL Server MVP (07, 08, 09). A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Thursday, May 17, 2018 8:24 AM


    I think you are right, but the "real" code is actually generic, so I don't know in advance what filters I'll have AND what aggregation will be required.

    However, I only have a performance problem in this particular scenario (as far as I can tell), so I might try to create a specific sub-procedure to implement just this particular case.
    This would allow me to make some assumptions and take into consideration the various advice I just got.
    I also have occasional problems with re-compilations, so having specialised sub-procedure(s) is probably useful in that sense too.
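
    Something along these lines is what I have in mind - only a skeleton, and the procedure/parameter names are hypothetical:

    -- Skeleton only (hypothetical names): a dedicated sub-procedure for the
    -- "header + date range, grouped by DateKey" case, so this query keeps its own plan
    -- instead of going through the generic dynamic SQL. The grossup temp-table logic
    -- from the original batch would move in here as well.
    CREATE PROCEDURE rstore.usp_GetSalesByDate
        @OutputDatasetHeaderKey INT,
        @DateIdFrom             INT,
        @DateIdTo               INT
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT DS.DateKey
              ,SalesUnits = SUM(DS.SalesUnits)
              ,SalesValue = SUM(DS.SalesValue)
        FROM rstore.OutputDataset DS
        WHERE DS.OutputDatasetHeaderKey = @OutputDatasetHeaderKey
          AND DS.DateKey >= @DateIdFrom
          AND DS.DateKey <= @DateIdTo
        GROUP BY DS.DateKey;
    END;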

  • Lynn Pettis - Thursday, May 17, 2018 7:53 AM

    Just curious, but looking at the execution plans for the queries populating the temp tables, is the DISTINCT actually needed? The number of rows entering and leaving the sort operator are the same.

    Well spotted!

    The DISTINCTs are necessary because whoever designed these tables made a mistake and we store duplicates!
    Not much I can do about this one... I'd like to redesign these tables but not sure whether this will ever happen... :Whistling:
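
    (For what it's worth, a quick way to see how much duplication the DISTINCT is hiding - a sketch only, and it assumes the intended key of rstore.DPAPMetadata is the grouping below:)

    -- Sketch: count duplicate metadata rows per assumed key; adjust the grouping to the real key.
    SELECT RReleaseKey, DateKey, SafeRegionKey, SafeFormFactorKey, MetadataTypeKey,
           COUNT(*) AS Copies
    FROM rstore.DPAPMetadata
    GROUP BY RReleaseKey, DateKey, SafeRegionKey, SafeFormFactorKey, MetadataTypeKey
    HAVING COUNT(*) > 1
    ORDER BY Copies DESC;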
