Query tuning with conditional aggregation

  • Hi all,

I have a stored procedure that returns a table of 1 million+ rows that I then need to pivot (I can't modify the stored procedure, as it makes a call to a proprietary .exe). I first tried using the PIVOT operator to do what I need, but that takes about 7 minutes to generate a 1440-row table with 900+ columns. I managed to reduce the execution time to a little over 2 minutes by using conditional aggregation with dynamically generated column names. I'm still looking for a way to reduce the execution time even further to be acceptable for the user, but am stumped as to any further optimizations I could do.


    INSERT INTO Exampledb.dbo.DestinationTable_EF278
    SELECT
    Timestamp,
    [NC_S\Column 1] = SUM(CASE WHEN TagName='NC_S\Column 1' THEN value END),
    [NC_S\Column 2] = SUM(CASE WHEN TagName='NC_S\Column 2' THEN value END),
    [NC_S\Column 3] = SUM(CASE WHEN TagName='NC_S\Column 3' THEN value END),
    [...]Repeat 900 times[...]
[NC_S\Column 901] = SUM(CASE WHEN TagName='NC_S\Column 901' THEN value END)
    FROM Exampledb.dbo.SourceTable_EF278
    GROUP BY Timestamp
    ORDER BY Timestamp

Any ideas how to make this faster?
Thanks!
    prefet

  • prefet - Monday, January 21, 2019 9:27 AM

    Hi all,

I have a stored procedure that returns a table of 1 million+ rows that I then need to pivot (I can't modify the stored procedure, as it makes a call to a proprietary .exe). I first tried using the PIVOT operator to do what I need, but that takes about 7 minutes to generate a 1440-row table with 900+ columns. I managed to reduce the execution time to a little over 2 minutes by using conditional aggregation with dynamically generated column names. I'm still looking for a way to reduce the execution time even further to be acceptable for the user, but am stumped as to any further optimizations I could do.


    INSERT INTO Exampledb.dbo.DestinationTable_EF278
    SELECT
    Timestamp,
    [NC_S\Column 1] = SUM(CASE WHEN TagName='NC_S\Column 1' THEN value END),
    [NC_S\Column 2] = SUM(CASE WHEN TagName='NC_S\Column 2' THEN value END),
    [NC_S\Column 3] = SUM(CASE WHEN TagName='NC_S\Column 3' THEN value END),
    [...]Repeat 900 times[...]
[NC_S\Column 901] = SUM(CASE WHEN TagName='NC_S\Column 901' THEN value END)
    FROM Exampledb.dbo.SourceTable_EF278
    GROUP BY Timestamp
    ORDER BY Timestamp

Any ideas how to make this faster?
Thanks!
    prefet

Don't return 900+ columns? I mean, seriously, how useful can a query be with that many columns? Maybe find a completely different approach to whatever you are trying to do.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • prefet - Monday, January 21, 2019 9:27 AM

    (I can't modify the stored procedure as it makes a call to a proprietary .exe)

    That statement begs for more explanation.  

    How is this proc calling a .exe, and what does the .exe do?

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

@micheal-2, the .exe is part of the historian, which is meant to handle large amounts of time-series data. The sproc is downstream for ingestion into our other tools, hence the 900 columns.

Do you have 1 row per Timestamp, TagName? If so, I think your query is probably about as fast as it will go. If you have many rows per Timestamp, TagName, then you would get better performance if you aggregate first:

    ;WITH CTE AS
    (
    SELECT Timestamp,
           TagName,
           SUM(value) value
      FROM Exampledb.dbo.SourceTable_EF278
     GROUP BY Timestamp, TagName
    )
    SELECT Timestamp,
           [NC_S\Column 1] = SUM(CASE WHEN TagName='NC_S\Column 1' THEN value END),
           [NC_S\Column 2] = SUM(CASE WHEN TagName='NC_S\Column 2' THEN value END),
           [NC_S\Column 3] = SUM(CASE WHEN TagName='NC_S\Column 3' THEN value END),
           [...]Repeat 900 times[...]
           [NC_S\Column 901] = SUM(CASE WHEN TagName='NC_S\Column 901' THEN value END)
      FROM CTE
     GROUP BY Timestamp
     ORDER BY Timestamp

  • Thanks. I tried with the additional CTE but it didn't reduce execution time.

  • You need to either:

    Cluster the table by timestamp (if that's how you (almost) always query against the table)
    Or
Create a nonclustered index on (Timestamp) INCLUDE (TagName, Value)

    Either one of those will avoid having to sort all the data every time the query runs.
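A minimal sketch of the two options, using the table and columns from the query above (the index names are made up, and this assumes there is no existing clustered index that conflicts):

```sql
-- Option 1: make Timestamp the clustering key (hypothetical index name):
CREATE CLUSTERED INDEX CIX_SourceTable_Timestamp
    ON Exampledb.dbo.SourceTable_EF278 (Timestamp);

-- Option 2: a covering nonclustered index, leaving any existing
-- clustered index in place:
CREATE NONCLUSTERED INDEX IX_SourceTable_Timestamp
    ON Exampledb.dbo.SourceTable_EF278 (Timestamp)
    INCLUDE (TagName, value);
```

Either index lets the engine read the rows already ordered by Timestamp, so the GROUP BY/ORDER BY no longer needs a full sort.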

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

A query without any kind of filtering, such as this one, is frequently helped only by upgrading hardware. Yes, possibly indexing differently could help... maybe (I can't see the execution plan, so I have no clue). However, that's only going to help a little. Then it's just down to hardware. When you're moving everything, you need the hardware to support that.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • ScottPletcher - Tuesday, January 22, 2019 10:42 AM

    You need to either:

    Cluster the table by timestamp (if that's how you (almost) always query against the table)
    Or
    Create a non-clus index on (Timestamp) include (TagName, Value)

    Either one of those will avoid having to sort all the data every time the query runs.

A streaming aggregate, which is often an order of magnitude faster than a hash aggregate, won't be facilitated by either of these suggestions, but by a clustered index on (Timestamp, TagName) or an ordinary index on (Timestamp, TagName) INCLUDE (Value).
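A sketch of the nonclustered variant (the index name is hypothetical):

```sql
-- Key order (Timestamp, TagName) delivers rows pre-sorted, which lets
-- the optimizer pick a stream aggregate instead of a hash aggregate;
-- INCLUDE (value) keeps the index covering for the pivot query:
CREATE NONCLUSTERED INDEX IX_SourceTable_Ts_Tag
    ON Exampledb.dbo.SourceTable_EF278 (Timestamp, TagName)
    INCLUDE (value);
```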

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

We're on 2008, so this won't help; however, a possible Hail Mary on this would be to look at columnstore indexes.
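For anyone on a later version, a sketch of what that would look like (nonclustered columnstore indexes arrived in SQL Server 2012; note the 2012 implementation made the table read-only while the index existed, so it suits a load-then-report table):

```sql
-- Hypothetical index name; column list taken from the query above:
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_SourceTable
    ON Exampledb.dbo.SourceTable_EF278 (Timestamp, TagName, value);
```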

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey - Thursday, January 24, 2019 7:50 AM

    We're on 2008, so this won't help, however, a possible Hail Mary on this would be to look at columnstore indexes.

No streaming aggregates in 2008?
I think columnstore indexes were introduced in 2012 with limited functionality - but that's the version we're using, and columnstore indexes have proven very useful for some of our forecasting scripts.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

I was wondering how much time is spent evaluating a 900-branch CASE expression a million times.
It would certainly contribute to the runtime of the query.

  • ChrisM@Work - Thursday, January 24, 2019 6:36 AM

    ScottPletcher - Tuesday, January 22, 2019 10:42 AM

    You need to either:

    Cluster the table by timestamp (if that's how you (almost) always query against the table)
    Or
    Create a non-clus index on (Timestamp) include (TagName, Value)

    Either one of those will avoid having to sort all the data every time the query runs.

    A streaming aggregate, which is often an order of magnitude faster than a hash aggregate, won't be facilitated by either of these suggestions, but by a clustered index on TimeStamp, TagName or an ordinary index on (Timestamp,TagName) include (Value). 

    But the overall effect on performance might be worse, possibly much worse, with TagName in the keys, depending on the order the rows are inserted.  I don't know that a hash match aggregate will be that slow here, and it will certainly be much faster than having to sort all the rows, which is the real performance killer.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Thursday, January 24, 2019 8:12 AM

    ChrisM@Work - Thursday, January 24, 2019 6:36 AM

    ScottPletcher - Tuesday, January 22, 2019 10:42 AM

    You need to either:

    Cluster the table by timestamp (if that's how you (almost) always query against the table)
    Or
    Create a non-clus index on (Timestamp) include (TagName, Value)

    Either one of those will avoid having to sort all the data every time the query runs.

    A streaming aggregate, which is often an order of magnitude faster than a hash aggregate, won't be facilitated by either of these suggestions, but by a clustered index on TimeStamp, TagName or an ordinary index on (Timestamp,TagName) include (Value). 

    But the overall effect on performance might be worse, possibly much worse, with TagName in the keys, depending on the order the rows are inserted.  I don't know that a hash match aggregate will be that slow here, and it will certainly be much faster than having to sort all the rows, which is the real performance killer.

    Timestamp and TagName are *required* as the leading keys in any index to support a streaming aggregate in this case (a third column as the first column in the index _might_ also work provided that the value is the same for all qualifying rows). If either column is missing or both columns are present but there's a third column between them, then a sort operator would almost certainly appear in the plan - if the optimizer chose a streaming aggregate over a hash aggregate. And yes, that sort would be slow.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work - Thursday, January 24, 2019 8:22 AM

    ScottPletcher - Thursday, January 24, 2019 8:12 AM

    ChrisM@Work - Thursday, January 24, 2019 6:36 AM

    ScottPletcher - Tuesday, January 22, 2019 10:42 AM

    You need to either:

    Cluster the table by timestamp (if that's how you (almost) always query against the table)
    Or
    Create a non-clus index on (Timestamp) include (TagName, Value)

    Either one of those will avoid having to sort all the data every time the query runs.

    A streaming aggregate, which is often an order of magnitude faster than a hash aggregate, won't be facilitated by either of these suggestions, but by a clustered index on TimeStamp, TagName or an ordinary index on (Timestamp,TagName) include (Value). 

    But the overall effect on performance might be worse, possibly much worse, with TagName in the keys, depending on the order the rows are inserted.  I don't know that a hash match aggregate will be that slow here, and it will certainly be much faster than having to sort all the rows, which is the real performance killer.

    Timestamp and TagName are *required* as the leading keys in any index to support a streaming aggregate in this case (a third column as the first column in the index _might_ also work provided that the value is the same for all qualifying rows). If either column is missing or both columns are present but there's a third column between them, then a sort operator would almost certainly appear in the plan - if the optimizer chose a streaming aggregate over a hash aggregate. And yes, that sort would be slow.

I'd be amazed if the SQL optimizer were sophisticated enough to analyze all the CASE expressions and determine they relied only on TagName in addition to Timestamp, and to then do a more efficient aggregation. I'm not saying SQL can't do it, but I'd like to see it actually do it before I was sure it was capable of that.

    Also, for each new value of Timestamp, SQL would clear the hash buckets, right?  As long as all the buckets for a given timestamp fit into memory, a hash agg isn't going to have perf issues, is it?

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.
