Query tuning with conditional aggregation

  • Once you've experimented enough to know how it's done, building a sandbox for testing can be quick enough that it doesn't impact on your work. So here's a sandbox for testing out some of the assumptions made on this thread.
    I'm making the following assumptions:
    A streaming aggregate will outperform a hash aggregate
    A streaming aggregate will require an index on Timestamp and TagName
    Pre-aggregation as suggested by Jonathan AC Roberts will provide a performance lift.

    First, the sample data script:

    -- Create a sample data set
    IF OBJECT_ID('tempdb..#SampleData') IS NOT NULL DROP TABLE #SampleData;
    WITH
     E1 (N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)), -- 10 rows
     E2 (N) AS (SELECT 1 FROM E1 a, E1 b), -- 100 rows
     E4 (N) AS (SELECT 1 FROM E2 a, E2 b), -- 10000 rows
     E8 (N) AS (SELECT 1 FROM E4 a, E4 b) -- 100000000 rows
    SELECT TOP(1000000)
     [Timestamp] = DATEADD(DAY,CHECKSUM(NEWID()) % 10,GETDATE()),
     [Tagname] = 'NC_S\Column ' + CAST(1 + (ABS(CHECKSUM(NEWID())) % 900) AS VARCHAR(20)),
     [Value] = ABS(CHECKSUM(NEWID())) % 1000
    INTO #SampleData
    FROM E8;

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Next a couple of queries, both built dynamically - one with preaggregation, one without:

    -------------------------------------
    -- test with 500 columns
    DECLARE @stmt VARCHAR(MAX) = '';
    WITH
     E1 (N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)),
     E2 (N) AS (SELECT 1 FROM E1 a, E1 b),
     E4 (N) AS (SELECT 1 FROM E2 a, E2 b),
     iTally AS (SELECT TOP(500) n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
    FROM E4)
    SELECT @stmt = @stmt + ', [NC_S\Column ' + CAST(n AS VARCHAR(4)) + '] =  SUM(CASE WHEN TagName = ''NC_S\Column ' + CAST(n AS VARCHAR(4)) + ''' THEN [value] END) ' + CHAR(10)
    FROM iTally
    SET @stmt = ';WITH CTE AS
    (
    SELECT [Timestamp],
           TagName,
           SUM(value) value
      FROM #SampleData
     GROUP BY [Timestamp], TagName
    )SELECT
     [Timestamp]' + @stmt + 'FROM CTE
    GROUP BY [Timestamp]; ';
    EXEC(@stmt)
    ------------------------------------
    -- test with 500 columns
    DECLARE @stmt VARCHAR(MAX) = '';
    WITH
     E1 (N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) d (n)),
     E2 (N) AS (SELECT 1 FROM E1 a, E1 b),
     E4 (N) AS (SELECT 1 FROM E2 a, E2 b),
     iTally AS (SELECT TOP(500) n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
    FROM E4)
    SELECT @stmt = @stmt + ', [NC_S\Column ' + CAST(n AS VARCHAR(4)) + '] =  SUM(CASE WHEN TagName = ''NC_S\Column ' + CAST(n AS VARCHAR(4)) + ''' THEN [value] END) ' + CHAR(10)
    FROM iTally
    SET @stmt = 'SELECT
     [Timestamp]' + @stmt + 'FROM #SampleData
    GROUP BY [Timestamp]; ';
    EXEC(@stmt)
    -------------------------------------
    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Finally, four conditions to test:

    without preagg, single key clustered index: stream aggregate, 00:01:30
    without preagg, dual key clustered index: stream aggregate, 00:01:26
    with preagg, single key clustered index: hash aggregate x2, 00:00:05
    with preagg, dual key clustered index: stream aggregate x2, 00:00:03

    Conclusions:
    Preaggregation with the right index is, in this test harness, 30x faster than without preaggregation.
    The two-key index is irrelevant unless there's a preaggregation step.

    Adding more and more output columns to the query increases the execution time arithmetically:
    100 columns: 00:00:17
    200 columns: 00:00:35
    400 columns: 00:01:10
    800 columns: 00:02:23

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • We're just guess here.  Please post the full DDL for the Exampledb.dbo.SourceTable_EF278 table including all constraints and indexes.

    It would also be helpful if you if you provided some examples of what the the values in the TimeStamp and Value columns looked like.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 4 posts - 16 through 18 (of 18 total)

You must be logged in to reply to this topic. Login to reply