Help with performance - aggregating a lot of data

  • Hey, folks... I could use a hand from the performance gurus out there (test data and test query below).

    This particular query is returning about 1/10th of the actual data I'm working with. The execution plan that the test data returns isn't quite the same as the actual data, but the performance is about the same. (This version uses group by for the aggregates - I've also tested with over(partition by), but that version seems worse.)

    Note that in my actual dev environment, I'm seeing a clustered index seek on the TimeDetail table, not a scan.

    I'm dealing with source data where the table structure is fixed, and TPTB want this data real time, so moving to a data warehouse isn't an option at the moment. (Some of the data types in the test tables are specifically there to mimic my production data.)

    I've built more test data than I'm actually using, but the data is a *lot* cleaner than my production data.

    I'm hoping there's something obvious about the way I've got this written that will speed it up considerably... Thanks in advance!

    Test data setup:

    -- set up tables
    CREATE TABLE TimeDetail
    (
        docnbr   VARCHAR(10) NOT NULL,
        day1_reg FLOAT NOT NULL,
        day1_ot1 FLOAT NOT NULL,
        day1_ot2 FLOAT NOT NULL,
        day2_reg FLOAT NOT NULL,
        day2_ot1 FLOAT NOT NULL,
        day2_ot2 FLOAT NOT NULL,
        day3_reg FLOAT NOT NULL,
        day3_ot1 FLOAT NOT NULL,
        day3_ot2 FLOAT NOT NULL,
        day4_reg FLOAT NOT NULL,
        day4_ot1 FLOAT NOT NULL,
        day4_ot2 FLOAT NOT NULL,
        day5_reg FLOAT NOT NULL,
        day5_ot1 FLOAT NOT NULL,
        day5_ot2 FLOAT NOT NULL,
        day6_reg FLOAT NOT NULL,
        day6_ot1 FLOAT NOT NULL,
        day6_ot2 FLOAT NOT NULL,
        day7_reg FLOAT NOT NULL,
        day7_ot1 FLOAT NOT NULL,
        day7_ot2 FLOAT NOT NULL,
        linenbr  SMALLINT NOT NULL
    );
    GO

    ALTER TABLE [dbo].[TimeDetail] ADD CONSTRAINT [TimeDetail0] PRIMARY KEY CLUSTERED ([docnbr], [linenbr]) ON [PRIMARY];
    GO

    CREATE TABLE TimeHeader
    (
        empID      VARCHAR(6) NOT NULL,
        docnbr     VARCHAR(20) NOT NULL,
        weekEnding SMALLDATETIME NOT NULL,
        tc_status  CHAR(1) NOT NULL
    );
    GO

    ALTER TABLE [dbo].[TimeHeader] ADD CONSTRAINT [TimeHeader0] PRIMARY KEY CLUSTERED ([docnbr]) ON [PRIMARY];
    GO

    CREATE NONCLUSTERED INDEX [TimeHeader1] ON [dbo].[TimeHeader] ([empid]) WITH (FILLFACTOR = 90) ON [PRIMARY];
    GO

    CREATE NONCLUSTERED INDEX [TimeHeader3] ON [dbo].[TimeHeader] ([empid], [weekEnding]) WITH (FILLFACTOR = 90) ON [PRIMARY];
    GO

    CREATE NONCLUSTERED INDEX [xidxTimeHeader4] ON [dbo].[TimeHeader] ([tc_status], [weekEnding]) INCLUDE ([empid], [docnbr]) ON [PRIMARY];
    GO

    -- create test data (using Jeff Moden's tally table to create enough data for testing)
    SELECT TOP 4000
        IDENTITY( INT, 1, 1 ) AS x
    INTO dbo.Tally
    FROM master.dbo.syscolumns sc1,
         master.dbo.syscolumns sc2;
    GO

    ALTER TABLE dbo.Tally
    ADD CONSTRAINT pkTally
        PRIMARY KEY CLUSTERED (x) WITH FILLFACTOR = 100;
    GO

    INSERT INTO TimeHeader
        ( empid, weekending, docnbr, tc_status )
    SELECT empid     = CAST(t.x AS VARCHAR(6)),
           weeks.weekEnding,
           docnbr    = 5000 + ROW_NUMBER() OVER ( ORDER BY t.x, weeks.weekending ASC ),
           tc_status = 'P'
    FROM dbo.Tally t
    CROSS JOIN ( SELECT weekNbr    = t.x,
                        weekEnding = DATEADD(wk, t.x, '2009-12-28')
                 FROM dbo.Tally t
                 WHERE t.x <= 156
               ) weeks
    WHERE t.x <= 4000;
    GO

    INSERT INTO dbo.TimeDetail
        ( docnbr,
          day1_reg, day1_ot1, day1_ot2,
          day2_reg, day2_ot1, day2_ot2,
          day3_reg, day3_ot1, day3_ot2,
          day4_reg, day4_ot1, day4_ot2,
          day5_reg, day5_ot1, day5_ot2,
          day6_reg, day6_ot1, day6_ot2,
          day7_reg, day7_ot1, day7_ot2,
          linenbr )
    SELECT h.docnbr,
           1, 1.25, 1.5,
           2, 2.25, 2.5,
           3, 3.25, 3.5,
           4, 4.25, 4.5,
           5, 5.25, 5.5,
           6, 6.25, 6.5,
           7, 7.25, 7.5,
           lines.linenbr
    FROM TimeHeader h
    CROSS JOIN ( SELECT linenbr = t.x
                 FROM dbo.Tally t
                 WHERE t.x <= 24
               ) lines;
    GO

    Query:

    SELECT DailyHours.empid,
           DailyHours.weekEnding,
           RegHours = SUM(DailyHours.RegHrs),
           OT1Hours = SUM(DailyHours.OT1Hours),
           OT2Hours = SUM(DailyHours.OT2Hours)
    FROM ( SELECT empid,
                  weekEnding,
                  RegHrs,
                  OT1Hours,
                  OT2Hours
           FROM ( SELECT h.empid,
                         h.weekEnding,
                         d.day1_reg, d.day1_ot1, d.day1_ot2,
                         d.day2_reg, d.day2_ot1, d.day2_ot2,
                         d.day3_reg, d.day3_ot1, d.day3_ot2,
                         d.day4_reg, d.day4_ot1, d.day4_ot2,
                         d.day5_reg, d.day5_ot1, d.day5_ot2,
                         d.day6_reg, d.day6_ot1, d.day6_ot2,
                         d.day7_reg, d.day7_ot1, d.day7_ot2
                  FROM dbo.TimeDetail d
                  INNER JOIN dbo.TimeHeader h ON d.docnbr = h.docnbr
                  WHERE h.tc_status = 'P'
                    AND h.weekEnding BETWEEN '2010-01-01' AND '2010-10-27'
                ) hrs
           CROSS APPLY ( SELECT day1_reg, day1_ot1, day1_ot2
                         UNION ALL
                         SELECT day2_reg, day2_ot1, day2_ot2
                         UNION ALL
                         SELECT day3_reg, day3_ot1, day3_ot2
                         UNION ALL
                         SELECT day4_reg, day4_ot1, day4_ot2
                         UNION ALL
                         SELECT day5_reg, day5_ot1, day5_ot2
                         UNION ALL
                         SELECT day6_reg, day6_ot1, day6_ot2
                         UNION ALL
                         SELECT day7_reg, day7_ot1, day7_ot2
                       ) h ( RegHrs, OT1Hours, OT2Hours )
         ) DailyHours
    GROUP BY DailyHours.empid,
             DailyHours.weekEnding;

    -Ki

  • Hey Ki,

    Well, with how convoluted this query is, nothing's obvious yet. Some questions:

    Considering the sample data, should I be expecting a result set that looks like this?

    empid  weekEnding           RegHours  OT1Hours  OT2Hours
    966    2010-04-12 00:00:00       672       714       756
    1160   2010-03-08 00:00:00       672       714       756
    2688   2010-07-19 00:00:00       672       714       756
    3101   2010-02-22 00:00:00       672       714       756

    Is that just an artifact of the sample data or is something very wrong here?

    TimeDetail, this is a 'price' adjustment? No, that's not right either.

    What is TimeDetail? Why are there 24 lines per docnbr? I think I'm getting caught up in artifacts of the sample data, but understanding what the data really is helps me figure out whether there are alternatives.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Hi Ki

    I know where you're going with this construction (APPLY with row generation) - it makes sense and looks cool - but it's quite expensive. In this case, plain old-fashioned syntax works twice as fast:

    SELECT
        h.empid,
        h.weekEnding,
        reg = SUM(d.day1_reg + d.day2_reg + d.day3_reg + d.day4_reg + d.day5_reg + d.day6_reg + d.day7_reg),
        ot1 = SUM(d.day1_ot1 + d.day2_ot1 + d.day3_ot1 + d.day4_ot1 + d.day5_ot1 + d.day6_ot1 + d.day7_ot1),
        ot2 = SUM(d.day1_ot2 + d.day2_ot2 + d.day3_ot2 + d.day4_ot2 + d.day5_ot2 + d.day6_ot2 + d.day7_ot2)
    FROM dbo.TimeDetail d
    INNER JOIN dbo.TimeHeader h ON d.docnbr = h.docnbr
    WHERE h.tc_status = 'P'
      AND h.weekEnding BETWEEN '2010-01-01' AND '2010-10-27'
    GROUP BY h.empid, h.weekEnding;

    Assuming of course I've read your code and requirements correctly...


    [font="Arial"]Low-hanging fruit picker and defender of the moggies[/font]

    For better assistance in answering your questions, please read this[/url].


    Understanding and using APPLY, (I)[/url] and (II)[/url] Paul White[/url]

    Hidden RBAR: Triangular Joins[/url] / The "Numbers" or "Tally" Table: What it is and how it replaces a loop[/url] Jeff Moden[/url]

  • Evil Kraig F (10/27/2011)


    Hey Ki,

    Well, with how convoluted this query is, nothing's obvious yet. Some questions:

    Considering the sample data, should I be expecting a result set that looks like this?

    empid  weekEnding           RegHours  OT1Hours  OT2Hours
    966    2010-04-12 00:00:00       672       714       756
    1160   2010-03-08 00:00:00       672       714       756
    2688   2010-07-19 00:00:00       672       714       756
    3101   2010-02-22 00:00:00       672       714       756

    Is that just an artifact of the sample data or is something very wrong here?

    TimeDetail, this is a 'price' adjustment? No, that's not right either.

    What is TimeDetail? Why are there 24 lines per docnbr? I think I'm getting caught up in artifacts of the sample data, but understanding what the data really is helps me figure out whether there are alternatives.

    Craig (or should I spell it wrong since you changed your handle?),

    You're right - the sample data is making it hard for you. It's an archaic timecard hours tracking system. There are 24 lines per docnbr because I didn't feel like generating a random number of lines for the sample data... 🙂

    Chris caught me with the actual issue - this code has morphed over the last couple of weeks and the requirements have changed enough that I kept code that used to be necessary and isn't anymore. A simple rewrite back to basic stuff should help - I'll test that shortly.

    Thanks!

    -Ki

  • ChrisM@home (10/27/2011)


    Hi Ki

    I know where you're going with this construction (APPLY with row generation) - it makes sense and looks cool - but it's quite expensive. In this case, plain old-fashioned syntax works twice as fast:

    SELECT
        h.empid,
        h.weekEnding,
        reg = SUM(d.day1_reg + d.day2_reg + d.day3_reg + d.day4_reg + d.day5_reg + d.day6_reg + d.day7_reg),
        ot1 = SUM(d.day1_ot1 + d.day2_ot1 + d.day3_ot1 + d.day4_ot1 + d.day5_ot1 + d.day6_ot1 + d.day7_ot1),
        ot2 = SUM(d.day1_ot2 + d.day2_ot2 + d.day3_ot2 + d.day4_ot2 + d.day5_ot2 + d.day6_ot2 + d.day7_ot2)
    FROM dbo.TimeDetail d
    INNER JOIN dbo.TimeHeader h ON d.docnbr = h.docnbr
    WHERE h.tc_status = 'P'
      AND h.weekEnding BETWEEN '2010-01-01' AND '2010-10-27'
    GROUP BY h.empid, h.weekEnding;

    Assuming of course I've read your code and requirements correctly...

    *headdesk*

    This started out as a fairly complex unpivot with a couple of different tables in the mix. The cross apply was a *whole* lot faster than the unpivot was... but as the requirements for this project have changed, I apparently got too close to take the necessary step back and actually *look* at what I had left. Cool is overrated. I'm far happier with code that doesn't lock up the system when 20 people decide to run the same report for 10 years worth of data all at once.

    I shall now hit my head repeatedly on the desk and go test this...

    Thanks for the second set of eyes!

    -Ki

    No worries Ki. There's an aspect of this which is interesting to me, so I'll continue to work with it at home - they wouldn't allow these volumes at work - and I'll post back if it throws up anything which might be useful to you.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (10/28/2011)


    No worries Ki. There's an aspect of this which is interesting to me, so I'll continue to work with it at home - they wouldn't allow these volumes at work - and I'll post back if it throws up anything which might be useful to you.

    Now here's the interesting thing - in my dev environment against the *actual* data involved, the two versions appear to run at exactly the same speed. Doesn't change the fact that simpler code is better, and my "left over" stuff was needlessly overcomplicating what turned into a much simpler data set that I was originally going after. :blush:

    Either way, it's still too slow. I'll test against the production hardware over the weekend (I'm lucky I've already got a maintenance window planned) - I want to see what it actually looks like on decent hardware before deciding if I'm going to start arguing about whether or not "real time" means the same thing to the people asking the question as it does to me.

    -Ki

  • Kiara (10/28/2011)


    ChrisM@Work (10/28/2011)


    No worries Ki. There's an aspect of this which is interesting to me, so I'll continue to work with it at home - they wouldn't allow these volumes at work - and I'll post back if it throws up anything which might be useful to you.

    Now here's the interesting thing - in my dev environment against the *actual* data involved, the two versions appear to run at exactly the same speed. Doesn't change the fact that simpler code is better, and my "left over" stuff was needlessly overcomplicating what turned into a much simpler data set that I was originally going after. :blush:

    Either way, it's still too slow. I'll test against the production hardware over the weekend (I'm lucky I've already got a maintenance window planned) - I want to see what it actually looks like on decent hardware before deciding if I'm going to start arguing about whether or not "real time" means the same thing to the people asking the question as it does to me.

    The data's got some useful features in it... Running your query and the simplified one last night with the original filters, both showed a clustered index scan in the plan. When I cut the number of rows down by filtering the simplified query for three months' worth of data, the plan changed to a clustered index seek as you might expect, but it took more than twice as long to run.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (10/28/2011)


    Kiara (10/28/2011)


    ChrisM@Work (10/28/2011)


    No worries Ki. There's an aspect of this which is interesting to me, so I'll continue to work with it at home - they wouldn't allow these volumes at work - and I'll post back if it throws up anything which might be useful to you.

    Now here's the interesting thing - in my dev environment against the *actual* data involved, the two versions appear to run at exactly the same speed. Doesn't change the fact that simpler code is better, and my "left over" stuff was needlessly overcomplicating what turned into a much simpler data set that I was originally going after. :blush:

    Either way, it's still too slow. I'll test against the production hardware over the weekend (I'm lucky I've already got a maintenance window planned) - I want to see what it actually looks like on decent hardware before deciding if I'm going to start arguing about whether or not "real time" means the same thing to the people asking the question as it does to me.

    The data's got some useful features in it... Running your query and the simplified one last night with the original filters, both showed a clustered index scan in the plan. When I cut the number of rows down by filtering the simplified query for three months' worth of data, the plan changed to a clustered index seek as you might expect, but it took more than twice as long to run.

    The clustered index scan/seek thing happens with the actual data, too - and I've seen the same results you're seeing. I scratched my head at that, and figured I'd dig more once I'd gotten speed down to something I was happier with.

    If I lose the aggregates and just pull the summarized data by row, it actually runs much more quickly. Since the end result of this is an SSRS report, I'm also going to see what happens if I push the aggregation process over to my report server. I usually like to do the heavy lifting on the SQL side, but now I'm curious.
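
    By "pull the summarized data by row" I mean something along these lines - one row per detail line, hours rolled up across the row, no GROUP BY at all, and the report does the grouping (a sketch, not the exact code I'm running):

    -- Sketch: per-row rollup only, aggregation deferred to the SSRS report
    SELECT h.empid,
           h.weekEnding,
           reg = d.day1_reg + d.day2_reg + d.day3_reg + d.day4_reg + d.day5_reg + d.day6_reg + d.day7_reg,
           ot1 = d.day1_ot1 + d.day2_ot1 + d.day3_ot1 + d.day4_ot1 + d.day5_ot1 + d.day6_ot1 + d.day7_ot1,
           ot2 = d.day1_ot2 + d.day2_ot2 + d.day3_ot2 + d.day4_ot2 + d.day5_ot2 + d.day6_ot2 + d.day7_ot2
    FROM dbo.TimeDetail d
    INNER JOIN dbo.TimeHeader h ON d.docnbr = h.docnbr
    WHERE h.tc_status = 'P'
      AND h.weekEnding BETWEEN '2010-01-01' AND '2010-10-27';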

    -Ki

  • There are a couple of things I want to confirm using your data.

    Firstly, IIRC aggregating the columns separately then adding up the aggregates is quicker than aggregating the sum of the columns across the row.

    Secondly, although you know the order of the data in the child table, I want to force it into the order of the aggregation - which means duplicating two of the parent columns in the child table then creating a clustered index over them, possibly with the original child key as a tie-breaker to get a unique clustered index. You'll still need an index on the original child key :/ but I've got a hunch that it will make sufficient difference to justify it.
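
    To make the first point concrete, this is roughly the shape I have in mind - aggregate each day column on its own, then add the aggregates together (an untested sketch against the test tables above, not a drop-in replacement):

    -- Sketch: per-column aggregation, with the aggregates added afterwards
    SELECT h.empid,
           h.weekEnding,
           reg = SUM(d.day1_reg) + SUM(d.day2_reg) + SUM(d.day3_reg) + SUM(d.day4_reg)
               + SUM(d.day5_reg) + SUM(d.day6_reg) + SUM(d.day7_reg),
           ot1 = SUM(d.day1_ot1) + SUM(d.day2_ot1) + SUM(d.day3_ot1) + SUM(d.day4_ot1)
               + SUM(d.day5_ot1) + SUM(d.day6_ot1) + SUM(d.day7_ot1),
           ot2 = SUM(d.day1_ot2) + SUM(d.day2_ot2) + SUM(d.day3_ot2) + SUM(d.day4_ot2)
               + SUM(d.day5_ot2) + SUM(d.day6_ot2) + SUM(d.day7_ot2)
    FROM dbo.TimeDetail d
    INNER JOIN dbo.TimeHeader h ON d.docnbr = h.docnbr
    WHERE h.tc_status = 'P'
      AND h.weekEnding BETWEEN '2010-01-01' AND '2010-10-27'
    GROUP BY h.empid, h.weekEnding;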

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work (10/28/2011)


    There are a couple of things I want to confirm using your data.

    Firstly, IIRC aggregating the columns separately then adding up the aggregates is quicker than aggregating the sum of the columns across the row.

    Secondly, although you know the order of the data in the child table, I want to force it into the order of the aggregation - which means duplicating two of the parent columns in the child table then creating a clustered index over them, possibly with the original child key as a tie-breaker to get a unique clustered index. You'll still need an index on the original child key :/ but I've got a hunch that it will make sufficient difference to justify it.

    Against the actual data, aggregating the columns separately is just a little quicker than aggregating the sum of the columns across the rows. Slightly higher CPU time, slightly lower elapsed time. But I'll take small improvements. 🙂

    From a real life perspective, I can't change the table structure in this particular case. But I can stage the data differently before it ever gets near this type of reporting - it's not a complicated set of data to send over to a data warehouse, and the response time I'll get on the reporting side may well be worth having the "what does real time actually mean" discussion with my user community.

    -Ki

  • How many rows does the real table have? And I probably missed it above but what is the actual DDL for this table and any indexes it may have? Or is the test code a 100% accurate representation?

    The reason I ask, of course, is that I'm going to build the table to its full size on my own and then experiment with a technique that Peter Larsson dubbed "pre-aggregation" and mix it with one that I call "Divide'n'Conquer".
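
    Roughly speaking, "pre-aggregation" here means collapsing the detail rows down to one row per docnbr before the join ever happens, then rolling that much smaller set up by empid and weekEnding. Something like the following untested sketch (whether the week filter belongs before or after the pre-aggregation is one of the things to test):

    -- Sketch: pre-aggregate the detail table per docnbr, then join and aggregate again
    SELECT h.empid,
           h.weekEnding,
           reg = SUM(pre.reg),
           ot1 = SUM(pre.ot1),
           ot2 = SUM(pre.ot2)
    FROM ( SELECT d.docnbr,
                  reg = SUM(d.day1_reg + d.day2_reg + d.day3_reg + d.day4_reg + d.day5_reg + d.day6_reg + d.day7_reg),
                  ot1 = SUM(d.day1_ot1 + d.day2_ot1 + d.day3_ot1 + d.day4_ot1 + d.day5_ot1 + d.day6_ot1 + d.day7_ot1),
                  ot2 = SUM(d.day1_ot2 + d.day2_ot2 + d.day3_ot2 + d.day4_ot2 + d.day5_ot2 + d.day6_ot2 + d.day7_ot2)
           FROM dbo.TimeDetail d
           GROUP BY d.docnbr
         ) pre
    INNER JOIN dbo.TimeHeader h ON pre.docnbr = h.docnbr
    WHERE h.tc_status = 'P'
      AND h.weekEnding BETWEEN '2010-01-01' AND '2010-10-27'
    GROUP BY h.empid, h.weekEnding;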

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/28/2011)


    How many rows does the real table have? And I probably missed it above but what is the actual DDL for this table and any indexes it may have? Or is the test code a 100% accurate representation?

    The reason I ask, of course, is that I'm going to build the table to its full size on my own and then experiment with a technique that Peter Larsson dubbed "pre-aggregation" and mix it with one that I call "Divide'n'Conquer".

    The actual table that TimeDetail is based on is about 42 million rows. (I'm expecting to have to join it to a couple of companion tables and end up with about 135 million rows of source data before this is done.)

    The DDL I posted is accurate for data types, relationships, and indexes, with two exceptions. I dropped columns that aren't actually used in this particular query (although I'm expecting some of them to come into play later - I'm reasonably sure that the end result of this little project is going to be "cool - now can we see this data sliced in six additional ways?" - which is one of the reasons I'm so concerned about performance for this seemingly simple requirement.)

    There's only one index missing from the test data, and it's on the header table.

    If you want, I'll add in all the additional columns and the missing index and repost the DDL. Your call - just let me know.

    -Ki

    Nope. Good to go here. My only other question is how many rows does your test data jig create? I started it this morning in tempdb and stopped it when tempdb had grown to 10 GB.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • A little over 14 million. But I didn't see anything like that kind of tempdb growth... My current test tempdb is only about 8 MB with 6 MB free.

    -Ki
