Using SUM() In CASE Statement / Best Practices

  • All,

    I'm building a semantic layer within an ETL database to improve our end-to-end performance. As part of this, I'm generating a layer on top of the extraction layer that produces a number of aggregations over smaller sets (I use that term loosely) of data on a different schema, so I can test for performance improvements (if any) against the existing plan.
    Without getting too far into the new design, one of my queries generates a SUM aggregation as below:

    SELECT  fund.SourceName,
            fund.FundID,
            trans.InvestorID,
            fundDates.theDate AS CurrentDate,
            SUM(trans.Capital * (CASE WHEN batch.BatchGLDate <= fundDates.theDate AND batch.BatchStatus = 'Posted' THEN 1 ELSE 0 END)) AS CurrentValue
    FROM    TransAlloc trans
            INNER JOIN Batch batch ON trans.BatchID = batch.BatchID
            INNER JOIN Fund fund ON batch.LEID = fund.FundID
            INNER JOIN FundDates fundDates ON fund.FundID = fundDates.FundID
    WHERE   batch.BatchStatus IN ('Posted', 'Held')
            AND batch.BatchGLDate <= fundDates.theDate
    GROUP BY
            fund.SourceName,
            fund.FundID,
            trans.InvestorID,
            fundDates.theDate

    TransAlloc: 475k records, clustered index on TranID
    Batch: 365k records, clustered index on BatchID
    Fund: 1,344 records, clustered index on FundID
    (I've also created other covering indexes on these tables.)
    FundDates: 417,984 records; this table is populated with every date since '2017-12-31' for each fund:

    ;WITH allDates AS
    (
        SELECT CAST('2017-12-31' AS DATETIME) AS theDate
        UNION ALL
        SELECT DATEADD(day, 1, allDates.theDate)
        FROM allDates
        WHERE DATEADD(day, 1, theDate) <= GETDATE()
    )
    SELECT fund.FundID, d.theDate
    INTO FundDates
    FROM Fund fund
         CROSS JOIN allDates d
    OPTION (MAXRECURSION 0)

    The initial rollout runs extremely slowly because the conditions in the CASE expression have to be evaluated for every batch date against each date in the FundDates table (as well as checking whether the batch status = 'Posted'); with roughly 311 dates per fund (417,984 rows across 1,344 funds), the number of comparisons explodes quickly.
    I'm wondering if anyone can recommend another way of doing this.
    Bear in mind that this is just one aggregation; I have several more to add to the query.

    Other options explored:
    1. Creating this as a table-valued function with the FundID and date as parameters (see the sketch after this list). It returns very quickly, but only runs for one date and one fund. I haven't tried using it within a dynamic SQL query to test performance yet, but this option would let us re-run for specific dates and specific funds.
    2. Creating the aggregation as a sub-query and joining it to the main query.
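    For illustration, the table-valued function in option 1 is roughly along these lines; the function and parameter names here are just placeholders, not my exact code:

    -- Rough sketch of the inline table-valued function idea (names are placeholders).
    -- It runs the same aggregation as the main query, but for a single fund and date.
    CREATE FUNCTION dbo.fn_FundCurrentValue (@FundID INT, @theDate DATETIME)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT  fund.SourceName,
                fund.FundID,
                trans.InvestorID,
                @theDate AS CurrentDate,
                SUM(trans.Capital * (CASE WHEN batch.BatchGLDate <= @theDate AND batch.BatchStatus = 'Posted' THEN 1 ELSE 0 END)) AS CurrentValue
        FROM    TransAlloc trans
                INNER JOIN Batch batch ON trans.BatchID = batch.BatchID
                INNER JOIN Fund fund ON batch.LEID = fund.FundID
        WHERE   fund.FundID = @FundID
                AND batch.BatchStatus IN ('Posted', 'Held')
                AND batch.BatchGLDate <= @theDate
        GROUP BY fund.SourceName, fund.FundID, trans.InvestorID
    );

    -- Usage (hypothetical values): one fund, one date
    -- SELECT * FROM dbo.fn_FundCurrentValue(1, '2018-02-01');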

    I can post the table definitions if need be, but mostly I'm just looking for a nudge in the right direction on how to get better performance out of a query like this. Any help is appreciated.

    Regards

    It's very difficult to troubleshoot performance problems without an execution plan (preferably the actual plan rather than the estimated one). I did look at your query and saw that you have a triangular join, which contributes to the performance problems. I've tried to rewrite the query to get rid of the triangular join.

    ;
    WITH Batch_totals AS
    (
        SELECT
        batch.batchID,
            batch.BatchGLDate,
            batch.batchStatus,
            CASE WHEN batch.batchStatus = 'Held' THEN 0 ELSE COUNT(batch.batchStatus) OVER(PARTITION BY batch.batchID, batch.batchStatus ORDER BY BatchGLDate ROWS UNBOUNDED PRECEDING) END AS CurrentCount
        FROM
            Batch
        WHERE batch.batchStatus IN ( 'Posted', 'Held')
    )
    SELECT
        fund.SourceName,
        fund.FundID,
        trans.InvestorID,
        fundDates.theDate AS CurrentDate,
        trans.Capital * batch.CurrentCount AS CurrentValue
    FROM TransAlloc trans
    INNER JOIN
        Fund fund ON batch.LEID = fund.FundID
    INNER JOIN
        FundDates fundDates ON fund.FundID = fundDates.FundID
    CROSS APPLY
    (
        SELECT TOP (1) CurrentCount
        FROM
            Batch_totals
        WHERE trans.BatchID = batch_totals.BatchID
            AND batch_totals.BatchGLDate <= fundDates.theDate
        ORDER BY batch_totals.batchStatus DESC, batch_totals.BatchGLDate DESC
    ) batch

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

    Triangular Join - I knew there was a name for that (I'd written my result set out on a piece of paper and it looked just like that).

    Thanks Drew. I've been out of the SQL game a while and didn't know about things like ROWS UNBOUNDED PRECEDING for performing these kinds of running totals (quick illustration below for my own notes). If I'm reading your code correctly, the CTE is a holder for all batches in scope, with all Held batches set to 0 and all Posted batches set to 1, ordered by the batch date.
    I think I can sub in an ELSE 1 in the CASE expression instead of the PARTITION BY, since BatchSDSID is unique and won't have more than one status, so I don't need a running total of that here.
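    For anyone else catching up on window functions, the running-total pattern boils down to something like this; the table and column names here are made up:

    -- Made-up example of a windowed running total. For each row, SUM() OVER adds up
    -- Amount from the first row of the account (ordered by theDate) through the
    -- current row, with no self-join required.
    SELECT  AccountID,
            theDate,
            Amount,
            SUM(Amount) OVER (PARTITION BY AccountID
                              ORDER BY theDate
                              ROWS UNBOUNDED PRECEDING) AS RunningTotal
    FROM    dbo.Ledger;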

    However, the CTE makes sense as it gives the specific subset of batches that are in scope for the query (I discovered that my data has future dates that aren't in scope).

    The main performance drag is the SORT operation that comes into the execution plan when I SUM the trans.Capital amount. Let me play around with your version a bit, see what performance looks like and I'll update with the execution plan.
    Thanks again for the help

  • mitzyturbo - Tuesday, November 6, 2018 10:25 AM


    No, you shouldn't be able to use "ELSE 1".  The whole purpose of the windowed function is to replace the SUM.  I had to guess at the formula, because you didn't provide sample data and expected results. 

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • mitzyturbo - Tuesday, November 6, 2018 10:25 AM


    If anyone has more interest in "Triangular Joins", here's the article where that graphic came from.
    http://www.sqlservercentral.com/articles/T-SQL/61539/
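    The short version, as a made-up minimal example (the table and column names are hypothetical, not from the article or this thread):

    -- A triangular join: joining a table to itself (or to a calendar table) on an
    -- inequality. Each row pairs with every earlier-or-equal row, so an N-row table
    -- drives roughly N*N/2 internal row comparisons; that's the "triangle".
    SELECT  cur.theDate,
            SUM(prev.Amount) AS RunningTotal
    FROM    dbo.Ledger cur
            INNER JOIN dbo.Ledger prev ON prev.theDate <= cur.theDate
    GROUP BY cur.theDate;

    In the original query, the batch.BatchGLDate <= fundDates.theDate predicate plays the same role.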

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I got a big improvement by introducing a new index on BatchGLDate and BatchStatus (disregard the covering index name):
    CREATE NONCLUSTERED INDEX [IDX_NI_CA_Batch_CoverIndex2] ON [dbo].[Batch]
    (
        [BatchGLDate] ASC
        ,[BatchStatus] ASC
    )
    INCLUDE ([BatchSDSID]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO
    Right now, it's taking just over 44 seconds to generate just under 60k rows for one fund between '2017-12-31' and '2018-02-01'.

    Execution plan attached; the main bottlenecks are the SORTs used in both the CTE and the CROSS APPLY query.

    On the back of this, I tried to take some of the burden off the CROSS APPLY by adding a row number in the CTE and applying a filter for the first record, avoiding the use of TOP (1):
    ;WITH Batch_totals AS
    (
        SELECT
            batch.BatchSDSID,
            batch.BatchGLDate,
            batch.BatchStatus,
            CASE WHEN batch.BatchStatus = 'Held' THEN 0
                 ELSE COUNT(batch.BatchStatus)
                      OVER (PARTITION BY batch.BatchSDSID, batch.BatchStatus ORDER BY BatchGLDate ROWS UNBOUNDED PRECEDING)
            END AS CurrentCount,
            ROW_NUMBER() OVER (PARTITION BY batch.BatchSDSID ORDER BY BatchGLDate DESC, batch.BatchStatus DESC) AS RowNum
        FROM
            SDSDW.Batch batch
        WHERE batch.BatchStatus IN ('Posted', 'Held')
            AND batch.BatchGLDate BETWEEN '2017-12-31' AND '2018-02-01'
    )
    SELECT
        fund.SourceName,
        fund.FundID,
        trans.InvestorID,
        fundDates.theDate AS CurrentDate,
        trans.TotalPartnersCapital * batch.CurrentCount AS CurrentValue
    FROM SDSDW.TransAlloc trans
    INNER JOIN SDSDW.Fund fund
        ON trans.LESDSID = fund.FundSDSID
    INNER JOIN SDSDW.FundDates fundDates
        ON fund.FundSDSID = fundDates.FundSDSID
    CROSS APPLY
    (
        SELECT CurrentCount
        FROM Batch_totals
        WHERE trans.BatchSDSID = Batch_totals.BatchSDSID
            AND Batch_totals.BatchGLDate <= fundDates.theDate
            AND RowNum = 1
    ) batch
    WHERE trans.LESDSID = 891
        AND fundDates.theDate BETWEEN '2017-12-31' AND '2018-02-01'
    ORDER BY fundDates.theDate

    Huge gain: it only took 1 second to return the 60k rows, and the totals match.
    I'll expand this out to see how it handles the different calcs (some will require different criteria), but this has been a huge help, Drew.
    Thanks again
