Cumulative count distinct for a period of 3 months

  • Change your point of view.  Instead of looking back to see which customers were active, spread your customers' activities forward.  The following only takes one scan and one logical read.

    SELECT DATEPART(YEAR, a.reportdate) AS Reportyear
    , DATEPART(MONTH, a.reportdate) AS Reportmonth
    , COUNT(DISTINCT sd.customer_id) AS c_count
    FROM #SampleData AS sd
    CROSS APPLY (VALUES(EOMONTH(sd.transactiondate))
    ,(EOMONTH(sd.transactiondate, 1))
    ,(EOMONTH(sd.transactiondate, 2))
    ) a(reportdate) -- customer active months
    WHERE a.reportdate < GETDATE()
    GROUP BY a.reportdate

    Updated to use the optional EOMONTH() parameter instead of using DATEADD().

    Drew

    • This reply was modified 1 year, 10 months ago by  drew.allen.

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • It's an interesting approach Drew.  As I see it it's a partial self-CROSS JOIN then summarize instead of summarize then index JOIN.  My suspicion was confirmed using Jeff's test harness (which has an index (By_TDate_CustID) on 'transaction_date' and INCLUDE'ed the customer_id).  Well, I tested with 100,000 rows and 1,000,000 rows.  Jeff's test instance 'JBMTestSSD' must be quite powerful because I ran out of patience trying this with 10,000,000 rows.  Ha, it must be nice.  The result it seems the larger the row count the index JOIN advantage increases non-linearly

    /* Jeff's test table and same data */--===== If the table exists, drop it to make reruns in SSMS easier.
    DROP TABLE IF EXISTS #SampleData;
    GO
    --===== Create the table with what most people use for a Clustered PK.
    -- Probably not the right CI for this table, though.
    CREATE TABLE #SampleData
    (
    transaction_id BIGINT IDENTITY(1,1) PRIMARY KEY CLUSTERED
    ,date_transaction DATE NOT NULL
    ,customer_id INT NOT NULL
    ,sales_amount DECIMAL(9,2) NOT NULL
    )
    ;
    --===== Populate the table with 10 million rows of randomized data for 100,000 different customer_id's
    -- across 48 months of time.
    -- This takes advantage of "Minimal Logging" if you're NOT in the FULL Recovery Model
    -- and takes a little over 9 seconds on my box.
    INSERT INTO #SampleData WITH (TABLOCK)
    (date_transaction,customer_id,sales_amount)
    SELECT date_transaction = DATEADD(dd,ABS(CHECKSUM(NEWID())%DATEDIFF(dd,'01 JAN 2019','01 JAN 2023')),'01 JAN 2019')
    ,customer_id = ABS(CHECKSUM(NEWID())%100000)+1
    ,sales_amount = RAND(CHECKSUM(NEWID()))*1000
    FROM dbo.fnTally(1,100000)
    ORDER BY date_transaction
    ;
    --===== Create a non-clustered index to support the query we're working on.
    -- Same thing here. "Minimal Logging" if you're NOT in the FULL Recovery Model.
    -- Takes a little over 2 seconds on my box.
    CREATE INDEX By_TDate_CustID ON #SampleData(date_transaction) INCLUDE(customer_id)
    ;

    /* Drew's code */ SET STATISTICS TIME,IO ON;
    SELECT DATEPART(YEAR, a.reportdate) AS Reportyear
    , DATEPART(MONTH, a.reportdate) AS Reportmonth
    , COUNT(DISTINCT sd.customer_id) AS c_count
    --select count(*)
    FROM #SampleData AS sd
    CROSS APPLY (VALUES(EOMONTH(sd.date_transaction))
    ,(EOMONTH(sd.date_transaction, 1))
    ,(EOMONTH(sd.date_transaction, 2))
    ) a(reportdate) -- customer active months
    WHERE a.reportdate < GETDATE()
    GROUP BY a.reportdate
    SET STATISTICS TIME,IO OFF;

    /* Steve's code */ SET STATISTICS TIME,IO ON;
    with unq_dt_cte as (
    select distinct eomonth(date_transaction) t_dt
    from #sampledata)
    select datepart(year, t_dt) as Reportyear,
    datepart(month, t_dt) as ReportMonth,
    c.c_count
    from unq_dt_cte dt
    cross apply (select count(distinct s.customer_id)
    from #SampleData s
    where s.date_transaction > eomonth(dt.t_dt, -3)
    and s.date_transaction < dateadd(day, 1, dt.t_dt)) c(c_count)
    order by t_dt;
    SET STATISTICS TIME,IO OFF;

    /* Jeff's code */ SET STATISTICS TIME,IO ON
    ;
    WITH cteMonths AS
    (--==== Creates a list of months and month range dates for eadh month.
    SELECT LoMonth = DATEADD(mm,-2,mm.CurMonth)
    ,CurMonth = mm.CurMonth
    ,CoMonth = DATEADD(mm,1,mm.CurMonth)
    FROM #sampledata sd
    CROSS APPLY (VALUES(DATEADD(dd,1,EOMONTH(date_transaction,-1))))mm(CurMonth)
    GROUP BY DATEADD(dd,1,EOMONTH(date_transaction,-1))
    )--==== The "Display Layer"
    SELECT MonthDate = m.CurMonth
    ,Year# = DATEPART(yy,m.CurMonth)
    ,Month# = DATEPART(mm,m.CurMonth)
    ,DistinctCnt = c.Count3Mo
    ,PeriodDesc = CONCAT(SUBSTRING(CONVERT(NVARCHAR(20),m.LoMonth,106),4,20)
    ,' thru '
    ,SUBSTRING(CONVERT(NVARCHAR(20),m.CurMonth,106),4,20))
    FROM cteMonths m
    CROSS APPLY (--==== This gets the count of distinct IDs for each rolling 3 month period.
    SELECT COUNT(DISTINCT customer_id)
    FROM #sampledata sd
    WHERE sd.date_transaction >= m.LoMonth
    AND sd.date_transaction < m.CoMonth)c(Count3Mo)
    ORDER BY MonthDate
    ;
    SET STATISTICS TIME,IO OFF;

    Results from 100,000 row test

        (42 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Workfile'. Scan count 6, logical reads 480, physical reads 36, page server reads 0, read-ahead reads 444, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#SampleData_________________________________________________________________________________________________________00000001E792'. Scan count 1, logical reads 261, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 16 ms, elapsed time = 2335 ms.
    SQL Server parse and compile time:
    CPU time = 4 ms, elapsed time = 4 ms.

    (48 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#SampleData_________________________________________________________________________________________________________00000001E792'. Scan count 49, logical reads 1191, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 110 ms, elapsed time = 115 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 5 ms.

    (48 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#SampleData_________________________________________________________________________________________________________00000001E792'. Scan count 49, logical reads 1191, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 62 ms, elapsed time = 257 ms.

    Results from 1,000,000 row test

        (42 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Workfile'. Scan count 16, logical reads 7040, physical reads 775, page server reads 0, read-ahead reads 6449, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#SampleData_________________________________________________________________________________________________________00000001E786'. Scan count 1, logical reads 2606, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 359 ms, elapsed time = 37091 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 4 ms.

    (48 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#SampleData_________________________________________________________________________________________________________00000001E786'. Scan count 49, logical reads 10480, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 94 ms, elapsed time = 3762 ms.
    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 18 ms.

    (48 row(s) affected)

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
    Table '#SampleData_________________________________________________________________________________________________________00000001E786'. Scan count 49, logical reads 10480, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 15 ms, elapsed time = 4544 ms.

    Jeff and my code both scale similarly and there's tiny difference +/- run-to-run either way

     

    • This reply was modified 1 year, 10 months ago by  Steve Collins.

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

Viewing 2 posts - 16 through 16 (of 16 total)

You must be logged in to reply to this topic. Login to reply