Query calculation over a partition

  • Wonder if you can help me with a query problem.

    I have the following style table that runs into 10,000,000+ rows.

    CREATE TABLE #example (

    [customerId] [smallint] NOT NULL,

    [window] [nvarchar](50) NULL,

    [product] [smallint] NOT NULL,

    [sales] [smallint] NULL)

    Go

    INSERT INTO #example

    (customerId, window,product, sales)

    values (1,'0-30', 1, 100)

    ,(1,'31-60', 2, 110)

    ,(1,'61-90', 3, 130)

    ,(2,'0-30', 1, null)

    ,(2,'31-60', 2, null)

    ,(2,'61-90', 4,null )

    ,(2,'91-120', 5, null)

    ,(3,'0-30', 3,120)

    ,(3,'31-60', 4, null)

    ,(3,'61-90', 6, null)

    GO

    I need to create a calculated column called Class that contains 0 if all the sales for each customer is null for all the windows else 1.

    For instance, CustomerId 1, 3 would be 1, CustomerId 2 would be 0.

    I suppose I can join to a grouped by query but wondered what the optimal query might be for a table this size.

    Best

    Lee

  • SELECT *, CAST(COUNT(sales) OVER(PARTITION BY e.customerId) AS BIT)

    FROM #example e

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • This could be quite a bit more efficient. That COUNT gets especially expensive as the number of matching rows goes up.

    SELECT *,

    CASE WHEN s.sf = 1 THEN 1 ELSE 0 END AS sales_flag

    FROM #example e

    OUTER APPLY (SELECT TOP 1 1 AS sf

    FROM #example e2

    WHERE e.customerid = e2.customerid

    AND e2.sales IS NOT NULL) AS s

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Since you are on a 2012 forum then why not use a window function?

    😎

    Note that a POC index like this would be very helpful

    CREATE NONCLUSTERED INDEX NCLIDX_#EXAMPLE_CUST_SALES_INCL_WIN_PRODUCT ON #example

    (customerId ASC, sales ASC) INCLUDE ( window,product);

    SELECT

    E.customerId

    ,E.window

    ,E.product

    ,E.sales

    ,SIGN(ISNULL(MAX(E.sales) OVER

    (

    PARTITION BY E.customerId

    ),0)) AS FLG

    FROM #example E;

    Output

    customerId window product sales FLG

    ---------- -------- ------- ------ -----------

    1 0-30 1 100 1

    1 31-60 2 110 1

    1 61-90 3 130 1

    2 0-30 1 NULL 0

    2 31-60 2 NULL 0

    2 61-90 4 NULL 0

    2 91-120 5 NULL 0

    3 31-60 4 NULL 1

    3 61-90 6 NULL 1

    3 0-30 3 120 1

  • Thanks all for sharing. I'll let you know which has the highest performance. Haven't used the SIGN() function before 🙂

    Best

    Lee

  • SIGN is one of the things used in the '60s to develop characteristic functions (the precursor to the modern CASE statement), which allowed incredibly powerful data processing in a single pass through the data. If you think we are constrained by IO now just imagine what it was like then. 😀

    Oh, I suspect that the MAX will suffer the same issue as COUNT when there isn't a POC index in place. It can't shortcut like TOP 1 (or EXISTS, which could possibly be employed here too) can and will have to touch all rows. Having a good index in place will most certainly help both MAX and TOP 1 (I think) though.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • select customerId, window, product, sales,

    ( Case when count(sales) over(partition by customerid order by customerid) >0 then 1 else 0 end)as class

    from #example

  • santiagoc93 (5/27/2016)


    select customerId, window, product, sales,

    ( Case when count(sales) over(partition by customerid order by customerid) >0 then 1 else 0 end)as class

    from #example

    Very nice!

    😎

    The addition of the ORDER BY eliminates the table spool operators and the query turns into a single index scan with stream aggregate if there is a POC index on the table .

  • Eirikur Eiriksson (5/28/2016)


    santiagoc93 (5/27/2016)


    select customerId, window, product, sales,

    ( Case when count(sales) over(partition by customerid order by customerid) >0 then 1 else 0 end)as class

    from #example

    Very nice!

    😎

    The addition of the ORDER BY eliminates the table spool operators and the query turns into a single index scan with stream aggregate if there is a POC index on the table .

    But it still suffers from the "count everything" problem which gets ugly as the number of matching rows increases per customerid.

    It would be nice to see some performance comparisons of the methods for this since it is a very common request. Something tickles the back of my mind though that Itzik has already done this...

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (5/28/2016)


    Eirikur Eiriksson (5/28/2016)


    santiagoc93 (5/27/2016)


    select customerId, window, product, sales,

    ( Case when count(sales) over(partition by customerid order by customerid) >0 then 1 else 0 end)as class

    from #example

    Very nice!

    😎

    The addition of the ORDER BY eliminates the table spool operators and the query turns into a single index scan with stream aggregate if there is a POC index on the table .

    But it still suffers from the "count everything" problem which gets ugly as the number of matching rows increases per customerid.

    It would be nice to see some performance comparisons of the methods for this since it is a very common request. Something tickles the back of my mind though that Itzik has already done this...

    You are absolutely right on the "count everything" problem which makes this method at least 4 times slower than the TOP 1 one.

    😎

    My thought is that pre-grouping without aggregation will be the fastest way of doing this, will peace together an example/test harness when I have the time.

  • Pre-grouping works quite well in this case, especially if there is a filtered index on the the table for customerID where sales is not null.

    😎

    ;WITH CUSTOMERS_SALE AS

    (

    SELECT

    SSF.customerId

    FROM #example SSF

    WHERE SSF.sales IS NOT NULL

    GROUP BY SSF.customerId

    )

    SELECT

    SF.customerId

    ,SF.window

    ,SF.product

    ,SF.sales

    ,SIGN(ISNULL(CSALE.customerId,0)) AS SALE_FLAG

    FROM #example SF

    LEFT OUTER JOIN CUSTOMERS_SALE CSALE

    ON SF.customerId = CSALE.customerId;

    Quick and simple test harness

    USE tempdb;

    GO

    SET NOCOUNT ON;

    --/*

    IF OBJECT_ID(N'dbo.TBL_SAMPLE_SALE_FLAG') IS NOT NULL DROP TABLE dbo.TBL_SAMPLE_SALE_FLAG;

    CREATE TABLE dbo.TBL_SAMPLE_SALE_FLAG

    (

    SSF_ID INT NOT NULL CONSTRAINT PK_DBO_TBL_SAMPLE_SALE_FLAG_SALE_SSF_ID PRIMARY KEY CLUSTERED

    ,CUST_ID INT NOT NULL

    ,WIND_ID INT NOT NULL

    ,PROD_ID INT NOT NULL

    ,SALE_VAL INT NULL

    );

    DECLARE @SAMPLE_SIZE INT = 1000000;

    DECLARE @WINDOW_COUNT INT = 12;

    DECLARE @CUSTOMER_COUNT INT = 500000;

    DECLARE @PRODUCT_COUNT INT = 1000;

    DECLARE @NO_SALE_SEED INT = 3;

    DECLARE @MAX_SALE_VAL INT = 10000;

    ;WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    INSERT INTO dbo.TBL_SAMPLE_SALE_FLAG WITH (TABLOCK) (SSF_ID,CUST_ID,WIND_ID,PROD_ID,SALE_VAL)

    SELECT

    NM.N AS SSF_ID

    ,ABS(CHECKSUM(NEWID())) % @CUSTOMER_COUNT AS CUST_ID

    ,ABS(CHECKSUM(NEWID())) % @WINDOW_COUNT AS WIND_ID

    ,ABS(CHECKSUM(NEWID())) % @PRODUCT_COUNT AS PROD_ID

    ,ABS(CHECKSUM(NEWID())) % @MAX_SALE_VAL

    + NULLIF(ABS(CHECKSUM(NEWID())) % @NO_SALE_SEED,0) AS SALE_VAL

    FROM NUMS NM;

    -- POC INDICES

    CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_SAMPLE_SALE_FLAG_CUST_ID_SALE_VAL_INCL_WIN_ID_PROD_ID

    ON dbo.TBL_SAMPLE_SALE_FLAG (CUST_ID ASC, SALE_VAL ASC) INCLUDE (WIND_ID,PROD_ID);

    CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_SAMPLE_SALE_FLAG_CUST_ID_SALE_VAL

    ON dbo.TBL_SAMPLE_SALE_FLAG (CUST_ID ASC,SALE_VAL ASC) WHERE (SALE_VAL IS NOT NULL);

    CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_SAMPLE_SALE_FLAG_SALE_VAL_INCL_CUST_ID_WIND_ID

    ON dbo.TBL_SAMPLE_SALE_FLAG (SALE_VAL ASC) INCLUDE (CUST_ID,WIND_ID) WHERE (SALE_VAL IS NOT NULL);

    -- */

    DECLARE @timer TABLE (T_TEXT VARCHAR(50) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT(SYSDATETIME()));

    DECLARE @INT_BUCKET01 INT = 0;

    DECLARE @INT_BUCKET02 INT = 0;

    DECLARE @INT_BUCKET03 INT = 0;

    DECLARE @INT_BUCKET04 INT = 0;

    DECLARE @BIT_BUCKET01 INT = 0;

    -- First round

    RAISERROR ('--- DRY RUN ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('DRY RUN');

    SELECT

    @INT_BUCKET01 = E.CUST_ID

    ,@INT_BUCKET02 = E.WIND_ID

    ,@INT_BUCKET03 = E.PROD_ID

    ,@INT_BUCKET04 = E.SALE_VAL

    FROM dbo.TBL_SAMPLE_SALE_FLAG E

    ORDER BY CUST_ID ASC

    ,SALE_VAL ASC

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('DRY RUN');

    RAISERROR ('--- MAX ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('MAX');

    SELECT

    @INT_BUCKET01 = E.CUST_ID

    ,@INT_BUCKET02 = E.WIND_ID

    ,@INT_BUCKET03 = E.PROD_ID

    ,@INT_BUCKET04 = E.SALE_VAL

    ,@BIT_BUCKET01 = SIGN(ISNULL(CONVERT(INT,MAX(E.SALE_VAL) OVER

    (

    PARTITION BY E.CUST_ID

    ORDER BY E.CUST_ID

    ),0),0)) --AS FLG

    FROM dbo.TBL_SAMPLE_SALE_FLAG E

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('MAX');

    RAISERROR ('--- COUNT ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('COUNT');

    SELECT

    @INT_BUCKET01 = E.CUST_ID

    ,@INT_BUCKET02 = E.WIND_ID

    ,@INT_BUCKET03 = E.PROD_ID

    ,@INT_BUCKET04 = E.SALE_VAL

    ,@BIT_BUCKET01 = CASE WHEN COUNT(E.SALE_VAL) OVER

    (

    PARTITION BY E.CUST_ID

    ORDER BY E.CUST_ID

    ) > 0 THEN 1 ELSE 0 END --AS FLG

    FROM dbo.TBL_SAMPLE_SALE_FLAG E

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('COUNT');

    RAISERROR ('--- TOP ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('TOP');

    SELECT

    @INT_BUCKET01 = e.CUST_ID

    ,@INT_BUCKET02 = e.WIND_ID

    ,@INT_BUCKET03 = e.PROD_ID

    ,@INT_BUCKET04 = e.SALE_VAL

    ,@BIT_BUCKET01 = CASE WHEN s.sf = 1 THEN 1 ELSE 0 END --AS sales_flag

    FROM dbo.TBL_SAMPLE_SALE_FLAG e

    OUTER APPLY (SELECT TOP 1 1 AS sf

    FROM dbo.TBL_SAMPLE_SALE_FLAG e2

    WHERE e.CUST_ID = e2.CUST_ID

    AND e2.SALE_VAL IS NOT NULL) AS s

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('TOP');

    RAISERROR ('--- PRE GROUP ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('PRE GROUP');

    ;WITH CUSTOMERS_SALE AS

    (

    SELECT

    SSF.CUST_ID

    FROM dbo.TBL_SAMPLE_SALE_FLAG SSF

    WHERE SSF.SALE_VAL IS NOT NULL

    GROUP BY SSF.CUST_ID

    )

    SELECT

    @INT_BUCKET01 = SF.CUST_ID

    ,@INT_BUCKET02 = SF.WIND_ID

    ,@INT_BUCKET03 = SF.PROD_ID

    ,@INT_BUCKET04 = SF.SALE_VAL

    ,@BIT_BUCKET01 = SIGN(ISNULL(CSALE.CUST_ID,0)) --AS SALE_FLAG

    FROM dbo.TBL_SAMPLE_SALE_FLAG SF

    LEFT OUTER JOIN CUSTOMERS_SALE CSALE

    ON SF.CUST_ID = CSALE.CUST_ID

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('PRE GROUP');

    -- Second round

    RAISERROR ('--- DRY RUN 02 ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('DRY RUN 02');

    SELECT

    @INT_BUCKET01 = E.CUST_ID

    ,@INT_BUCKET02 = E.WIND_ID

    ,@INT_BUCKET03 = E.PROD_ID

    ,@INT_BUCKET04 = E.SALE_VAL

    FROM dbo.TBL_SAMPLE_SALE_FLAG E

    ORDER BY CUST_ID ASC

    ,SALE_VAL ASC

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('DRY RUN 02');

    RAISERROR ('--- MAX 02 ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('MAX 02');

    SELECT

    @INT_BUCKET01 = E.CUST_ID

    ,@INT_BUCKET02 = E.WIND_ID

    ,@INT_BUCKET03 = E.PROD_ID

    ,@INT_BUCKET04 = E.SALE_VAL

    ,@BIT_BUCKET01 = SIGN(ISNULL(CONVERT(INT,MAX(E.SALE_VAL) OVER

    (

    PARTITION BY E.CUST_ID

    ORDER BY E.CUST_ID

    ),0),0)) --AS FLG

    FROM dbo.TBL_SAMPLE_SALE_FLAG E

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('MAX 02');

    RAISERROR ('--- COUNT 02 ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('COUNT 02');

    SELECT

    @INT_BUCKET01 = E.CUST_ID

    ,@INT_BUCKET02 = E.WIND_ID

    ,@INT_BUCKET03 = E.PROD_ID

    ,@INT_BUCKET04 = E.SALE_VAL

    ,@BIT_BUCKET01 = CASE WHEN COUNT(E.SALE_VAL) OVER

    (

    PARTITION BY E.CUST_ID

    ORDER BY E.CUST_ID

    ) > 0 THEN 1 ELSE 0 END --AS FLG

    FROM dbo.TBL_SAMPLE_SALE_FLAG E

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('COUNT 02');

    RAISERROR ('--- TOP 02 ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('TOP 02');

    SELECT

    @INT_BUCKET01 = e.CUST_ID

    ,@INT_BUCKET02 = e.WIND_ID

    ,@INT_BUCKET03 = e.PROD_ID

    ,@INT_BUCKET04 = e.SALE_VAL

    ,@BIT_BUCKET01 = CASE WHEN s.sf = 1 THEN 1 ELSE 0 END --AS sales_flag

    FROM dbo.TBL_SAMPLE_SALE_FLAG e

    OUTER APPLY (SELECT TOP 1 1 AS sf

    FROM dbo.TBL_SAMPLE_SALE_FLAG e2

    WHERE e.CUST_ID = e2.CUST_ID

    AND e2.SALE_VAL IS NOT NULL) AS s

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('TOP 02');

    RAISERROR ('--- PRE GROUP 02 ---',0,0) WITH NOWAIT;

    INSERT INTO @timer(T_TEXT) VALUES('PRE GROUP 02');

    ;WITH CUSTOMERS_SALE AS

    (

    SELECT

    SSF.CUST_ID

    FROM dbo.TBL_SAMPLE_SALE_FLAG SSF

    WHERE SSF.SALE_VAL IS NOT NULL

    GROUP BY SSF.CUST_ID

    )

    SELECT

    @INT_BUCKET01 = SF.CUST_ID

    ,@INT_BUCKET02 = SF.WIND_ID

    ,@INT_BUCKET03 = SF.PROD_ID

    ,@INT_BUCKET04 = SF.SALE_VAL

    ,@BIT_BUCKET01 = SIGN(ISNULL(CSALE.CUST_ID,0)) --AS SALE_FLAG

    FROM dbo.TBL_SAMPLE_SALE_FLAG SF

    LEFT OUTER JOIN CUSTOMERS_SALE CSALE

    ON SF.CUST_ID = CSALE.CUST_ID

    OPTION (MAXDOP 1);

    INSERT INTO @timer(T_TEXT) VALUES('PRE GROUP 02');

    SELECT

    T.T_TEXT

    ,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

    FROM @timer T

    GROUP BY T.T_TEXT

    ORDER BY DURATION;

    Results on 2nd Gen i5 / SQL Server 2014

    T_TEXT DURATION

    --------------- -----------

    DRY RUN 02 190001

    DRY RUN 200000

    PRE GROUP 440000

    PRE GROUP 02 440000

    TOP 2530004

    TOP 02 2570004

    MAX 02 10675016

    COUNT 10700015

    MAX 10796016

    COUNT 02 10810015

    Edit: Corrected the wrong "pre-group" code snippet.

  • I had a the best results from this query. Works really well over the huge table. Thanks 🙂

  • leehbi (6/2/2016)


    I had a the best results from this query. Works really well over the huge table. Thanks 🙂

    Which query is "this" query? A lot of them have been posted.

    Never mind... it's gotta be the one marked as "This worked for the OP". :blush:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/4/2016)


    leehbi (6/2/2016)


    I had a the best results from this query. Works really well over the huge table. Thanks 🙂

    Which query is "this" query? A lot of them have been posted.

    Never mind... it's gotta be the one marked as "This worked for the OP". :blush:

    Good question, huge difference in performance but all suggestions return the correct results.

    😎

  • Thanks again for sharing. Modern T-SQL gives us many ways to skin a cat.

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply