Odd (to me) Behavior with NTILE() and NULL Values in Source Data

  • I think in this case you weren't hijacking at all (I believe you said you arrived here via Google for a similar issue), and I've appreciated the additional testing and discussion. Thanks for adding additional value to the thread,

    Rich

  • Just a quick update:

    I have a good solution that I'm still testing (note my earlier comment). It has taken me longer than expected. I'll try to post it in the morning.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Ok, here we go. This may be a bit much to follow but the key take away is that I'm showing you an excellent alternative to NTILE using a tally table[/url] and CROSS APPLY[/url].

    Code for the tally table

    Note that this solution uses a permanent tally table (a CTE tally or "getnumbers" function will result in much poorer performance.)

    IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;

    IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;

    CREATE TABLE dbo.tally (N int NOT NULL);

    INSERT dbo.tally

    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a, sys.all_columns b;

    ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR=100;

    ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);

    The NTally Inline Table Valued Function

    Play around with the example code inside the comments to better understand how you would use dbo.NTally to replace NTile.

    IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;

    GO

    CREATE FUNCTION dbo.NTally(@tiles bigint, @rows bigint)

    /****************************************************************************************

    Purpose:

    Returns a tally table with "tile groups" and can be used as an alternative to the T-SQL

    NTILE function introduced in SQL Server 2005. See the usage examples below for more

    details on how to use this function.

    Compatibility:

    SQL Server 2005+

    Syntax:

    SELECT rn, tile

    FROM dbo.NTally(@tiles, @rows);

    Parameters:

    @tiles = bigint; requested number of tile groups (same as the parameter passed to NTILE)

    @rows = bigint; the number of rows to be "tiled" (have group number assigned to it)

    Return Types:

    Inline Table Valued Function returns:

    rn = bigint; a row number beginning with 1 and ending with @rows

    tile = int; a "tile number" or group number the same

    Developer Notes:

    1. Requires a tally table named dbo.tally. Run the code below to create a correctly

    indexed tally table (note the "Beginning" and "End" of tally code).

    --===== Beginning of dbo.tally code

    -- Drop if NTALLY function and tally table if they exist

    IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;

    IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;

    -- Create the tally table

    CREATE TABLE dbo.tally (N int NOT NULL);

    -- Insert the numbers 1 through 1,000,0000 into dbo.tally

    INSERT dbo.tally

    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a, sys.all_columns b;

    -- Create required primary key, clustered index and unique nonclustered indexes

    ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N)

    WITH FILLFACTOR=100;

    -- Note that, when present, the optimizer will chose the unique nonclustered index

    -- which gets the job done with fewer reads. This index is mandetory for dbo.NTally

    -- to function optimally

    ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);

    --===== End of tally table code

    Note that, if you need to use a different schema and/or name for your tally table

    make sure to change the tally table schema/name in the function to match. To learn

    more about tally tables see http://www.sqlservercentral.com/articles/T-SQL/62867/

    2. With R as the number of rows in your tally table the maximum number of rows this

    function will create is (R*R) for each "tile" group per partition. R also represents

    the maximum number of tile groups available. A one million row tally table will requires

    roughly 20MB of uncompressed disk space and will support up to one million tile groups

    with up to one trillion rows per tile group.

    3. For best results make sure that a P.O.C. index is available for your query. For more

    information about P.O.C. indexes see:

    http://sqlmag.com/sql-server-2012/sql-server-2012-how-write-t-sql-window-functions-part-3

    4. NTally is deterministic; for more about deterministic and nondeterministic functions

    see https://msdn.microsoft.com/en-us/library/ms178091.aspx

    Examples:

    --===== 1. Demonstrating how the function mimics NTILE

    -- To better understand NTally, run the code below by running with different values

    -- assigned to @rows and @tiles.

    DECLARE @rows bigint = 8, @tiles bigint = 3;

    SELECT rn, tile, NTILE(@tiles) OVER (ORDER BY rn) as [NTILE]

    FROM dbo.NTally(@tiles, @rows);

    --===== 2. Using NTally as a faster alternative to NTILE (with no PARTITION BY clause)

    -- Run the code below from <START> to <END>.

    -- Note how you get the same result but how, the more rows you add, the more efficient

    -- the NTALLY solution is, with respect to reads, when compared to NTILE:

    -- e.g. NTILE against 100K rows = 200K+ reads, only 560+ reads for the the NTally method

    -- <START>

    -- Declare variables

    DECLARE @rows bigint = 8, @tiles bigint = 5;

    -- Setup sample data

    DECLARE @SomeTable TABLE (SomeValue int primary key);

    INSERT @SomeTable

    SELECT TOP(@rows) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))*5

    FROM sys.all_columns a, sys.all_columns b;

    -- How to divide @some table into 3 tile groups using NTILE

    SET STATISTICS IO ON;

    PRINT 'NTILE version:';

    SELECT SomeValue, NTILE(@tiles) OVER (ORDER BY SomeValue) AS TileGroup

    FROM @SomeTable;

    -- How to divide @SomeTable into 3 tile groups using NTally

    PRINT CHAR(10)+'NTally version:';

    WITH anchor AS

    (

    SELECT SomeValue, ROW_NUMBER() OVER (ORDER BY SomeValue) AS rn

    FROM @SomeTable

    )

    SELECT SomeValue, nt.tile AS TileGroup

    FROM anchor a

    CROSS APPLY dbo.NTally(@tiles, (SELECT COUNT(*) FROM @SomeTable)) nt

    WHERE a.rn = nt.rn;

    SET STATISTICS IO OFF;

    -- <END>

    --===== 3. Using NTally an alternative to NTILE with a PARTITION BY clause

    -- Create sample table with 10 rows and 3 partitions

    IF OBJECT_ID('tempdb..#SomeTable') IS NOT NULL DROP TABLE #SomeTable;

    CREATE TABLE #SomeTable

    (

    PartitionKey int NOT NULL,

    SomeValue int NOT NULL,

    CONSTRAINT pk_SomeTable PRIMARY KEY(PartitionKey,SomeValue)

    );

    INSERT #SomeTable

    SELECT TOP (12)

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))/5+1,

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))*5

    FROM sys.all_columns;

    -- Using NTILE and PARTITION BY

    SELECT

    s.PartitionKey,

    s.SomeValue,

    NTILE(3) OVER (PARTITION BY s.PartitionKey ORDER BY s.SomeValue) AS TileGroup

    FROM #SomeTable s;

    -- Using the NTally function

    WITH

    anchor AS -- Use ROW_NUMBER for your partitioning and sorting

    (

    SELECT

    rn = ROW_NUMBER() OVER (PARTITION BY PartitionKey ORDER BY SomeValue),

    PartitionKey,

    SomeValue

    FROM #SomeTable v

    ),

    parts AS -- collect the number of rows per partition

    (

    SELECT PartitionKey, mxrn = MAX(rn)

    FROM anchor

    GROUP BY PartitionKey

    )

    SELECT a.PartitionKey, a.SomeValue, nt.tile AS TileGroup

    FROM parts p

    CROSS APPLY dbo.NTally(3,mxrn) nt

    CROSS APPLY anchor a

    WHERE p.PartitionKey = a.PartitionKey AND a.rn = nt.rn;

    DROP TABLE #SomeTable;

    ---------------------------------------------------------------------------------------

    Revision History:

    Rev 00 - 20140501 - Initial Creation - Alan Burstein

    Rev 01 - 20160224 - Final touches and optimization including comments - Alan Burstein

    ****************************************************************************************/

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    calculate_tiles AS -- Calculate the number of tiles per tile group

    (

    SELECT t.N, tile = (@rows/@tiles) + CASE WHEN t.N <= (@rows%@tiles) THEN 1 ELSE 0 END

    FROM dbo.tally t

    WHERE t.N <= @tiles

    ),

    assemble_tiles AS

    (

    SELECT tile = topn.N

    FROM calculate_tiles ct

    CROSS APPLY

    (

    SELECT TOP(ct.tile) N = ct.N

    FROM dbo.tally t1 CROSS JOIN dbo.tally t2

    ) topn

    )

    SELECT TOP 100 PERCENT

    rn = ROW_NUMBER() OVER (ORDER BY a.tile), -- Your anchor row

    a.tile

    FROM assemble_tiles a

    ORDER BY a.tile; -- Your spoon

    GO

    A Quick premier on how to use dbo.NTally to replace NTILE in your requirement. Looking at my solution from earlier in this thread let's start with how we'd use dbo.NTally to calculate Q1 instead of NTILE.

    -- what you have now

    SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2

    FROM #t

    WHERE val1 IS NOT NULL;

    -- new and improved code using my NTally function

    WITH

    anchor AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2

    FROM #t

    WHERE val1 IS NOT NULL

    )

    SELECT ID, val1, Q1 = nt.tile, val2

    FROM anchor a

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

    WHERE a.rn = nt.rn;

    Yes, that's more code but, in a 1,000,000 row test the NTILE method generates nearly 3,000,0000 reads whereas the dbo.NTally method only generates fewer than 1,000 reads.

    Next lets look at how we would improve the UNION ALL part of the query...

    SELECT ID, val1, Q1 = NULL, val2

    FROM #t

    WHERE val1 IS NULL

    UNION ALL

    SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2

    FROM #t

    WHERE val1 IS NOT NULL;

    -- new and improved code using my NTally function

    WITH

    anchor AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2

    FROM #t

    WHERE val1 IS NOT NULL

    )

    SELECT ID, val1, Q1 = NULL, val2

    FROM #t

    WHERE val1 IS NULL

    UNION ALL

    SELECT ID, val1, Q1 = nt.tile, val2

    FROM anchor a

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

    WHERE a.rn = nt.rn;

    Looking at my previous solution it would get too ugly and complex to replace NTile with my dbo.NTally function all in one query. Here's what it would look like to include a Q2:

    -- what you have now

    PRINT 'NTILE';

    WITH

    cteQ1 AS

    (

    SELECT ID, val1, Q1 = NULL, val2

    FROM #t

    WHERE val1 IS NULL

    UNION ALL

    SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2

    FROM #t

    WHERE val1 IS NOT NULL

    ),

    cteQ2 AS

    (

    SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)

    FROM cteQ1

    WHERE val2 IS NOT NULL

    )

    SELECT

    cteQ1.ID,

    cteQ1.val1,

    cteQ1.Q1,

    cteQ2.val2,

    cteQ2.Q2

    FROM cteQ1

    LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID;

    -- How we'd do it using dbo.NTally

    PRINT CHAR(10)+'NTally';

    WITH

    anchor AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2

    FROM #t

    WHERE val1 IS NOT NULL

    ),

    cteQ1 AS

    (

    SELECT ID, val1, Q1 = NULL, val2

    FROM #t

    WHERE val1 IS NULL

    UNION ALL

    SELECT ID, val1, Q1 = nt.tile, val2

    FROM anchor a

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

    WHERE a.rn = nt.rn

    ),

    cteQ2anchor AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val2), ID, val2

    FROM cteQ1

    WHERE val2 IS NOT NULL

    ),

    cteQ2anchorCount AS (SELECT Q2ac = COUNT(*) FROM cteQ2anchor),

    cteQ2 AS

    (

    SELECT ID, val2, Q2 = nt.tile

    FROM cteQ2anchor Q2a

    CROSS APPLY cteQ2anchorCount Q2ac

    CROSS APPLY dbo.NTally(4, Q2ac.Q2ac) nt

    WHERE Q2a.rn = nt.rn

    )

    SELECT

    cteQ1.ID,

    cteQ1.val1,

    cteQ1.Q1,

    cteQ2.val2,

    cteQ2.Q2

    FROM cteQ1

    LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID;

    So, instead, here's how I'd do it for maximum performance. First, I'd create this index:

    CREATE NONCLUSTERED INDEX nc_t_val1_temp2 ON #t(val1) INCLUDE (ID, val2, val3, val4); then do this:

    The Final Solution using dbo.NTally

    IF OBJECT_ID('tempdb..#FR') IS NOT NULL DROP TABLE #FR;

    CREATE TABLE #FR

    (

    ID int NOT NULL,

    val1 int NULL,

    Q1 int NULL,

    val2 int NULL,

    Q2 int NULL,

    val3 int NULL,

    Q3 int NULL,

    val4 int NULL,

    Q4 int NULL,

    );

    GO

    WITH

    anchor AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2, val3, val4

    FROM #t

    WHERE val1 IS NOT NULL

    )

    INSERT INTO #FR

    SELECT ID, val1, Q1 = NULL, val2, Q2 = NULL, val3, Q3 = NULL, val4, Q4 = NULL

    FROM #t

    WHERE val1 IS NULL

    UNION ALL

    SELECT ID, val1, Q1 = nt.tile, val2, NULL, val3, NULL, val4, NULL

    FROM anchor a

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

    WHERE a.rn = nt.rn;

    CREATE CLUSTERED INDEX pk_FR ON #FR(ID);

    CREATE NONCLUSTERED INDEX nc__V2_ID ON #FR(val2) INCLUDE (ID,Q2);

    CREATE NONCLUSTERED INDEX nc__V3_ID ON #FR(val3) INCLUDE (ID,Q3);

    CREATE NONCLUSTERED INDEX nc__V4_ID ON #FR(val4) INCLUDE (ID,Q4);

    -- Q2

    WITH cteQ2 AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY Val2), ID, Val2, Q2

    FROM #FR

    WHERE val2 IS NOT NULL

    )

    UPDATE cteQ2

    SET Q2 = nt.tile

    FROM cteQ2

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ2)) nt

    WHERE cteQ2.rn = nt.rn;

    -- Q3

    WITH cteQ3 AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val3), ID, val3, Q3

    FROM #FR

    WHERE val3 IS NOT NULL

    )

    UPDATE cteQ3

    SET Q3 = nt.tile

    FROM cteQ3

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ3)) nt

    WHERE cteQ3.rn = nt.rn;

    -- Q4

    WITH cteQ4 AS

    (

    SELECT rn = ROW_NUMBER() OVER (ORDER BY val4), ID, val4, Q4

    FROM #FR

    WHERE val4 IS NOT NULL

    )

    UPDATE cteQ4

    SET Q4 = nt.tile

    FROM cteQ4

    CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ4)) nt

    WHERE cteQ4.rn = nt.rn;

    -- The final results:

    SELECT * FROM #FR;

    This is obviously much more code but, for 92 million rows, we're talking about several 1,000 reads vs several billion reads. Hope that answers your question. :hehe:

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • EdgeOfKnowWare (2/25/2016)


    +1

    Cracked me up!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Seriously though, thanks. That's a lot to pour over. I will definitely test this out and see if it's something that we can use.

  • Alan.B (2/25/2016)


    The NTally Inline Table Valued Function

    N-I-I-I-I-C-E! Great documentation, too!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/13/2016)


    Alan.B (2/25/2016)


    The NTally Inline Table Valued Function

    N-I-I-I-I-C-E! Great documentation, too!

    Thank you, thank you! I thought you might like that ;-).

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 8 posts - 16 through 22 (of 22 total)

You must be logged in to reply to this topic. Login to reply