Odd (to me) Behavior with NTILE() and NULL Values in Source Data

Question

Odd (to me) Behavior with NTILE() and NULL Values in Source Data

Viewing 8 posts - 16 through 23 (of 23 total)

You must be logged in to reply to this topic. Login to reply

Rich Mechaber SSChampion Points: 11131 More actions · Answer 1

I think in this case you weren't hijacking at all (I believe you said you arrived here via Google for a similar issue), and I've appreciated the additional testing and discussion. Thanks for adding additional value to the thread,

Rich

Alan Burstein SSC Guru Points: 61152 More actions · Answer 2

Just a quick update:

I have a good solution that I'm still testing (note my earlier comment). It has taken me longer than expected. I'll try to post it in the morning.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Alan Burstein SSC Guru Points: 61152 More actions · Answer 3

Ok, here we go. This may be a bit much to follow but the key take away is that I'm showing you an excellent alternative to NTILE using a tally table[/url] and CROSS APPLY[/url].

Code for the tally table

Note that this solution uses a permanent tally table (a CTE tally or "getnumbers" function will result in much poorer performance.)

IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;

IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;

CREATE TABLE dbo.tally (N int NOT NULL);

INSERT dbo.tally

SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a, sys.all_columns b;

ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N) WITH FILLFACTOR=100;

ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);

The NTally Inline Table Valued Function

Play around with the example code inside the comments to better understand how you would use dbo.NTally to replace NTile.

IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;

GO

CREATE FUNCTION dbo.NTally(@tiles bigint, @rows bigint)

/****************************************************************************************

Purpose:

Returns a tally table with "tile groups" and can be used as an alternative to the T-SQL

NTILE function introduced in SQL Server 2005. See the usage examples below for more

details on how to use this function.

Compatibility:

SQL Server 2005+

Syntax:

SELECT rn, tile

FROM dbo.NTally(@tiles, @rows);

Parameters:

@tiles = bigint; requested number of tile groups (same as the parameter passed to NTILE)

@rows = bigint; the number of rows to be "tiled" (have group number assigned to it)

Return Types:

Inline Table Valued Function returns:

rn = bigint; a row number beginning with 1 and ending with @rows

tile = int; a "tile number" or group number the same

Developer Notes:

1. Requires a tally table named dbo.tally. Run the code below to create a correctly

indexed tally table (note the "Beginning" and "End" of tally code).

--===== Beginning of dbo.tally code

-- Drop if NTALLY function and tally table if they exist

IF OBJECT_ID('dbo.NTally') IS NOT NULL DROP FUNCTION dbo.NTally;

IF OBJECT_ID('dbo.tally') IS NOT NULL DROP TABLE dbo.tally;

-- Create the tally table

CREATE TABLE dbo.tally (N int NOT NULL);

-- Insert the numbers 1 through 1,000,0000 into dbo.tally

INSERT dbo.tally

SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a, sys.all_columns b;

-- Create required primary key, clustered index and unique nonclustered indexes

ALTER TABLE dbo.tally ADD CONSTRAINT pk_tally PRIMARY KEY CLUSTERED(N)

WITH FILLFACTOR=100;

-- Note that, when present, the optimizer will chose the unique nonclustered index

-- which gets the job done with fewer reads. This index is mandetory for dbo.NTally

-- to function optimally

ALTER TABLE dbo.tally ADD CONSTRAINT uq_tally UNIQUE NONCLUSTERED(N);

--===== End of tally table code

Note that, if you need to use a different schema and/or name for your tally table

make sure to change the tally table schema/name in the function to match. To learn

more about tally tables see http://www.sqlservercentral.com/articles/T-SQL/62867/

2. With R as the number of rows in your tally table the maximum number of rows this

function will create is (R*R) for each "tile" group per partition. R also represents

the maximum number of tile groups available. A one million row tally table will requires

roughly 20MB of uncompressed disk space and will support up to one million tile groups

with up to one trillion rows per tile group.

3. For best results make sure that a P.O.C. index is available for your query. For more

information about P.O.C. indexes see:

http://sqlmag.com/sql-server-2012/sql-server-2012-how-write-t-sql-window-functions-part-3

4. NTally is deterministic; for more about deterministic and nondeterministic functions

see https://msdn.microsoft.com/en-us/library/ms178091.aspx

Examples:

--===== 1. Demonstrating how the function mimics NTILE

-- To better understand NTally, run the code below by running with different values

-- assigned to @rows and @tiles.

DECLARE @rows bigint = 8, @tiles bigint = 3;

SELECT rn, tile, NTILE(@tiles) OVER (ORDER BY rn) as [NTILE]

FROM dbo.NTally(@tiles, @rows);

--===== 2. Using NTally as a faster alternative to NTILE (with no PARTITION BY clause)

-- Run the code below from <START> to <END>.

-- Note how you get the same result but how, the more rows you add, the more efficient

-- the NTALLY solution is, with respect to reads, when compared to NTILE:

-- e.g. NTILE against 100K rows = 200K+ reads, only 560+ reads for the the NTally method

-- <START>

-- Declare variables

DECLARE @rows bigint = 8, @tiles bigint = 5;

-- Setup sample data

DECLARE @SomeTable TABLE (SomeValue int primary key);

INSERT @SomeTable

SELECT TOP(@rows) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))*5

FROM sys.all_columns a, sys.all_columns b;

-- How to divide @some table into 3 tile groups using NTILE

SET STATISTICS IO ON;

PRINT 'NTILE version:';

SELECT SomeValue, NTILE(@tiles) OVER (ORDER BY SomeValue) AS TileGroup

FROM @SomeTable;

-- How to divide @SomeTable into 3 tile groups using NTally

PRINT CHAR(10)+'NTally version:';

WITH anchor AS

(

SELECT SomeValue, ROW_NUMBER() OVER (ORDER BY SomeValue) AS rn

FROM @SomeTable

)

SELECT SomeValue, nt.tile AS TileGroup

FROM anchor a

CROSS APPLY dbo.NTally(@tiles, (SELECT COUNT(*) FROM @SomeTable)) nt

WHERE a.rn = nt.rn;

SET STATISTICS IO OFF;

-- <END>

--===== 3. Using NTally an alternative to NTILE with a PARTITION BY clause

-- Create sample table with 10 rows and 3 partitions

IF OBJECT_ID('tempdb..#SomeTable') IS NOT NULL DROP TABLE #SomeTable;

CREATE TABLE #SomeTable

(

PartitionKey int NOT NULL,

SomeValue int NOT NULL,

CONSTRAINT pk_SomeTable PRIMARY KEY(PartitionKey,SomeValue)

);

INSERT #SomeTable

SELECT TOP (12)

ROW_NUMBER() OVER (ORDER BY (SELECT NULL))/5+1,

ROW_NUMBER() OVER (ORDER BY (SELECT NULL))*5

FROM sys.all_columns;

-- Using NTILE and PARTITION BY

SELECT

s.PartitionKey,

s.SomeValue,

NTILE(3) OVER (PARTITION BY s.PartitionKey ORDER BY s.SomeValue) AS TileGroup

FROM #SomeTable s;

-- Using the NTally function

WITH

anchor AS -- Use ROW_NUMBER for your partitioning and sorting

(

SELECT

rn = ROW_NUMBER() OVER (PARTITION BY PartitionKey ORDER BY SomeValue),

PartitionKey,

SomeValue

FROM #SomeTable v

),

parts AS -- collect the number of rows per partition

(

SELECT PartitionKey, mxrn = MAX(rn)

FROM anchor

GROUP BY PartitionKey

)

SELECT a.PartitionKey, a.SomeValue, nt.tile AS TileGroup

FROM parts p

CROSS APPLY dbo.NTally(3,mxrn) nt

CROSS APPLY anchor a

WHERE p.PartitionKey = a.PartitionKey AND a.rn = nt.rn;

DROP TABLE #SomeTable;

---------------------------------------------------------------------------------------

Revision History:

Rev 00 - 20140501 - Initial Creation - Alan Burstein

Rev 01 - 20160224 - Final touches and optimization including comments - Alan Burstein

****************************************************************************************/

RETURNS TABLE WITH SCHEMABINDING AS RETURN

WITH

calculate_tiles AS -- Calculate the number of tiles per tile group

(

SELECT t.N, tile = (@rows/@tiles) + CASE WHEN t.N <= (@rows%@tiles) THEN 1 ELSE 0 END

FROM dbo.tally t

WHERE t.N <= @tiles

),

assemble_tiles AS

(

SELECT tile = topn.N

FROM calculate_tiles ct

CROSS APPLY

(

SELECT TOP(ct.tile) N = ct.N

FROM dbo.tally t1 CROSS JOIN dbo.tally t2

) topn

)

SELECT TOP 100 PERCENT

rn = ROW_NUMBER() OVER (ORDER BY a.tile), -- Your anchor row

a.tile

FROM assemble_tiles a

ORDER BY a.tile; -- Your spoon

GO

A Quick premier on how to use dbo.NTally to replace NTILE in your requirement. Looking at my solution from earlier in this thread let's start with how we'd use dbo.NTally to calculate Q1 instead of NTILE.

-- what you have now

SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2

FROM #t

WHERE val1 IS NOT NULL;

-- new and improved code using my NTally function

WITH

anchor AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2

FROM #t

WHERE val1 IS NOT NULL

)

SELECT ID, val1, Q1 = nt.tile, val2

FROM anchor a

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

WHERE a.rn = nt.rn;

Yes, that's more code but, in a 1,000,000 row test the NTILE method generates nearly 3,000,0000 reads whereas the dbo.NTally method only generates fewer than 1,000 reads.

Next lets look at how we would improve the UNION ALL part of the query...

SELECT ID, val1, Q1 = NULL, val2

FROM #t

WHERE val1 IS NULL

UNION ALL

SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2

FROM #t

WHERE val1 IS NOT NULL;

-- new and improved code using my NTally function

WITH

anchor AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2

FROM #t

WHERE val1 IS NOT NULL

)

SELECT ID, val1, Q1 = NULL, val2

FROM #t

WHERE val1 IS NULL

UNION ALL

SELECT ID, val1, Q1 = nt.tile, val2

FROM anchor a

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

WHERE a.rn = nt.rn;

Looking at my previous solution it would get too ugly and complex to replace NTile with my dbo.NTally function all in one query. Here's what it would look like to include a Q2:

-- what you have now

PRINT 'NTILE';

WITH

cteQ1 AS

(

SELECT ID, val1, Q1 = NULL, val2

FROM #t

WHERE val1 IS NULL

UNION ALL

SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2

FROM #t

WHERE val1 IS NOT NULL

),

cteQ2 AS

(

SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)

FROM cteQ1

WHERE val2 IS NOT NULL

)

SELECT

cteQ1.ID,

cteQ1.val1,

cteQ1.Q1,

cteQ2.val2,

cteQ2.Q2

FROM cteQ1

LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID;

-- How we'd do it using dbo.NTally

PRINT CHAR(10)+'NTally';

WITH

anchor AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2

FROM #t

WHERE val1 IS NOT NULL

),

cteQ1 AS

(

SELECT ID, val1, Q1 = NULL, val2

FROM #t

WHERE val1 IS NULL

UNION ALL

SELECT ID, val1, Q1 = nt.tile, val2

FROM anchor a

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

WHERE a.rn = nt.rn

),

cteQ2anchor AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val2), ID, val2

FROM cteQ1

WHERE val2 IS NOT NULL

),

cteQ2anchorCount AS (SELECT Q2ac = COUNT(*) FROM cteQ2anchor),

cteQ2 AS

(

SELECT ID, val2, Q2 = nt.tile

FROM cteQ2anchor Q2a

CROSS APPLY cteQ2anchorCount Q2ac

CROSS APPLY dbo.NTally(4, Q2ac.Q2ac) nt

WHERE Q2a.rn = nt.rn

)

SELECT

cteQ1.ID,

cteQ1.val1,

cteQ1.Q1,

cteQ2.val2,

cteQ2.Q2

FROM cteQ1

LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID;

So, instead, here's how I'd do it for maximum performance. First, I'd create this index:

CREATE NONCLUSTERED INDEX nc_t_val1_temp2 ON #t(val1) INCLUDE (ID, val2, val3, val4); then do this:

The Final Solution using dbo.NTally

IF OBJECT_ID('tempdb..#FR') IS NOT NULL DROP TABLE #FR;

CREATE TABLE #FR

(

ID int NOT NULL,

val1 int NULL,

Q1 int NULL,

val2 int NULL,

Q2 int NULL,

val3 int NULL,

Q3 int NULL,

val4 int NULL,

Q4 int NULL,

);

GO

WITH

anchor AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val1), ID, val1, val2, val3, val4

FROM #t

WHERE val1 IS NOT NULL

)

INSERT INTO #FR

SELECT ID, val1, Q1 = NULL, val2, Q2 = NULL, val3, Q3 = NULL, val4, Q4 = NULL

FROM #t

WHERE val1 IS NULL

UNION ALL

SELECT ID, val1, Q1 = nt.tile, val2, NULL, val3, NULL, val4, NULL

FROM anchor a

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM anchor)) nt

WHERE a.rn = nt.rn;

CREATE CLUSTERED INDEX pk_FR ON #FR(ID);

CREATE NONCLUSTERED INDEX nc__V2_ID ON #FR(val2) INCLUDE (ID,Q2);

CREATE NONCLUSTERED INDEX nc__V3_ID ON #FR(val3) INCLUDE (ID,Q3);

CREATE NONCLUSTERED INDEX nc__V4_ID ON #FR(val4) INCLUDE (ID,Q4);

-- Q2

WITH cteQ2 AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY Val2), ID, Val2, Q2

FROM #FR

WHERE val2 IS NOT NULL

)

UPDATE cteQ2

SET Q2 = nt.tile

FROM cteQ2

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ2)) nt

WHERE cteQ2.rn = nt.rn;

-- Q3

WITH cteQ3 AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val3), ID, val3, Q3

FROM #FR

WHERE val3 IS NOT NULL

)

UPDATE cteQ3

SET Q3 = nt.tile

FROM cteQ3

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ3)) nt

WHERE cteQ3.rn = nt.rn;

-- Q4

WITH cteQ4 AS

(

SELECT rn = ROW_NUMBER() OVER (ORDER BY val4), ID, val4, Q4

FROM #FR

WHERE val4 IS NOT NULL

)

UPDATE cteQ4

SET Q4 = nt.tile

FROM cteQ4

CROSS APPLY dbo.NTally(4, (SELECT COUNT(*) FROM cteQ4)) nt

WHERE cteQ4.rn = nt.rn;

-- The final results:

SELECT * FROM #FR;

This is obviously much more code but, for 92 million rows, we're talking about several 1,000 reads vs several billion reads. Hope that answers your question. :hehe:

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

John_Hill Ten Centuries Points: 1178 More actions · Answer 4

John_Hill

Ten Centuries

Points: 1178

February 25, 2016 at 11:26 am

#1861192

Alan Burstein SSC Guru Points: 61152 More actions · Answer 5

EdgeOfKnowWare (2/25/2016)

+1

Cracked me up!

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

John_Hill Ten Centuries Points: 1178 More actions · Answer 6

Seriously though, thanks. That's a lot to pour over. I will definitely test this out and see if it's something that we can use.

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 7

Alan.B (2/25/2016)
The NTally Inline Table Valued Function

N-I-I-I-I-C-E! Great documentation, too!

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Alan Burstein SSC Guru Points: 61152 More actions · Answer 8

Jeff Moden (7/13/2016)
Alan.B (2/25/2016)
The NTally Inline Table Valued Function
N-I-I-I-I-C-E! Great documentation, too!

Thank you, thank you! I thought you might like that ;-).

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001