Odd (to me) Behavior with NTILE() and NULL Values in Source Data

Question

Odd (to me) Behavior with NTILE() and NULL Values in Source Data

Rich Mechaber

SSChampion

Points: 11130
More actions
October 25, 2013 at 4:26 pm

#281709

I just discovered the hard way (on 2005, replicated on 2008R2) that, if you compute NTILE() on a column of data in which some of the values are NULL, all of the NULL inputs will be assigned to a tile. This makes no sense to me: NULL means unknown, so I was expecting SQL Server to skip those rows and compute NTILE() only for non-NULL rows and assign those tiles based solely on the non-NULL values. Nope!
What this means in practice: NULLs apparently get assigned to the lowest tile, which pushes all of your non-NULL, real data up in ranking and will almost certainly assign some of them to a higher tile than you might expect.
Here's code to reproduce the behavior:
--NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
--displacing other values to higher tiles.
IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL
);
--INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT 5;
INSERT INTO #t SELECT 7;
INSERT INTO #t SELECT 8;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 11;
INSERT INTO #t SELECT 12;
--Compute quartiles.
--With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
--Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
--The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:
SELECT *,
NTILE(4) OVER (ORDER BY val) As Q
FROM #t;
and here's what the query returns:
[font="Courier New"]ID val Q
---------
1 NULL 1
2 NULL 1
3 NULL 1
4 5 2
5 7 2
6 8 2
7 9 3
8 9 3
9 11 4
10 12 4[/font]
I see no mention of this on BOL.
Is this common knowledge and I'm just slow? I could not find any discussion of this anywhere.
Thanks if anyone can explain to me why this works this way,
Rich

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply

Dwain Camps SSC Guru Points: 86908 More actions · Answer 1

Remarks from BOL:

If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.

So it seems that only the total number of rows is important not the total number of non-NULL rows.

I agree that the behavior seems unusual but I don't see it being in conflict with the remarks.

Note that ROW_NUMBER() works the same way.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Rich Mechaber SSChampion Points: 11130 More actions · Answer 2

Thanks Dwain.c.

No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.

How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.

Thanks,

Rich

Dwain Camps SSC Guru Points: 86908 More actions · Answer 3

rmechaber (10/28/2013)
Thanks Dwain.c.
No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.
How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.
Thanks,
Rich

I did say that "I agree that the behavior seems unusual" and I'll go further to say that your contention about the statistical significance of the NULLs is correct. You do always have the option to eliminate the NULLs with a WHERE clause, but obviously that doesn't change the counter-intuitive nature of Microsoft's implementation decision.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

John_Hill Ten Centuries Points: 1173 More actions · Answer 4

Maybe a work around would be something like this:

SELECT *,

CASE WHEN val IS NULL THEN 0 ELSE 1 END,

NTILE(4) OVER (PARTITION BY CASE WHEN val IS NULL THEN 0 ELSE 1 END ORDER BY val) As Q

FROM #t;

Only use the rank when CASE WHEN val IS NULL THEN 0 ELSE 1 END = 1

Alan Burstein SSC Guru Points: 61136 More actions · Answer 5

EdgeOfKnowWare (2/22/2016)
Maybe a work around would be something like this:
SELECT *,
CASE WHEN val IS NULL THEN 0 ELSE 1 END,
NTILE(4) OVER (PARTITION BY CASE WHEN val IS NULL THEN 0 ELSE 1 END ORDER BY val) As Q
FROM #t;
Only use the rank when CASE WHEN val IS NULL THEN 0 ELSE 1 END = 1

I suspect that you were trying to do something like this:

SELECT *,

CASE WHEN val IS NULL THEN 1 ELSE NTILE(4) OVER (ORDER BY val) END-1 As Q

FROM #t;

... which will assign a 0 to NULL values and apply NTILE(4) to the non-null values.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

John_Hill Ten Centuries Points: 1173 More actions · Answer 6

No, that's not quite what I was trying to do.

The problem I had (which lead me to this post via Google) was that I had several different rankings dependent upon several different columns. I didn't want to have to read through the table each time for each column, but I was having the same problem the original poster had. When I was testing my solution, I tried the code you supplied (great idea, by the way), but it didn't give me a full ranking. In the test data that I was using, I wanted ranks from 1-10 for values that IS NOT NULL, but when I used that particular snippet of code, it gave me 1-5 for the columns that had data and no ranking for NULL values. Not quite what I was after.

So, I had to partition the NTILE, one partition for NULL values, and one partition for NOT NULL values. Then, it my select statement for insert, I use the CASE statement. If it is NULL, then don't enter a ranking. If it is NOT NULL, enter a ranking.

Make sense?

Alan Burstein SSC Guru Points: 61136 More actions · Answer 7

EdgeOfKnowWare (2/22/2016)
No, that's not quite what I was trying to do.
The problem I had (which lead me to this post via Google) was that I had several different rankings dependent upon several different columns. I didn't want to have to read through the table each time for each column, but I was having the same problem the original poster had. When I was testing my solution, I tried the code you supplied (great idea, by the way), but it didn't give me a full ranking. In the test data that I was using, I wanted ranks from 1-10 for values that IS NOT NULL, but when I used that particular snippet of code, it gave me 1-5 for the columns that had data and no ranking for NULL values. Not quite what I was after.
So, I had to partition the NTILE, one partition for NULL values, and one partition for NOT NULL values. Then, it my select statement for insert, I use the CASE statement. If it is NULL, then don't enter a ranking. If it is NOT NULL, enter a ranking.
Make sense?

Yep that makes sense and what you posted is a good solution. Another way you can do what you are trying to do is like this:

SELECT *, 0,

NTILE(4) OVER (ORDER BY (SELECT NULL)) As Q -- using "ORDER BY (SELECT NULL)" eliminates the sort here

FROM #t

WHERE val IS NULL

UNION ALL

SELECT *, 1,

NTILE(4) OVER (ORDER BY val) As Q -- the POC index will handle sorting here

FROM #t

WHERE val IS NOT NULL;

It's a little more code but will get you the same results as your query but without the need to sort the NULL values (No reason to sort a bunch of NULLs).

If you add a POC index (see this Itzek Ben Gan article for more info about POC indexes), run what I posted along with your solution, you'll see that my solution is able to utilize the POC index and eliminate all the sorts from the query plan. Partitioning by a case statement, on the other hand, will almost certainly guarantee a sort operator which you don't want. Here's all the code with comments:

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;

CREATE TABLE #t

(

ID INT IDENTITY(1,1) PRIMARY KEY,

val INT NULL

);

INSERT INTO #t SELECT NULL;

INSERT INTO #t SELECT 5;

INSERT INTO #t SELECT 7;

INSERT INTO #t SELECT 8;

INSERT INTO #t SELECT 9;

INSERT INTO #t SELECT 11;

INSERT INTO #t SELECT 12;

-- The P.O.C. Index

CREATE NONCLUSTERED INDEX nc_t_temp ON #t(val) INCLUDE (ID);

GO

-- Run both queries with "Show Actual Execution plan turned on"

-- Original Solution

SELECT *,

CASE WHEN val IS NULL THEN 0 ELSE 1 END,

NTILE(4) OVER (PARTITION BY CASE WHEN val IS NULL THEN 0 ELSE 1 END ORDER BY val) As Q

FROM #t;

-- Alternate solution which will not require any sorting

SELECT *, 0,

NTILE(4) OVER (ORDER BY (SELECT NULL)) As Q -- using "ORDER BY (SELECT NULL)" eliminates the sort here

FROM #t

WHERE val IS NULL

UNION ALL

SELECT *, 1,

NTILE(4) OVER (ORDER BY val) As Q -- the POC index will handle sorting here

FROM #t

WHERE val IS NOT NULL;

Just a little food for thought. 😉

Edit.... Further simplified my solution by replacing "CASE WHEN val IS NULL THEN 0 ELSE 1 END" with a static value.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

John_Hill Ten Centuries Points: 1173 More actions · Answer 8

Alan, I agree with you, if you're just wanting one ranking. If we expand upon the Original Posters issue (which is what I was facing) and say that you want to different groupings, how would you handle that?

So, let's take the example and add a column.

--NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,

--displacing other values to higher tiles.

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;

CREATE TABLE #t

(

ID INT IDENTITY(1,1) PRIMARY KEY,

val INT NULL,

val2 INT NULL

);

--INSERT 10 values: 3 NULL, 7 integers

INSERT INTO #t SELECT NULL,1;

INSERT INTO #t SELECT NULL,2;

INSERT INTO #t SELECT NULL,3;

INSERT INTO #t SELECT 5,4;

INSERT INTO #t SELECT 7,5;

INSERT INTO #t SELECT 8,6;

INSERT INTO #t SELECT 9,NULL;

INSERT INTO #t SELECT 11,NULL;

INSERT INTO #t SELECT 12,NULL;

--Compute quartiles.

--With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).

--Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.

--The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:

Suppose you want the results of this:

IDvalQval2Q2

1NULLNULL11

2NULLNULL21

3NULLNULL32

45142

57153

68264

792NULLNULL

893NULLNULL

9113NULLNULL

10124NULLNULL

How would you write the query then? That's the direction I was going.

Alan Burstein SSC Guru Points: 61136 More actions · Answer 9

EdgeOfKnowWare (2/22/2016)
Alan, I agree with you, if you're just wanting one ranking. If we expand upon the Original Posters issue (which is what I was facing) and say that you want to different groupings, how would you handle that?
So, let's take the example and add a column.
--NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
--displacing other values to higher tiles.
IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL,
val2 INT NULL
);
--INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL,1;
INSERT INTO #t SELECT NULL,2;
INSERT INTO #t SELECT NULL,3;
INSERT INTO #t SELECT 5,4;
INSERT INTO #t SELECT 7,5;
INSERT INTO #t SELECT 8,6;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 11,NULL;
INSERT INTO #t SELECT 12,NULL;
--Compute quartiles.
--With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
--Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
--The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:
Suppose you want the results of this:
IDvalQval2Q2
1NULLNULL11
2NULLNULL21
3NULLNULL32
45142
57153
68264
792NULLNULL
893NULLNULL
9113NULLNULL
10124NULLNULL
How would you write the query then? That's the direction I was going.

For something like that I would do this:

WITH

Q1 AS

(

SELECT

ID,

val,

Q = NULL,

val2

FROM #t

WHERE val IS NULL

UNION ALL

SELECT

ID,

val,

Q = NTILE(4) OVER (ORDER BY val),

val2

FROM #t

WHERE val IS NOT NULL

),

Q2 AS

(

SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)

FROM Q1

WHERE val2 IS NOT NULL

)

SELECT

Q1.ID,

val,

Q,

Q1.val2,

Q2

FROM Q1

LEFT JOIN Q2 ON Q1.ID = Q2.ID;

If you went that route you would really want this index:

CREATE NONCLUSTERED INDEX nc_t_val1_temp ON #t(val, val2) INCLUDE (ID);

and would be able to further improve performance using this index:

CREATE NONCLUSTERED INDEX nc_t_val2_temp ON #t(val2) INCLUDE (ID, val);

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

John_Hill Ten Centuries Points: 1173 More actions · Answer 10

So, to dig even further...

If your table had 92 million records, and you had 9 different "Val" columns, would you still use the same code? Or do it differently?

I'm wondering, would the cost of the sort operator for one read of the table be lower than the cost of reading the table or index 9 (or more) times due to the union clause?

I'm not trying to play "gotcha", just curious how you would do it, and why you would pick that route.

Alan Burstein SSC Guru Points: 61136 More actions · Answer 11

So, to dig even further...

SSC is the right place 😉

If your table had 92 million records, and you had 9 different "Val" columns, would you still use the same code? Or do it differently?

First, so that we're on the same page, here's what my solution would look like for 4 different "Val" columns (so you can see the pattern):

SET NOCOUNT ON;

USE tempdb

GO

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;

CREATE TABLE #t

(

ID INT IDENTITY(1,1) PRIMARY KEY,

val1 INT NULL,

val2 INT NULL,

val3 INT NULL,

val4 INT NULL

);

--INSERT 10 values: 3 NULL, 7 integers

INSERT INTO #t SELECT NULL,1,5,25 ;

INSERT INTO #t SELECT NULL,2,8,35;

INSERT INTO #t SELECT NULL,3,NULL,36;

INSERT INTO #t SELECT 5,4,NULL,38;

INSERT INTO #t SELECT 7,5,11,50;

INSERT INTO #t SELECT 8,6,12,66;

INSERT INTO #t SELECT 9,NULL,15,12;

INSERT INTO #t SELECT 9,NULL,NULL,13;

INSERT INTO #t SELECT 11,NULL,NULL,NULL;

INSERT INTO #t SELECT 12,NULL,NULL,1001;

GO

--Compute quartiles.

--With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).

--Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.

--The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:

-- CRITICAL INDEX:

CREATE NONCLUSTERED INDEX nc_t_val1_temp ON #t(val1, val2, val3, val4) INCLUDE (ID);

-- IMPORTANT INDEXES

CREATE NONCLUSTERED INDEX nc_t_val2_temp ON #t(val2) INCLUDE (ID, val1);

CREATE NONCLUSTERED INDEX nc_t_val3_temp ON #t(val3) INCLUDE (ID, val1);

CREATE NONCLUSTERED INDEX nc_t_val4_temp ON #t(val4) INCLUDE (ID, val1);

GO

WITH

cteQ1 AS

(

SELECT ID, val1, Q1 = NULL, val2, val3, val4

FROM #t

WHERE val1 IS NULL

UNION ALL

SELECT ID, val1, Q1 = NTILE(4) OVER (ORDER BY val1), val2, val3, val4

FROM #t

WHERE val1 IS NOT NULL

),

cteQ2 AS

(

SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)

FROM cteQ1

WHERE val2 IS NOT NULL

),

cteQ3 AS

(

SELECT ID, val3, Q3 = NTILE(4) OVER (ORDER BY val3)

FROM cteQ1

WHERE val3 IS NOT NULL

),

cteQ4 AS

(

SELECT ID, val4, Q4 = NTILE(4) OVER (ORDER BY val4)

FROM cteQ1

WHERE val4 IS NOT NULL

)

SELECT

cteQ1.ID,

cteQ1.val1,

cteQ1.Q1,

cteQ2.val2,

cteQ2.Q2,

cteQ3.val3,

cteQ3.Q3,

cteQ4.val4,

cteQ4.Q4

FROM cteQ1

LEFT JOIN cteQ2 ON cteQ1.ID = cteQ2.ID

LEFT JOIN cteQ3 ON cteQ1.ID = cteQ3.ID

LEFT JOIN cteQ4 ON cteQ1.ID = cteQ4.ID;

With that in mind, for 92 million rows I would most certainly do it differently. Of the 4 "ranking" functions (ROW_NUMBER(), RANK, DENSE_RANK and NTILE) NTILE is irrefutably the poorest performing. You can add a P.O.C. index to prevent any of these functions from requiring a sort operator but, of the four "ranking" functions, only NTILE requires three table spool (lazy spool) operators out of the gate. Table Spool (lazy spool) operators are expensive and should be avoided whenever possible; with NTILE, however, you get 3 no matter what! Note the code below:

-- RUN THIS WITH "SHOW ACTUAL EXECUTION PLAN" TURNED ON

SET STATISTICS IO ON;

PRINT 'ROW_NUMBER:';

SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);

PRINT CHAR(10)+'RANK:';

SELECT RANK() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);

PRINT CHAR(10)+'DENSE_RANK:';

SELECT DENSE_RANK() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);

PRINT CHAR(10)+'NTILE:';

SELECT NTILE(1) OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);

SET STATISTICS IO OFF;

All 4 functions are processing one row of data. ROW_NUMBER, RANK & DENSE_RANK produce efficient plans and complete without any table scans or reads, NTILE requires 3 table scans (against tables it creates in the tempdb) and generates 7 logical reads. For 92 million rows? Forget about it!

I'm wondering, would the cost of the sort operator for one read of the table be lower than the cost of reading the table or index 9 (or more) times due to the union clause?

UNION ALL (they way I'm using it) is inexpensive; for val 1 we're making two trips to the index but we're getting 30% of the rows we need on the first trip and 70% on the second trip. The rows are merged together using the concatenation operator which is very efficient.

A sort operator on the other hand is very expensive; it has an N * LOG(N) complexity which means each row gets more expensive the more rows are sorted. E.G. 10 times as many row means 13 times as much work. Nine "val" columns across 92 million rows would mean nine big-ol, unavoidable (using the CASE statement) sorts and roughly 500 gazillion reads.

I'm not trying to play "gotcha", just curious how you would do it, and why you would pick that route.

I have a much better solution but will need a little time to put it together and explain later today (late afternoon Chicago time).

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Rich Mechaber SSChampion Points: 11130 More actions · Answer 12

Alan.B (2/23/2016)
With that in mind, for 92 million rows I would most certainly do it differently. Of the 4 "ranking" functions (ROW_NUMBER(), RANK, DENSE_RANK and NTILE) NTILE is irrefutably the poorest performing. You can add a P.O.C. index to prevent any of these functions from requiring a sort operator but, of the four "ranking" functions, only NTILE requires three table spool (lazy spool) operators out of the gate. Table Spool (lazy spool) operators are expensive and should be avoided whenever possible; with NTILE, however, you get 3 no matter what! Note the code below:
-- RUN THIS WITH "SHOW ACTUAL EXECUTION PLAN" TURNED ON
SET STATISTICS IO ON;
PRINT 'ROW_NUMBER:';
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);
PRINT CHAR(10)+'RANK:';
SELECT RANK() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);
PRINT CHAR(10)+'DENSE_RANK:';
SELECT DENSE_RANK() OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);
PRINT CHAR(10)+'NTILE:';
SELECT NTILE(1) OVER (ORDER BY (SELECT NULL)) FROM (VALUES (1)) t(c);
SET STATISTICS IO OFF;
All 4 functions are processing one row of data. ROW_NUMBER, RANK & DENSE_RANK produce efficient plans and complete without any table scans or reads, NTILE requires 3 table scans (against tables it creates in the tempdb) and generates 7 logical reads. For 92 million rows? Forget about it!

Hi, Alan, o/p here, and I've enjoyed the follow-on conversation I started.

This aspect of NTILE() is an eye-opening piece of information, thank you for posting it. The data sets against which I was originally running NTILE() were (and always will be) small, so I never gave any thought to the efficiency of the MS function. Certainly something to consider in the future.

Thanks very much for posting it,

Rich

John_Hill Ten Centuries Points: 1173 More actions · Answer 13

Rich Mechaber (2/23/2016)
Hi, Alan, o/p here, and I've enjoyed the follow-on conversation I started.
This aspect of NTILE() is an eye-opening piece of information, thank you for posting it. The data sets against which I was originally running NTILE() were (and always will be) small, so I never gave any thought to the efficiency of the MS function. Certainly something to consider in the future.
Thanks very much for posting it,
Rich

Sorry to hijack your thread, Rich. I thought since my issue was really just an expansion of your issue, I would add to it instead of starting a new thread.

Alan Burstein SSC Guru Points: 61136 More actions · Answer 14

EdgeOfKnowWare (2/23/2016)
Rich Mechaber (2/23/2016)
Hi, Alan, o/p here, and I've enjoyed the follow-on conversation I started.
This aspect of NTILE() is an eye-opening piece of information, thank you for posting it. The data sets against which I was originally running NTILE() were (and always will be) small, so I never gave any thought to the efficiency of the MS function. Certainly something to consider in the future.
Thanks very much for posting it,
Rich
Sorry to hijack your thread, Rich. I thought since my issue was really just an expansion of your issue, I would add to it instead of starting a new thread.

I would not sweat it - these threads sometimes have a life of their own.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001