Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 123»»»

Odd (to me) Behavior with NTILE() and NULL Values in Source Data Expand / Collapse
Author
Message
Posted Friday, October 25, 2013 4:26 PM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Wednesday, June 22, 2016 9:02 AM
Points: 996, Visits: 3,617
I just discovered the hard way (on 2005, replicated on 2008R2) that, if you compute NTILE() on a column of data in which some of the values are NULL, all of the NULL inputs will be assigned to a tile. This makes no sense to me: NULL means unknown, so I was expecting SQL Server to skip those rows and compute NTILE() only for non-NULL rows and assign those tiles based solely on the non-NULL values. Nope!

What this means in practice: NULLs apparently get assigned to the lowest tile, which pushes all of your non-NULL, real data up in ranking and will almost certainly assign some of them to a higher tile than you might expect.

Here's code to reproduce the behavior:
--	NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
-- displacing other values to higher tiles.

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL
);

-- INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT 5;
INSERT INTO #t SELECT 7;
INSERT INTO #t SELECT 8;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 11;
INSERT INTO #t SELECT 12;

-- Compute quartiles.
-- With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
-- Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
-- The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:
SELECT *,
NTILE(4) OVER (ORDER BY val) As Q
FROM #t;


and here's what the query returns:
ID val Q
---------
1 NULL 1
2 NULL 1
3 NULL 1
4 5 2
5 7 2
6 8 2
7 9 3
8 9 3
9 11 4
10 12 4


I see no mention of this on BOL.

Is this common knowledge and I'm just slow? I could not find any discussion of this anywhere.

Thanks if anyone can explain to me why this works this way,
Rich
Post #1508622
Posted Sunday, October 27, 2013 6:47 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Wednesday, February 24, 2016 6:28 AM
Points: 3,977, Visits: 6,431
Remarks from BOL:

If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.

So it seems that only the total number of rows is important not the total number of non-NULL rows.

I agree that the behavior seems unusual but I don't see it being in conflict with the remarks.

Note that ROW_NUMBER() works the same way.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
Post #1508724
Posted Monday, October 28, 2013 6:50 AM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Wednesday, June 22, 2016 9:02 AM
Points: 996, Visits: 3,617
Thanks Dwain.c.

No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.

How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.

Thanks,
Rich
Post #1508847
Posted Monday, October 28, 2013 6:14 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Wednesday, February 24, 2016 6:28 AM
Points: 3,977, Visits: 6,431
rmechaber (10/28/2013)
Thanks Dwain.c.

No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.

How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.

Thanks,
Rich


I did say that "I agree that the behavior seems unusual" and I'll go further to say that your contention about the statistical significance of the NULLs is correct. You do always have the option to eliminate the NULLs with a WHERE clause, but obviously that doesn't change the counter-intuitive nature of Microsoft's implementation decision.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
Post #1509154
Posted Monday, February 22, 2016 9:36 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, May 9, 2016 9:28 AM
Points: 34, Visits: 283
Maybe a work around would be something like this:

	SELECT *,
CASE WHEN val IS NULL THEN 0 ELSE 1 END,
NTILE(4) OVER (PARTITION BY CASE WHEN val IS NULL THEN 0 ELSE 1 END ORDER BY val) As Q
FROM #t;

Only use the rank when CASE WHEN val IS NULL THEN 0 ELSE 1 END = 1

Post #1763146
Posted Monday, February 22, 2016 10:28 AM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 10:35 AM
Points: 1,871, Visits: 6,277
EdgeOfKnowWare (2/22/2016)
Maybe a work around would be something like this:

	SELECT *,
CASE WHEN val IS NULL THEN 0 ELSE 1 END,
NTILE(4) OVER (PARTITION BY CASE WHEN val IS NULL THEN 0 ELSE 1 END ORDER BY val) As Q
FROM #t;

Only use the rank when CASE WHEN val IS NULL THEN 0 ELSE 1 END = 1



I suspect that you were trying to do something like this:
SELECT *,
CASE WHEN val IS NULL THEN 1 ELSE NTILE(4) OVER (ORDER BY val) END-1 As Q
FROM #t;

... which will assign a 0 to NULL values and apply NTILE(4) to the non-null values.



-- Alan Burstein



Best practices for getting help on SQLServerCentral
Need to split a string? Try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL 2012+)
Need a pattern-based splitter? Try PatternSplitCM
Need to remove of replace those unwanted characters? Try PatExclude8K and PatReplace8K.

"I can't stress enough the importance of switching from a 'sequential files' mindset to 'set-based' thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. " -- Itzek Ben-Gan 2001
Post #1763163
Posted Monday, February 22, 2016 10:54 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, May 9, 2016 9:28 AM
Points: 34, Visits: 283
No, that's not quite what I was trying to do.

The problem I had (which lead me to this post via Google) was that I had several different rankings dependent upon several different columns. I didn't want to have to read through the table each time for each column, but I was having the same problem the original poster had. When I was testing my solution, I tried the code you supplied (great idea, by the way), but it didn't give me a full ranking. In the test data that I was using, I wanted ranks from 1-10 for values that IS NOT NULL, but when I used that particular snippet of code, it gave me 1-5 for the columns that had data and no ranking for NULL values. Not quite what I was after.

So, I had to partition the NTILE, one partition for NULL values, and one partition for NOT NULL values. Then, it my select statement for insert, I use the CASE statement. If it is NULL, then don't enter a ranking. If it is NOT NULL, enter a ranking.

Make sense?
Post #1763173
Posted Monday, February 22, 2016 12:14 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 10:35 AM
Points: 1,871, Visits: 6,277
EdgeOfKnowWare (2/22/2016)
No, that's not quite what I was trying to do.

The problem I had (which lead me to this post via Google) was that I had several different rankings dependent upon several different columns. I didn't want to have to read through the table each time for each column, but I was having the same problem the original poster had. When I was testing my solution, I tried the code you supplied (great idea, by the way), but it didn't give me a full ranking. In the test data that I was using, I wanted ranks from 1-10 for values that IS NOT NULL, but when I used that particular snippet of code, it gave me 1-5 for the columns that had data and no ranking for NULL values. Not quite what I was after.

So, I had to partition the NTILE, one partition for NULL values, and one partition for NOT NULL values. Then, it my select statement for insert, I use the CASE statement. If it is NULL, then don't enter a ranking. If it is NOT NULL, enter a ranking.

Make sense?


Yep that makes sense and what you posted is a good solution. Another way you can do what you are trying to do is like this:

SELECT *, 0,
NTILE(4) OVER (ORDER BY (SELECT NULL)) As Q -- using "ORDER BY (SELECT NULL)" eliminates the sort here
FROM #t
WHERE val IS NULL
UNION ALL
SELECT *, 1,
NTILE(4) OVER (ORDER BY val) As Q -- the POC index will handle sorting here
FROM #t
WHERE val IS NOT NULL;

It's a little more code but will get you the same results as your query but without the need to sort the NULL values (No reason to sort a bunch of NULLs).

If you add a POC index (see this Itzek Ben Gan article for more info about POC indexes), run what I posted along with your solution, you'll see that my solution is able to utilize the POC index and eliminate all the sorts from the query plan. Partitioning by a case statement, on the other hand, will almost certainly guarantee a sort operator which you don't want. Here's all the code with comments:

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL
);

INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT 5;
INSERT INTO #t SELECT 7;
INSERT INTO #t SELECT 8;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 11;
INSERT INTO #t SELECT 12;

-- The P.O.C. Index
CREATE NONCLUSTERED INDEX nc_t_temp ON #t(val) INCLUDE (ID);
GO

-- Run both queries with "Show Actual Execution plan turned on"

-- Original Solution
SELECT *,
CASE WHEN val IS NULL THEN 0 ELSE 1 END,
NTILE(4) OVER (PARTITION BY CASE WHEN val IS NULL THEN 0 ELSE 1 END ORDER BY val) As Q
FROM #t;

-- Alternate solution which will not require any sorting
SELECT *, 0,
NTILE(4) OVER (ORDER BY (SELECT NULL)) As Q -- using "ORDER BY (SELECT NULL)" eliminates the sort here
FROM #t
WHERE val IS NULL
UNION ALL
SELECT *, 1,
NTILE(4) OVER (ORDER BY val) As Q -- the POC index will handle sorting here
FROM #t
WHERE val IS NOT NULL;

Just a little food for thought.



Edit.... Further simplified my solution by replacing "CASE WHEN val IS NULL THEN 0 ELSE 1 END" with a static value.


-- Alan Burstein



Best practices for getting help on SQLServerCentral
Need to split a string? Try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL 2012+)
Need a pattern-based splitter? Try PatternSplitCM
Need to remove of replace those unwanted characters? Try PatExclude8K and PatReplace8K.

"I can't stress enough the importance of switching from a 'sequential files' mindset to 'set-based' thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. " -- Itzek Ben-Gan 2001
Post #1763197
Posted Monday, February 22, 2016 12:37 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, May 9, 2016 9:28 AM
Points: 34, Visits: 283
Alan, I agree with you, if you're just wanting one ranking. If we expand upon the Original Posters issue (which is what I was facing) and say that you want to different groupings, how would you handle that?

So, let's take the example and add a column.

--	NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
-- displacing other values to higher tiles.

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL,
val2 INT NULL
);

-- INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL,1;
INSERT INTO #t SELECT NULL,2;
INSERT INTO #t SELECT NULL,3;
INSERT INTO #t SELECT 5,4;
INSERT INTO #t SELECT 7,5;
INSERT INTO #t SELECT 8,6;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 11,NULL;
INSERT INTO #t SELECT 12,NULL;

-- Compute quartiles.
-- With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
-- Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
-- The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:


Suppose you want the results of this:

ID	val	Q	val2	Q2
1 NULL NULL 1 1
2 NULL NULL 2 1
3 NULL NULL 3 2
4 5 1 4 2
5 7 1 5 3
6 8 2 6 4
7 9 2 NULL NULL
8 9 3 NULL NULL
9 11 3 NULL NULL
10 12 4 NULL NULL


How would you write the query then? That's the direction I was going.
Post #1763202
Posted Monday, February 22, 2016 1:18 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 10:35 AM
Points: 1,871, Visits: 6,277
EdgeOfKnowWare (2/22/2016)
Alan, I agree with you, if you're just wanting one ranking. If we expand upon the Original Posters issue (which is what I was facing) and say that you want to different groupings, how would you handle that?

So, let's take the example and add a column.

--	NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
-- displacing other values to higher tiles.

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL,
val2 INT NULL
);

-- INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL,1;
INSERT INTO #t SELECT NULL,2;
INSERT INTO #t SELECT NULL,3;
INSERT INTO #t SELECT 5,4;
INSERT INTO #t SELECT 7,5;
INSERT INTO #t SELECT 8,6;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 11,NULL;
INSERT INTO #t SELECT 12,NULL;

-- Compute quartiles.
-- With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
-- Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
-- The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:


Suppose you want the results of this:

ID	val	Q	val2	Q2
1 NULL NULL 1 1
2 NULL NULL 2 1
3 NULL NULL 3 2
4 5 1 4 2
5 7 1 5 3
6 8 2 6 4
7 9 2 NULL NULL
8 9 3 NULL NULL
9 11 3 NULL NULL
10 12 4 NULL NULL


How would you write the query then? That's the direction I was going.


For something like that I would do this:

WITH 
Q1 AS
(
SELECT
ID,
val,
Q = NULL,
val2
FROM #t
WHERE val IS NULL
UNION ALL
SELECT
ID,
val,
Q = NTILE(4) OVER (ORDER BY val),
val2
FROM #t
WHERE val IS NOT NULL
),
Q2 AS
(
SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)
FROM Q1
WHERE val2 IS NOT NULL
)
SELECT
Q1.ID,
val,
Q,
Q1.val2,
Q2
FROM Q1
LEFT JOIN Q2 ON Q1.ID = Q2.ID;

If you went that route you would really want this index:

CREATE NONCLUSTERED INDEX nc_t_val1_temp ON #t(val, val2) INCLUDE (ID);

and would be able to further improve performance using this index:
CREATE NONCLUSTERED INDEX nc_t_val2_temp ON #t(val2) INCLUDE (ID, val);



-- Alan Burstein



Best practices for getting help on SQLServerCentral
Need to split a string? Try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL 2012+)
Need a pattern-based splitter? Try PatternSplitCM
Need to remove of replace those unwanted characters? Try PatExclude8K and PatReplace8K.

"I can't stress enough the importance of switching from a 'sequential files' mindset to 'set-based' thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. " -- Itzek Ben-Gan 2001
Post #1763210
« Prev Topic | Next Topic »

Add to briefcase 123»»»

Permissions Expand / Collapse