Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Odd (to me) Behavior with NTILE() and NULL Values in Source Data Expand / Collapse
Author
Message
Posted Friday, October 25, 2013 4:26 PM


Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Wednesday, November 26, 2014 9:31 AM
Points: 717, Visits: 3,037
I just discovered the hard way (on 2005, replicated on 2008R2) that, if you compute NTILE() on a column of data in which some of the values are NULL, all of the NULL inputs will be assigned to a tile. This makes no sense to me: NULL means unknown, so I was expecting SQL Server to skip those rows and compute NTILE() only for non-NULL rows and assign those tiles based solely on the non-NULL values. Nope!

What this means in practice: NULLs apparently get assigned to the lowest tile, which pushes all of your non-NULL, real data up in ranking and will almost certainly assign some of them to a higher tile than you might expect.

Here's code to reproduce the behavior:
--	NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
-- displacing other values to higher tiles.

IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL
);

-- INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT NULL;
INSERT INTO #t SELECT 5;
INSERT INTO #t SELECT 7;
INSERT INTO #t SELECT 8;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 9;
INSERT INTO #t SELECT 11;
INSERT INTO #t SELECT 12;

-- Compute quartiles.
-- With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
-- Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
-- The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:
SELECT *,
NTILE(4) OVER (ORDER BY val) As Q
FROM #t;


and here's what the query returns:
ID val Q
---------
1 NULL 1
2 NULL 1
3 NULL 1
4 5 2
5 7 2
6 8 2
7 9 3
8 9 3
9 11 4
10 12 4


I see no mention of this on BOL.

Is this common knowledge and I'm just slow? I could not find any discussion of this anywhere.

Thanks if anyone can explain to me why this works this way,
Rich
Post #1508622
Posted Sunday, October 27, 2013 6:47 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: 2 days ago @ 9:53 PM
Points: 3,438, Visits: 5,390
Remarks from BOL:

If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.

So it seems that only the total number of rows is important not the total number of non-NULL rows.

I agree that the behavior seems unusual but I don't see it being in conflict with the remarks.

Note that ROW_NUMBER() works the same way.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1508724
Posted Monday, October 28, 2013 6:50 AM


Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Wednesday, November 26, 2014 9:31 AM
Points: 717, Visits: 3,037
Thanks Dwain.c.

No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.

How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.

Thanks,
Rich
Post #1508847
Posted Monday, October 28, 2013 6:14 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: 2 days ago @ 9:53 PM
Points: 3,438, Visits: 5,390
rmechaber (10/28/2013)
Thanks Dwain.c.

No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.

How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.

Thanks,
Rich


I did say that "I agree that the behavior seems unusual" and I'll go further to say that your contention about the statistical significance of the NULLs is correct. You do always have the option to eliminate the NULLs with a WHERE clause, but obviously that doesn't change the counter-intuitive nature of Microsoft's implementation decision.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1509154
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse