

Say Hey Kid
Group: General Forum Members
Last Login: Tuesday, October 28, 2014 10:41 AM
Points: 706,
Visits: 3,021


I just discovered the hard way (on 2005, replicated on 2008R2) that, if you compute NTILE() on a column of data in which some of the values are NULL, all of the NULL inputs will be assigned to a tile. This makes no sense to me: NULL means unknown, so I was expecting SQL Server to skip those rows and compute NTILE() only for nonNULL rows and assign those tiles based solely on the nonNULL values. Nope!
What this means in practice: NULLs apparently get assigned to the lowest tile, which pushes all of your nonNULL, real data up in ranking and will almost certainly assign some of them to a higher tile than you might expect.
Here's code to reproduce the behavior:
 NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,  displacing other values to higher tiles.
IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t; CREATE TABLE #t ( ID INT IDENTITY(1,1) PRIMARY KEY, val INT NULL );
 INSERT 10 values: 3 NULL, 7 integers INSERT INTO #t SELECT NULL; INSERT INTO #t SELECT NULL; INSERT INTO #t SELECT NULL; INSERT INTO #t SELECT 5; INSERT INTO #t SELECT 7; INSERT INTO #t SELECT 8; INSERT INTO #t SELECT 9; INSERT INTO #t SELECT 9; INSERT INTO #t SELECT 11; INSERT INTO #t SELECT 12;
 Compute quartiles.  With 7 nonNULL values, this should have 2221 rows per quartile and the NULLs ignored (to my thinking).  Instead, we get back tiles based on 10 rows of input, with 3322 rows per quartile.  The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1: SELECT *, NTILE(4) OVER (ORDER BY val) As Q FROM #t;
and here's what the query returns: ID val Q  1 NULL 1 2 NULL 1 3 NULL 1 4 5 2 5 7 2 6 8 2 7 9 3 8 9 3 9 11 4 10 12 4
I see no mention of this on BOL.
Is this common knowledge and I'm just slow? I could not find any discussion of this anywhere.
Thanks if anyone can explain to me why this works this way, Rich




Hall of Fame
Group: General Forum Members
Last Login: Yesterday @ 3:02 AM
Points: 3,421,
Visits: 5,361


Remarks from BOL:
If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.
So it seems that only the total number of rows is important not the total number of nonNULL rows.
I agree that the behavior seems unusual but I don't see it being in conflict with the remarks.
Note that ROW_NUMBER() works the same way.
My mantra: No loops! No CURSORs! No RBAR! Hoouh!
My thought question: Have you ever been told that your query runs too fast?
My advice: INDEXing a poorperforming query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it? The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead? Since random numbers are too important to be left to chance, let's generate some! Learn to understand recursive CTEs by example. Splitting strings based on patterns can be fast!




Say Hey Kid
Group: General Forum Members
Last Login: Tuesday, October 28, 2014 10:41 AM
Points: 706,
Visits: 3,021


Thanks Dwain.c.
No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.
How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.
Thanks, Rich




Hall of Fame
Group: General Forum Members
Last Login: Yesterday @ 3:02 AM
Points: 3,421,
Visits: 5,361


rmechaber (10/28/2013) Thanks Dwain.c.
No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.
How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.
Thanks, Rich
I did say that "I agree that the behavior seems unusual" and I'll go further to say that your contention about the statistical significance of the NULLs is correct. You do always have the option to eliminate the NULLs with a WHERE clause, but obviously that doesn't change the counterintuitive nature of Microsoft's implementation decision.
My mantra: No loops! No CURSORs! No RBAR! Hoouh!
My thought question: Have you ever been told that your query runs too fast?
My advice: INDEXing a poorperforming query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it? The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.
Need to UNPIVOT? Why not CROSS APPLY VALUES instead? Since random numbers are too important to be left to chance, let's generate some! Learn to understand recursive CTEs by example. Splitting strings based on patterns can be fast!



