SQLServerCentral / SQL Server 2008 - General / SQL Server 2008 / Odd (to me) Behavior with NTILE() and NULL Values in Source Data / Latest PostsInstantForum.NET v2.9.0SQLServerCentralhttp://www.sqlservercentral.com/Forums/notifications@sqlservercentral.comThu, 17 Apr 2014 01:33:14 GMT20RE: Odd (to me) Behavior with NTILE() and NULL Values in Source Datahttp://www.sqlservercentral.com/Forums/Topic1508622-391-1.aspx[quote][b]rmechaber (10/28/2013)[/b][hr]Thanks Dwain.c.No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.Thanks,Rich[/quote]I did say that "I agree that the behavior seems unusual" and I'll go further to say that your contention about the statistical significance of the NULLs is correct. You do always have the option to eliminate the NULLs with a WHERE clause, but obviously that doesn't change the counter-intuitive nature of Microsoft's implementation decision.Mon, 28 Oct 2013 18:14:45 GMTdwain.cRE: Odd (to me) Behavior with NTILE() and NULL Values in Source Datahttp://www.sqlservercentral.com/Forums/Topic1508622-391-1.aspxThanks Dwain.c.No, I don't see the behavior being at odds with BOL either, but I do see it being at odds with computational integrity and statistics. And BOL has a serious omission that doesn't explain this at all.How can you compute a rank of NULL values? Why put all the NULL values at the bottom of the list when ordering to compute NTILE()? You could just as easily put them all at the top or scattered through the middle or lumped into the N'th tile where N is the day number of the week or.... OK, I do see that putting them all at the bottom is behavior that is repeatable and predictable. But it's very much like saying that the median salary of your survey's population was $0.00 b/c 51% of the survey respondents didn't provide any salary information in their answers.Thanks,RichMon, 28 Oct 2013 06:50:30 GMTrmechaberRE: Odd (to me) Behavior with NTILE() and NULL Values in Source Datahttp://www.sqlservercentral.com/Forums/Topic1508622-391-1.aspxRemarks from BOL:[I]If the number of rows in a partition is not divisible by integer_expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each. If on the other hand the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.[/I]So it seems that only the total number of rows is important not the total number of non-NULL rows.I agree that the behavior seems unusual but I don't see it being in conflict with the remarks.Note that ROW_NUMBER() works the same way.Sun, 27 Oct 2013 18:47:59 GMTdwain.cOdd (to me) Behavior with NTILE() and NULL Values in Source Datahttp://www.sqlservercentral.com/Forums/Topic1508622-391-1.aspxI just discovered the hard way (on 2005, replicated on 2008R2) that, if you compute NTILE() on a column of data in which some of the values are NULL, all of the NULL inputs will be assigned to a tile. This makes no sense to me: NULL means unknown, so I was expecting SQL Server to skip those rows and compute NTILE() only for non-NULL rows and [u]assign those tiles based solely on the non-NULL values[/u]. Nope!What this means in practice: NULLs apparently get assigned to the lowest tile, which pushes all of your non-NULL, real data up in ranking and will almost certainly assign some of them to a higher tile than you might expect.Here's code to reproduce the behavior:[code="sql"]-- NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,-- displacing other values to higher tiles.IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;CREATE TABLE #t ( ID INT IDENTITY(1,1) PRIMARY KEY, val INT NULL );-- INSERT 10 values: 3 NULL, 7 integers INSERT INTO #t SELECT NULL; INSERT INTO #t SELECT NULL; INSERT INTO #t SELECT NULL; INSERT INTO #t SELECT 5; INSERT INTO #t SELECT 7; INSERT INTO #t SELECT 8; INSERT INTO #t SELECT 9; INSERT INTO #t SELECT 9; INSERT INTO #t SELECT 11; INSERT INTO #t SELECT 12;-- Compute quartiles.-- With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).-- Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.-- The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1: SELECT *, NTILE(4) OVER (ORDER BY val) As Q FROM #t;[/code]and here's what the query returns:[font="Courier New"]ID val Q---------1 NULL 12 NULL 13 NULL 14 5 25 7 26 8 27 9 38 9 39 11 410 12 4[/font]I see no mention of this on [url=http://technet.microsoft.com/en-us/library/ms175126%28v=sql.105%29.aspx]BOL[/url].Is this common knowledge and I'm just slow? I could not find any discussion of this anywhere.Thanks if anyone can explain to me why this works this way,RichFri, 25 Oct 2013 16:26:19 GMTrmechaber