RE: Odd (to me) Behavior with NTILE() and NULL Values in Source Data

SSC Guru

Points: 61136

February 22, 2016 at 1:18 pm

EdgeOfKnowWare (2/22/2016)
Alan, I agree with you, if you're just wanting one ranking. If we expand upon the Original Posters issue (which is what I was facing) and say that you want to different groupings, how would you handle that?
So, let's take the example and add a column.
--NTILE (and others, presumably) gotcha: NULL inputs are included as belonging to the bottom tile,
--displacing other values to higher tiles.
IF OBJECT_ID('tempdb.dbo.#t', 'U') IS NOT NULL DROP TABLE #t;
CREATE TABLE #t
(
ID INT IDENTITY(1,1) PRIMARY KEY,
val INT NULL,
val2 INT NULL
);
--INSERT 10 values: 3 NULL, 7 integers
INSERT INTO #t SELECT NULL,1;
INSERT INTO #t SELECT NULL,2;
INSERT INTO #t SELECT NULL,3;
INSERT INTO #t SELECT 5,4;
INSERT INTO #t SELECT 7,5;
INSERT INTO #t SELECT 8,6;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 9,NULL;
INSERT INTO #t SELECT 11,NULL;
INSERT INTO #t SELECT 12,NULL;
--Compute quartiles.
--With 7 non-NULL values, this should have 2-2-2-1 rows per quartile and the NULLs ignored (to my thinking).
--Instead, we get back tiles based on 10 rows of input, with 3-3-2-2 rows per quartile.
--The lowest real values of 5 and 7 are in Q2, not Q1, b/c the 3 NULLs are all the rows in Q1:
Suppose you want the results of this:
IDvalQval2Q2
1NULLNULL11
2NULLNULL21
3NULLNULL32
45142
57153
68264
792NULLNULL
893NULLNULL
9113NULLNULL
10124NULLNULL
How would you write the query then? That's the direction I was going.

For something like that I would do this:

WITH

Q1 AS

(

SELECT

ID,

val,

Q = NULL,

val2

FROM #t

WHERE val IS NULL

UNION ALL

SELECT

ID,

val,

Q = NTILE(4) OVER (ORDER BY val),

val2

FROM #t

WHERE val IS NOT NULL

),

Q2 AS

(

SELECT ID, val2, Q2 = NTILE(4) OVER (ORDER BY val2)

FROM Q1

WHERE val2 IS NOT NULL

)

SELECT

Q1.ID,

val,

Q,

Q1.val2,

Q2

FROM Q1

LEFT JOIN Q2 ON Q1.ID = Q2.ID;

If you went that route you would really want this index:

CREATE NONCLUSTERED INDEX nc_t_val1_temp ON #t(val, val2) INCLUDE (ID);

and would be able to further improve performance using this index:

CREATE NONCLUSTERED INDEX nc_t_val2_temp ON #t(val2) INCLUDE (ID, val);

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001