What's the logic of the NOT IN clause?

  • I thought I understood the "NOT IN" logic very well until I run into a case where the presence of NULL values affects the result in a way I was not expecting. The setup is super simple:

    CREATE TABLE A1(ID1 INT);
    CREATE TABLE A2(ID2 INT);

    INSERT INTO A1(ID1) VALUES (1), (2), (NULL);
    INSERT INTO A2(ID2) VALUES (1), (3), (NULL);

    The surprising (to me) fact is that the following query:

    SELECT ID1 FROM A1  
    WHERE ID1 NOT IN (SELECT ID2 FROM A2);

    returns no rows, when thought it should return the row where ID1 = 2. And changing the query to:

    SELECT ID1 FROM A1  
    WHERE ID1 NOT IN (SELECT ID2 FROM A2 WHERE ID2 IS NOT NULL);

    returns the (expected) row with ID1 =2.

    So it looks like "ID1 NOT IN " is translated as (ID1 <> 1 AND ID1 <> 3 AND ID1 <> NULL), instead of  NOT (ID1 = 1 OR ID1 = 3 OR ID1=NULL), is that right?

    • This topic was modified 1 year, 4 months ago by  Sam-263310.
  • The problem is documented (well, kind of) in the Microsoft Documentation.  They "cop" out on explaining exactly what could happen by "only" saying that having a NULL in the object of the IN can produce "unexpected results".

    https://learn.microsoft.com/en-us/sql/t-sql/language-elements/in-transact-sql

    If you look at the Nested Loops operator in the Actual Execution Plan, the reason becomes crystal clear... if you haven't messed up your settings in SQL Server, a NULL cannot be compared to anything else with a relational operator and so the predicate ends up looking like what's in the WHERE clause of the code below (I included the results from both tables so you can see why... change the * to  A2.* to see what's returned for A2)...

     SELECT * FROM A1,A2
    WHERE [A1].[ID1] IS NULL
    OR [A2].[ID2] IS NULL
    OR [A1].[ID1]=[A2].[ID2]

    That's also why your "NOT NULL" addition worked.

    Basically, a NULL in the object of the IN turns it into a FULL OUTER JOIN and something will ALWAYS be returned from A2, unless otherwise excluded like you did with your NOT NULL addition.

    Having a NULL in the outer table (A1 in this case), won't cause the same problem.  Any NULL in the object of the IN comparison operation will.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes, the query plan makes it clear, so I have to accept it, but the result still feels "wrong" to me because 2 is not in the result set (1,3,NULL). Anyway, it is what it is. Thank you.

    • This reply was modified 1 year, 4 months ago by  Sam-263310.
  • I agree... since NULL is not "nothing" nor even equal to another NULL, I don't know why they made it evaluate for NULL behind the scenes at all or why they didn't make it where IS NOT NULL AND...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just a quick thought, if one uses ISNULL one can lessen the internal cardinality from the NOT IN subquery that are pushed into the nested loop.

    😎

    Here is an example:

    USE TEEST;
    GO
    SET NOCOUNT ON;
    GO
    DECLARE @A1 TABLE (ID1 INT);
    DECLARE @A2 TABLE (ID2 INT);
    INSERT INTO @A1(ID1) VALUES (1), (2), (NULL);
    INSERT INTO @A2(ID2) VALUES (1), (3), (NULL);

    SELECT A1.ID1 FROM @A1 A1
    WHERE ISNULL(A1.ID1,-1) NOT IN
    (
    SELECT ISNULL(A2.ID2,-2) FROM @A2 A2
    )
    AND A1.ID1 IS NOT NULL;

    SELECT A1.ID1 FROM @A1 A1
    WHERE A1.ID1 NOT IN (SELECT A2.ID2 FROM @A2 A2 WHERE A2.ID2 IS NOT NULL);

    I must say that I find it unfortunate that SQL Server handles NULL as UNKNOWN of the lowest possible numerical value, how can an UNKNOWN value be handled as a value when it is UNKNOWN?

  • Maybe I am off-base here, but this really isn't about how Microsoft handles NULLs, rather it is more simply a logic problem.

    When using IN - match on ANY item in the list which translates to OR condition.

    When using NOT IN - does not match ALL items which translates to AND condition.

    For an OR condition, the NULL doesn't matter and we get a result if the value compared is in the list (unless the value is NULL since NULL <> NULL).  For the AND condition - nothing will ever match a NULL so it fails for any value being compared to the list.

    Another method of writing the NOT IN:

    SELECT A1.ID1 FROM @A1  A1  
    WHERE A1.ID1 <> ALL (SELECT A2.ID2 FROM @A2 A2);

    I may be wrong - but for me this isn't an implementation issue or something that MS got 'wrong'.

     

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • For me, it's very much something that MS got wrong.  They're even trying to make it so that the NULL settings are not optional and Front-Enders are ready to march on Redmond because of it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @JeffreyWilliams Yes, the query behaves as if one had written the AND condition you mention (see the last part of the original post, where I made the same observation), but "NOT IN" could also be evaluated by doing the "IN" condition first and then negating the result, and the question is then why isn't that the preferred way? notice that per Jeff Modem's answer, the actual query used by SQL Server is very different.

     

     

  • @Eirikur That works, of course, but SQL Server won't be able to use any index for that query.

  • You can always use EXCEPT

    SELECT ID1 FROM A1
    EXCEPT
    SELECT ID2 FROM A2
  • Sam-263310 wrote:

    @Eirikur That works, of course, but SQL Server won't be able to use any index for that query.

    Using indices is not necessarily the same as having better performance, this is a fine example of that exception.

    😎

    I'll rig up a test harness for demonstration and post it here 😉

  • Eirikur Eiriksson wrote:

    Sam-263310 wrote:

    @Eirikur That works, of course, but SQL Server won't be able to use any index for that query.

    Using indices is not necessarily the same as having better performance, this is a fine example of that exception. 😎 I'll rig up a test harness for demonstration and post it here 😉

    Ah.. while I totally understand and appreciate what you're saying there, be careful.  Just like Knuth's parable about "pre-optimization", a lot of people will take something like that and run in the totally wrong direction with it.  I'm very much looking forward to you demo.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Just putting the money where the mouth is 😉

    😎

    Test result sets:

    -----------------------------------------------------------------------------------------------------------------
    With non-clustered indices on both columns
    -----------------------------------------------------------------------------------------------------------------
    T_TXT CPU IDLE IO DURATION SAMPLE_SIZE
    -------------------- ---------------------- ---------------------- ---------------------- ----------- -----------
    ISNULL 62500 437500 0 78117 500000
    EXCEPT 62500 687500 0 93644 500000
    NOT IN 125000 1125000 0 140633 500000
    DRY RUN 281250 2468750 0 343712 500000
    CREATE INDICES 750000 2968750 31250 470331 500000
    CREATE TEST SET 687500 5812500 0 830595 500000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 125000 1125000 0 141133 1000000
    EXCEPT 125000 1125000 0 187602 1000000
    NOT IN 281250 2218750 0 297118 1000000
    DRY RUN 562500 4437500 0 641074 1000000
    CREATE INDICES 1500000 5250000 0 876663 1000000
    CREATE TEST SET 1437500 11562500 0 1657997 1000000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 562500 5187500 0 718777 5000000
    EXCEPT 812500 6187500 0 907878 5000000
    NOT IN 1437500 10562500 0 1516742 5000000
    CREATE INDICES 7843750 10843750 62500 2384447 5000000
    DRY RUN 3312500 20156250 31250 3038696 5000000
    CREATE TEST SET 7343750 59468750 187500 8578661 5000000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 1468750 10781250 0 1579692 10000000
    EXCEPT 1718750 13031250 0 1861461 10000000
    NOT IN 2750000 20750000 0 3033735 10000000
    CREATE INDICES 15843750 18125000 281250 4361081 10000000
    DRY RUN 6593750 40437500 218750 6041686 10000000
    CREATE TEST SET 14625000 118687500 187500 17113108 10000000
    -----------------------------------------------------------------------------------------------------------------
    Without non-clustered indices on both columns
    -----------------------------------------------------------------------------------------------------------------
    T_TXT CPU IDLE IO DURATION SAMPLE_SIZE
    -------------------- ---------------------- ---------------------- ---------------------- ----------- -----------
    ISNULL 31250 31250 0 15723 50000
    EXCEPT 31250 218750 0 31422 50000
    CREATE TEST SET 62500 687500 0 93818 50000
    NOT IN 187500 1562500 0 234265 50000
    DRY RUN 218750 2281250 0 312655 50000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 31250 218750 0 31449 100000
    EXCEPT 31250 468750 0 62694 100000
    CREATE TEST SET 187500 1062500 0 172295 100000
    NOT IN 625000 4375000 0 642562 100000
    DRY RUN 718750 5531250 0 799844 100000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 93750 656250 0 93739 500000
    EXCEPT 93750 656250 0 109749 500000
    CREATE TEST SET 750000 6000000 0 862358 500000
    NOT IN 10218750 76000000 31250 11033902 500000
    DRY RUN 10812500 80156250 31250 11676433 500000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 156250 1343750 0 171763 1000000
    EXCEPT 218750 1281250 0 203106 1000000
    CREATE TEST SET 1406250 11843750 0 1704699 1000000
    NOT IN 38343750 286093750 62500 41519872 1000000
    DRY RUN 39468750 289968750 62500 42194149 1000000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 812500 6406250 31250 923459 5000000
    EXCEPT 1218750 9031250 0 1314141 5000000
    CREATE TEST SET 7343750 58593750 62500 8431584 5000000
    NOT IN 596500000 4462312500 437500 647572582 5000000
    DRY RUN 599437500 4476343750 468750 649767308 5000000
    -----------------------------------------------------------------------------------------------------------------
    ISNULL 2375000 17625000 0 2563753 10000000
    EXCEPT 2437500 18531250 31250 2690276 10000000
    CREATE TEST SET 17750000 139031250 218750 20103493 10000000
    NOT IN 1215656250 9075781250 812500 1317398739 10000000
    DRY RUN 1220781250 9103218750 1000000 1321598135 10000000
    -----------------------------------------------------------------------------------------------------------------

    And the test harness:

    USE TEEST;
    GO
    SET NOCOUNT ON;
    GO
    DECLARE @SAMPLE_SIZE INT = 10000000;
    DECLARE @NULL_SEED_1 INT = 5;
    DECLARE @NULL_SEED_2 INT = 7;
    DECLARE @INT_BUCKET1 INT = 0;
    DECLARE @INT_BUCKET2 INT = 0;
    DECLARE @TIMER TABLE (
    T_TXT VARCHAR(20) NOT NULL
    ,T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME())
    ,T_CPU FLOAT NOT NULL DEFAULT (@@CPU_BUSY * CONVERT(FLOAT,@@TIMETICKS,0))
    ,T_IDLE FLOAT NOT NULL DEFAULT (@@IDLE * CONVERT(FLOAT,@@TIMETICKS,0))
    ,T_IO FLOAT NOT NULL DEFAULT (@@IO_BUSY * CONVERT(FLOAT,@@TIMETICKS,0))
    );

    --/*
    IF OBJECT_ID('dbo.TBL_TEST_NOTIN_1') IS NOT NULL DROP TABLE dbo.TBL_TEST_NOTIN_1;
    CREATE TABLE dbo.TBL_TEST_NOTIN_1
    (
    T_NUM INT NULL
    )
    ;
    IF OBJECT_ID('dbo.TBL_TEST_NOTIN_2') IS NOT NULL DROP TABLE dbo.TBL_TEST_NOTIN_2;
    CREATE TABLE dbo.TBL_TEST_NOTIN_2
    (
    T_NUM INT NULL
    )
    ;
    INSERT INTO @TIMER(T_TXT) VALUES ('CREATE TEST SET');
    ;WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE)
    ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
    INSERT INTO dbo.TBL_TEST_NOTIN_1 WITH (TABLOCKX) (T_NUM)
    SELECT
    CASE
    WHEN NM.N % @NULL_SEED_1 > 0 THEN NM.N
    ELSE NULL
    END
    FROM NUMS NM
    ;
    ;WITH T(N) AS (SELECT X.N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) X(N))
    , NUMS(N) AS (SELECT TOP(@SAMPLE_SIZE)
    ROW_NUMBER() OVER (ORDER BY @@VERSION) AS N FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)
    INSERT INTO dbo.TBL_TEST_NOTIN_2 WITH (TABLOCKX) (T_NUM)
    SELECT
    CASE
    WHEN NM.N % @NULL_SEED_2 > 0 THEN NM.N
    ELSE NULL
    END
    FROM NUMS NM
    ;
    INSERT INTO @TIMER(T_TXT) VALUES ('CREATE TEST SET');

    INSERT INTO @TIMER(T_TXT) VALUES ('CREATE INDICES');
    CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_TEST_NOTIN_1_T_NUM ON dbo.TBL_TEST_NOTIN_1 (T_NUM)
    CREATE NONCLUSTERED INDEX NCLIDX_DBO_TBL_TEST_NOTIN_2_T_NUM ON dbo.TBL_TEST_NOTIN_2 (T_NUM)
    INSERT INTO @TIMER(T_TXT) VALUES ('CREATE INDICES');
    -- */
    INSERT INTO @TIMER(T_TXT) VALUES ('DRY RUN');
    SELECT
    @INT_BUCKET1 = A1.T_NUM
    ,@INT_BUCKET2 = A2.T_NUM
    FROM dbo.TBL_TEST_NOTIN_1 A1
    LEFT OUTER JOIN dbo.TBL_TEST_NOTIN_2 A2
    ON A1.T_NUM = A2.T_NUM;
    INSERT INTO @TIMER(T_TXT) VALUES ('DRY RUN');

    INSERT INTO @TIMER(T_TXT) VALUES ('NOT IN');
    SELECT
    @INT_BUCKET1 = A1.T_NUM
    FROM dbo.TBL_TEST_NOTIN_1 A1
    WHERE A1.T_NUM NOT IN (SELECT A2.T_NUM FROM dbo.TBL_TEST_NOTIN_2 A2 WHERE A2.T_NUM IS NOT NULL)
    OPTION (MAXDOP 1);
    INSERT INTO @TIMER(T_TXT) VALUES ('NOT IN');

    INSERT INTO @TIMER(T_TXT) VALUES ('EXCEPT');
    ;WITH EXCEPT_SET(T_NUM) AS
    (
    SELECT
    A1.T_NUM
    FROM dbo.TBL_TEST_NOTIN_1 A1
    EXCEPT
    SELECT A2.T_NUM FROM dbo.TBL_TEST_NOTIN_2 A2
    )
    SELECT
    @INT_BUCKET1 = ES.T_NUM
    FROM EXCEPT_SET ES
    OPTION (MAXDOP 1);
    INSERT INTO @TIMER(T_TXT) VALUES ('EXCEPT');

    INSERT INTO @TIMER(T_TXT) VALUES ('ISNULL');
    SELECT
    @INT_BUCKET1 = A1.T_NUM
    FROM dbo.TBL_TEST_NOTIN_1 A1
    WHERE ISNULL(A1.T_NUM,-1) NOT IN
    (
    SELECT ISNULL(A2.T_NUM,-2) FROM dbo.TBL_TEST_NOTIN_2 A2
    )
    AND A1.T_NUM IS NOT NULL
    OPTION (MAXDOP 1);
    INSERT INTO @TIMER(T_TXT) VALUES ('ISNULL');


    SELECT
    T.T_TXT
    ,MAX(T.T_CPU) - MIN(T.T_CPU) AS CPU
    ,MAX(T.T_IDLE) - MIN(T.T_IDLE) AS IDLE
    ,MAX(T.T_IO) - MIN(T.T_IO) AS IO
    ,DATEDIFF(MICROSECOND,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION
    ,@SAMPLE_SIZE AS SAMPLE_SIZE
    FROM @TIMER T
    GROUP BY T.T_TXT
    ORDER BY DURATION ASC;

    Please play around with the tests, add your tweaks and have fun 😉

    Edit: Had to correct code pasting errors.

  • Jeff Moden wrote:

    Eirikur Eiriksson wrote:

    Sam-263310 wrote:

    @Eirikur That works, of course, but SQL Server won't be able to use any index for that query.

    Using indices is not necessarily the same as having better performance, this is a fine example of that exception. 😎 I'll rig up a test harness for demonstration and post it here 😉

    Ah.. while I totally understand and appreciate what you're saying there, be careful.  Just like Knuth's parable about "pre-optimization", a lot of people will take something like that and run in the totally wrong direction with it.  I'm very much looking forward to you demo.

    Now when are we going to meet up so you can buy me a burger and I can show you the demo 😉

    😎

     

  • DesNorton wrote:

    You can always use EXCEPT

    SELECT ID1 FROM A1
    EXCEPT
    SELECT ID2 FROM A2

    Be aware that the EXCEPT method adds a stream aggregate operator that is more costly than the scalar operators in the ISNULL plan;)

    😎

     

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply