in clause vs not in clause

  • Guys,

    I have an employee table in which each employee is categorized by employee type (up to 100 types). Currently I have 50 mill rows

    in the table I run the following query against the database.

    SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL

    FROM EMPLOYEE WHERE EMP_TYP IN (1, 2, 4, ....) - UPTO 70 EMP TYPES IN "IN CLAUSE"

    If I would change the above query to do not "in clause"

    SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL

    FROM EMPLOYEE WHERE EMP_TYP not IN (11, 23, 49, ....) - UPTO 30 EMP TYPES IN "IN CLAUSE"

    I did not notice any big difference in the performance. Is there any difference between how the SQL engine treats "IN CLAUSE" VS

    "CLAUSE" shouldnt the second query be faster.

    Any suggestions and inputs would help

    Thanks

  • hmmmm...

    " Not in " will take more time.

  • To answer this question, you should compare the execution plans of the two two queries. You might be surprised to see that both are doing a clustered index or table scan.

  • bang725 (11/3/2008)


    hmmmm...

    " Not in " will take more time.

    Heh... prove it. Got code? 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/3/2008)


    bang725 (11/3/2008)


    hmmmm...

    " Not in " will take more time.

    Heh... prove it. Got code? 😉

    With no code, no idea of the data involved, my swag is both will take about the same amount of time as both will be doing a clustered index/table scan.

    What say you Jeff, got a swag?

  • Yeah... my first swag would be that this might be a boondoggle... tell me which company in the world has 50 million employees as the op suggests. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • My second swag would be that, as you suggest, both will cause some sort of index scan.... but because the WHERE NOT IN has only 42.857% as many items to compare, the WHERE NOT IN will win the race.

    But, let's stop guessing... I'm not gonna build a 50 million row test table, but I will build a million rows test table...

    --===== Create and populate a 1,000,000 row test table.

    -- Column "RowNum" has a range of 1 to 1,000,000 unique numbers

    -- Column "SomeInt" has a range of 1 to 100 non-unique numbers

    -- Column "SomeLetters2" has a range of "AA" to "ZZ" non-unique 2 character strings

    -- Column "SomeMoney has a range of 0.0000 to 99.9999 non-unique numbers

    -- Column "SomeDate" has a range of >=01/01/2000 and <01/01/2010 non-unique date/times

    -- Column "SomeCSV" contains 'Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10'

    -- for all rows.

    -- Column "SomeHex12" contains 12 random hex characters (ie, 0-9,A-F)

    -- Jeff Moden

    SELECT TOP 1000000

    RowNum = IDENTITY(INT,1,1),

    SomeInt = ABS(CHECKSUM(NEWID()))%100+1,

    SomeLetters2 = CHAR(ABS(CHECKSUM(NEWID()))%26+65)

    + CHAR(ABS(CHECKSUM(NEWID()))%26+65),

    SomeCSV = CAST('Part01,Part02,Part03,Part04,Part05,Part06,Part07,Part08,Part09,Part10' AS VARCHAR(80)),

    SomeMoney = CAST(ABS(CHECKSUM(NEWID()))%10000 /100.0 AS MONEY),

    SomeDate = CAST(RAND(CHECKSUM(NEWID()))*3653.0+36524.0 AS DATETIME),

    SomeHex12 = RIGHT(NEWID(),12)

    INTO dbo.JBMTest

    FROM Master.dbo.SysColumns t1

    CROSS JOIN Master.dbo.SysColumns t2

    --===== Add a Clustered Primary Key for the test

    ALTER TABLE dbo.JBMTest

    ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== Add an index on the lookup column for the test

    CREATE INDEX IX_JBMTest_SomeInt ON dbo.JBMTest (SomeInt)

    ... and then we can play. I know that we could do this with "BETWEEN" or some such, but that wouldn't be testing what the op wanted to know... so here goes...

    SET STATISTICS TIME ON

    SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

    FROM dbo.JBMTest

    WHERE SomeInt IN (1,2,3,4,5,6,7,8,9,10,

    11,12,13,14,15,16,17,18,19,20,

    21,22,23,24,25,26,27,28,29,30,

    31,32,33,34,35,36,37,38,39,40,

    41,42,43,44,45,46,47,48,49,50,

    51,52,53,54,55,56,57,58,59,60,

    61,62,63,64,65,66,67,68,69,70)

    SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

    FROM dbo.JBMTest

    WHERE SomeInt NOT IN (71,72,73,74,75,76,77,78,79,80,

    81,82,83,84,85,86,87,88,89,90,

    91,92,93,94,95,96,97,98,99,100)

    SET STATISTICS TIME OFF

    From that, I get the following results...

    [font="Courier New"]

    (699558 row(s) affected)

    SQL Server Execution Times:

    CPU time = 4641 ms, elapsed time = 35836 ms.

    (699558 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3469 ms, elapsed time = 35402 ms.[/font]

    As you can see, the CPU didn't work very hard at all... the time to display is the killer and is about equal. But, do notice the CPU worked more than a second less on the WHERE NOT IN. Actual Execution Plan explains why.

    The real key here is that no matter which method you use to return 70% of the rows of a 50 million table, you'll probably run the machine out of memory and the DBA is probably going to chase you until your heals smoke or (s)he manages to get a well aimed pork chop off. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/4/2008)


    Yeah... my first swag would be that this might be a boondoggel... tell me which company in the world has 50 million employees as the op suggests. 😉

    China, Inc.? 😀

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Heh... or maybe "Stormtrooper Clones'R'Us" 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • am (11/3/2008)


    Guys,

    I have an employee table in which each employee is categorized by employee type (up to 100 types). Currently I have 50 mill rows

    in the table I run the following query against the database.

    SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL

    FROM EMPLOYEE WHERE EMP_TYP IN (1, 2, 4, ....) - UPTO 70 EMP TYPES IN "IN CLAUSE"

    If I would change the above query to do not "in clause"

    SELECT FNAME, MNAME, LNAME, SSN, EMPDATE, DOB, SEX, SAL

    FROM EMPLOYEE WHERE EMP_TYP not IN (11, 23, 49, ....) - UPTO 30 EMP TYPES IN "IN CLAUSE"

    I did not notice any big difference in the performance. Is there any difference between how the SQL engine treats "IN CLAUSE" VS

    "CLAUSE" shouldnt the second query be faster.

    Any suggestions and inputs would help

    Thanks

    It only depends on the number of rows you have in the dB and the number of resultant rows after applying the condition

    kshitij kumar
    kshitij@krayknot.com
    www.krayknot.com

  • krayknot (11/4/2008)


    It only depends on the number of rows you have in the dB and the number of resultant rows after applying the condition

    Absolutely not true. Take a look at the test code I posted... both return the exact same number of rows.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Actually Jeff - I do think there's something more afoot. I think you've managed to get it to use seeks because the IN's and the NOT IN's are all contiguous.

    If you were to try:

    SET STATISTICS TIME ON

    SELECT count(RowNum)

    FROM dbo.JBMTest

    WHERE SomeInt IN (2,4,6,8,10,

    12,14,16,18,20,

    22,24,26,28,30,

    32,34,36,38,40,

    42,44,46,48,50,

    52,54,56,58,60,

    62,64,66,68,70,

    72,74,76,78,80,

    82,84,86,88,90,

    92,94,96,98,100)

    SELECT count(RowNum)

    FROM dbo.JBMTest

    WHERE SomeInt NOT IN (1,3,5,7,9,

    11,13,15,17,19,

    21,23,25,27,29,

    31,33,35,37,39,

    41,43,45,47,49,

    51,53,55,57,59,

    61,63,65,67,69,

    71,73,75,77,79,

    81,83,85,87,89,

    91,93,95,97,99)

    SELECT count(RowNum)

    FROM dbo.JBMTest

    WHERE SomeInt IN (1,3,5,7,9,

    11,13,15,17,19,

    21,23,25,27,29,

    31,33,35,37,39,

    41,43,45,47,49,

    51,53,55,57,59,

    61,63,65,67,69,

    71,73,75,77,79,

    81,83,85,87,89,

    91,93,95,97,99)

    SELECT count(RowNum)

    FROM dbo.JBMTest

    WHERE SomeInt NOT IN (2,4,6,8,10,

    12,14,16,18,20,

    22,24,26,28,30,

    32,34,36,38,40,

    42,44,46,48,50,

    52,54,56,58,60,

    62,64,66,68,70,

    72,74,76,78,80,

    82,84,86,88,90,

    92,94,96,98,100)

    SET STATISTICS TIME OFF

    All of a sudden, you get:

    --IN evens

    SQL Server Execution Times:

    CPU time = 63 ms, elapsed time = 65 ms.

    --NOT IN odds

    SQL Server Execution Times:

    CPU time = 1170 ms, elapsed time = 595 ms.

    --IN odds

    SQL Server Execution Times:

    CPU time = 62 ms, elapsed time = 62 ms.

    --NOT IN evens.

    SQL Server Execution Times:

    CPU time = 1139 ms, elapsed time = 593 ms.

    So - it's not always the same.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Do your tests all come back with the same rowcounts? Remember... it's random data and SomeInt has different counts for different values.

    And, no... I didn't get any Index Seeks... both were Clustered Index Scans.

    I'll try your code to see what's going on there... thanks, Matt.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/5/2008)


    Do your tests all come back with the same rowcounts? Remember... it's random data and SomeInt has different counts for different values.

    And, no... I didn't get any Index Seeks... both were Clustered Index Scans.

    I'll try your code to see what's going on there... thanks, Matt.

    I forgot that you were returning most of the fields and a lot of the rows (we must be over the "tipping point" which is causing the scan). In mine - the NOT IN still keeps scanning, but the IN seeks.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Heh... I see now... first, of course the rowcounts will be/are identical. And, of course your testing shows that WHERE IN will be much faster... you changed the problem from a row retrieval to a simple count. If you go back to the original problem of returning rows, WHERE NOT IN still wins even using your good 50/50 criteria... 🙂

    SET STATISTICS TIME ON

    SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

    FROM dbo.JBMTest

    WHERE SomeInt IN (2,4,6,8,10,

    12,14,16,18,20,

    22,24,26,28,30,

    32,34,36,38,40,

    42,44,46,48,50,

    52,54,56,58,60,

    62,64,66,68,70,

    72,74,76,78,80,

    82,84,86,88,90,

    92,94,96,98,100)

    SELECT RowNum,SomeInt,SomeLetters2,SomeCSV,SomeMoney,SomeDate,SomeHex12

    FROM dbo.JBMTest

    WHERE SomeInt NOT IN (1,3,5,7,9,

    11,13,15,17,19,

    21,23,25,27,29,

    31,33,35,37,39,

    41,43,45,47,49,

    51,53,55,57,59,

    61,63,65,67,69,

    71,73,75,77,79,

    81,83,85,87,89,

    91,93,95,97,99)

    SET STATISTICS TIME OFF

    [font="Courier New"]

    (500741 row(s) affected)

    SQL Server Execution Times:

    CPU time = 4719 ms, elapsed time = 26690 ms.

    (500741 row(s) affected)

    SQL Server Execution Times:

    CPU time = 3765 ms, elapsed time = 25579 ms.

    [/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply