SQLServerCentral Article

Not In v Not Equal

,

NOT IN vs. Not Equal (<>)

A coworker recently asked me which was "more efficient - a bunch of <> or a NOT IN clause?" The answer, I assumed, like almost all things relating to databases, is "it depends." The nice thing about that answer is that it not only camouflages my vast stores of ignorance, it is also quite often true.

This time, however, I thought I'd do some investigation to see if, in fact, we should prefer one method over the other in this case. The answer this time was a bit unexpected. It turns out that they are actually the same query, and it should make absolutely no difference. This is because SQL is a declarative language, meaning: you tell the computer what you want, not how to get it [1].

The query engine takes both of these queries and performs them with the exact same sequence of events. In actuality, I did find that one of the queries typically outperformed the other. I also discovered several other methods (written by people much smarter than myself) that were significantly quicker then either of the two methods I set out to test (the quickest method I tested was a full 35% faster than the quicker of the original two).

With all that in mind, let's set off to find out if NOT IN is quicker than a "bunch of <>" and see a few methods that improve in their performance. The original question was accompanied by a snippet of a WHERE clause evaluating the inequality of a column and four integer values:

   s.Status_SV <> 214  
   and  
    s.Status_SV <> 215  
   and  
    s.Status_SV <> 216  
   and  
    s.Status_SV <> 217

Now that we know we're dealing with integer data, we'll set up a table with an integer column, then we'll set up a timing and iteration framework for tracking our query times over multiple executions. Just to make our queries lengthy enough to be nicely measurable, I arbitrarily chose to insert 1,000,000. The test results were less conclusive at 10,000 rows (with one notable exception), but the trends we see at 1,000,000 rows are clearly discernable by the time we have 100,000 rows.

Let's set up our table:

   PRINT 'Creating sample table...'
   CREATE TABLE [dbo].[tbl_IN_VS_AND]
   (filterCriterion_sv int identity(1,1) PRIMARY KEY CLUSTERED NOT NULL)
   DECLARE @i int
   SET @i = 0
   PRINT 'Populating sample table (1.000.000 rows.  This could take several minutes)...'
   WHILE @i < 1000000
   BEGIN
      INSERT dbo.tbl_IN_VS_AND DEFAULT VALUES
      SET @i = @i + 1
   END

And our execution framework:

SET NOCOUNT ON 
   DECLARE @startTime datetime, @endTime datetime
   DECLARE @results int, @iterations int
   DECLARE @i int
   SET @iterations = 100
   ---- setup is finished.  let's run some tests
   PRINT 'Beginning test run...'
   SET @i = 0
   SET @startTime = getDate()
   WHILE @i < @iterations
   BEGIN
   /*
   test queries go within this loop
   */      SET @i = @i + 1
   END
   SET @endTime = getDate()
   PRINT 'Elapsed Time: ' + cast(datediff(ms, @startTime, @endTime) AS varchar) + ' ms'

Now we'll write two queries to exercise our original question (NOT IN vs. <>). We'll start with the NOT IN query, then follow up with the AND <> query:

   SELECT @results = count(filterCriterion_sv) 
   FROM tbl_IN_VS_AND
   WHERE filterCriterion_sv NOT IN (214, 215, 216, 217)
   SELECT @results = count(filterCriterion_sv) 
   FROM tbl_IN_VS_AND
   WHERE filterCriterion_sv <> 214
   AND filterCriterion_sv <> 215
   AND filterCriterion_sv <> 216
   AND filterCriterion_sv <> 217

The NOT IN() certainly wins for conciseness. Before running a lot of iterations, let's look at snippets of the query plan text so we know SQL is doing the same thing under the covers for each query.

NOT IN():
       |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1005],0)))
            |--Stream Aggregate(DEFINE:([globalagg1005]=SUM([partialagg1004])))
                 |--Parallelism(Gather Streams)
                      |--Stream Aggregate(DEFINE:([partialagg1004]=Count(*)))
                           |--Clustered Index 
Scan(OBJECT:([master].[dbo].[tbl_IN_VS_AND].[PK__tbl_IN_VS_AND__3B4BBA2E]),
 WHERE:([master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(214) AND 
 [master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(215) AND [

AND <>:

       |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[globalagg1005],0)))
            |--Stream Aggregate(DEFINE:([globalagg1005]=SUM([partialagg1004])))
                 |--Parallelism(Gather Streams)
                      |--Stream Aggregate(DEFINE:([partialagg1004]=Count(*)))
                           |--Clustered Index 
Scan(OBJECT:([master].[dbo].[tbl_IN_VS_AND].[PK__tbl_IN_VS_AND__3B4BBA2E]), 
WHERE:([master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(214) AND 
[master].[dbo].[tbl_IN_VS_AND].[filterCriterion_sv]<>(215) AND [

It turns out that SQL likes the AND <> so much, it converts the NOT IN() clause into a series of AND <> clauses. After 100 executions of each query, it seems that execution times for the AND <> query tend to be lower than those for the NOT IN query (conversion overhead, maybe?):

   Beginning first test run...
   "NOT IN" ET: 46170 ms
   Beginning second test run...
   "AND <>" ET: 42326 ms

I ran the same series of tests on another occasion and the NOT IN query consistently outperformed the AND <> query. The results regularly go back and forth, much like the heads and tails of a coin toss. So, I have managed to convince myself that, despite the two execution times listed above, these two queries are, indeed, the same query as far as SQL Server is concerned -- at least on this day, on this server, for these queries (I still gravitate toward the comforting ambiguity of "it depends").

I mentioned earlier that there were several queries that outperform our basic AND <> and NOT IN queries (on this server on this day). Let's take a look at some of those queries and their execution results. The first alternative technique doesn't use a WHERE clause to filter out our integer values. It places the integer values into a UNION query and does a LEFT OUTER JOIN against that to filter out unequal rows. Here is what that query looks like:

   SELECT @results = count(filterCriterion_sv) 
   FROM tbl_IN_VS_AND
   LEFT OUTER JOIN (
         SELECT 214 AS filterValue_val UNION
         SELECT 215 UNION
         SELECT 216 UNION
         SELECT 217 ) AS tbl 
      ON tbl_IN_VS_AND.filterCriterion_sv = tbl.filterValue_val
   WHERE tbl.filterValue_val IS NULL

It definitely feels odd, placing things you would normally put in a WHERE clause into a derived table then looking for absent values, but the performance benefit gives us a compelling reason to consider doing this. On this test run of 100 executions, this odd query was consistently outperforming the quicker of our original two queries by about 19%:

   Beginning fourth test run...
   "derived UNION table LEFT OUTER JOIN" ET: 34360 ms

Our last query happened to be our best performing (on this server, on this day). Like the previous query, this one uses a derived table. However, it takes it one step further and nests that inside an IF NOT EXISTS(). Let's take a look at it:

   SELECT @results = count(filterCriterion_sv) 
   FROM tbl_IN_VS_AND
   WHERE NOT EXISTS(SELECT * FROM
      (
         SELECT 214 AS filterValue_val UNION ALL
         SELECT 215 UNION ALL
         SELECT 216 UNION ALL
         SELECT 217 ) AS tbl 
      WHERE tbl.filterValue_val = tbl_IN_VS_AND.filterCriterion_sv )

And here is the time it took for 100 executions:

   Beginning seventh test run...
   "NOT EXISTS from derived UNION table" ET: 27920 ms

On this day, on this server, this query runs a full 35% faster than the quicker of our two original queries. Are there even faster ways to run this query? I would say the odds are pretty good that there are. However, I must admit that we have exhausted the current depth of my knowledge and experience, and that was only thanks to the derived table and NOT EXISTS techniques I discovered in some old newsgroup postings [2] [3]. I hope you'll be able to successfully adapt these techniques to your environment and always remember to do a little experimenting and draw performance conclusions based on testing with your servers and your data.

Footnotes

[1] - http://en.wikipedia.org/wiki/Declarative_programming

[2] - http://groups.google.com/group/microsoft.public.sqlserver.programming/browse_thread/thread/34101ff50413c7d6/45a02da0dc7b3ced

[3] - http://groups.google.com/group/microsoft.public.sqlserver.server/browse_thread/thread/dfb1d7e766d89069/ee59b6e9d94ca222

Rate

4.87 (23)

You rated this post out of 5. Change rating

Share

Share

Rate

4.87 (23)

You rated this post out of 5. Change rating