The Pitfall of "Not Equal To" Operator in Queries!

  • The issue isn't the use of <> or != (which I prefer the later). The pitfall is in presumption of the data and our understanding of the question.

    The question being asked that fits your final solution is

    "Which students haven't taken the %s exam?"

    In your first solution the question is more to the effect

    "List the students who have taken exams each time but do not show the times the %s exam was taken."

    If you had used left join

    "List the students who have taken exams each time but do not show the times the %s exam was taken and make sure to show those who haven't taken an exam at all at least once."

    The biggest issue with generating data for reports is always making sure you understand the intended result fully and structure the query correctly. Some syntax is more prone than others to creating the structure that represents the question you want to pose. As always thou there can be dozens of ways to ask the same question and get the correct answer, then it is a matter of most effective way.

  • Sorry, still a good contribution thou.

  • y <> x and x <> z then y could possibly = z

    This is not a pitfall, but merely an oversight sometimes.

     

  • "Actually comparisons of a NULL (or a non-NULL value) to a NULL results in UNKNOWN.  Your point is taken though:  The WHERE clause only returns results for rows comparisons that evaluate to TRUE.  Comparisons that evaluate to FALSE or UNKNOWN are dropped from the results."

    Well now we're splitting hairs!  But, Mike is absolutely right.  BOL defines the result for those comparisons to be "UNKNOWN".  I've always considered NULL and UNKNOWN to mean the same thing and, interestingly enough, when you look up "unknown data" in BOL you get the "Null Values" overview page!  There must be some subtle distinction though, otherwise the authors would have used NULL instead of UNKNOWN.

    Sorry to take this thread so far off topic, but here's an interesting puzzler (while we're splitting hairs).  I apologize if this topic has already been covered elsewhere.

    I refrained from bringing up the ANSI_NULLS setting before, but it adds an interesting twist.  According to BOL:

    When SET ANSI_NULLS is ON, a comparison in which one or more of the expressions is NULL does not yield either TRUE or FALSE; it yields UNKNOWN.

    and

    When ANSI_NULLS is OFF, comparisons such as ColumnA = NULL return TRUE when ColumnA contains a null value and FALSE when ColumnA contains some value besides NULL. Also, a comparison of two expressions that have both evaluated to null values yields TRUE.

    To confirm this, let's try some simple TSQL.  Running the following produces what you'd expect (after reading BOL).

    set ANSI_NULLS OFF

    declare @string varchar(25)

    set @string = 'TEST'

    if @string <> NULL --compare a variable to explicit NULL

    print 'true'

    else

    print 'not true'

    set @string = NULL

    if @string <> 'TEST' --compare a NULL variable to an explicit value

    print 'true'

    else

    print 'not true'

    SELECT 'true' where @string <> 'TEST' --try a comparison in a select.

    Results

    ----

    true

    true

    true

    (1 row(s) affected)

    set ANSI_NULLS ON and you get

    Results

    ----

    not true

    not true

    (0 row(s) affected)

    So far so good.  It looks like all we have to do to retrieve Bob in Ehsan's example is set ANSI_NULLS OFF, right?

    set ANSI_NULLS OFF

    SELECT * FROM Students s left outer join StudentExam se ON s.StID = se.StID WHERE se.ExamName<>'SQL Server'

    Results

    ----

    StID StName StID ExamName

    2 Anna 2 VB.NET

    2 Anna 2 C#.NET

    1 Jack 1 XML

    (3 row(s) affected)

    Huh? Where is Bob?  Hmmm.  Let's continue to use ANSI_NULLS OFF and look for ExamName = NULL

    set ANSI_NULLS OFF

    SELECT * FROM Students s left outer join StudentExam se ON s.StID = se.StID WHERE se.ExamName=NULL

    Results

    ----

    StID StName StID ExamName

    3 Bob  

    (1 row(s) affected)

    There's Bob! 

    WHAT'S GOING ON HERE?  How come se.ExamName=NULL evaluates to TRUE but se.ExamName<>'SQL Server' doesn't (with ANSI_NULLS OFF).

    Is it because the NULL is "virtual" in the left join.  Let's try one more experiment:

    Insert Students VALUES (4,NULL)

    set ANSI_NULLS OFF

    SELECT * FROM STUDENTS WHERE StName = NULL

    Results

    ----

    StID StName

    (1 row(s) affected)

    So far so good.  Let's try the dratted "not equals to" operator that started this whole mess:

    set ANSI_NULLS OFF

    SELECT * FROM STUDENTS WHERE StName <> 'Nobody'

    Results

    ----

    StID StName

    1 Jack

    2 Anna

    3 Bob

    (3 row(s) affected)

    WHAT?  Where's the new row we just inserted?  Must be that pesky "not equals to" operator.  Double negative anyone?

    set ANSI_NULLS OFF

    SELECT * FROM STUDENTS WHERE not StName <> NULL

    Results

    ----

    StID StName

    (1 row(s) affected)

    Sure, there it is.  So it must not be the "not equals to" operator?

    My advice to anyone still reading:  BE VERY CAREFUL WITH "SET ANSI_NULLS OFF" AND ANY COMPARISONS WITH NULLS AND BE ESPECIALLY CAREFUL WITH THE <> OPERATOR AND NULLS.

     

  • The best way to think about it as "NOT EXISTS". This is the most logical human approach. You can always convert "not exists" into "left join"

     select s.* from #students s

    where not exists

    (

    select 1 from #StudentExam where stid=s.stid and Examname='SQL Server'

    )

     

  • DISTINCT is definitely a bad idea, performance-wise. Avoid if at all possible as it usually involves having to sort the result set and then scan through it, making lots of unnecessary overhead that can often be avoided.

  • Thanks indeed!
    Certainly there can be variations depending on the experience of the developer. I have tried to BOLD a situation that can lead a developer to strange result. This is a real case and I have seen this mistake by some developers during the years on my IT career.
  • Thank Duray!
    I agree! But I intended to show a mistranslation between what a human can mean and what a developer might write in TSQL! This really depends on the experience...
  • "Actually comparisons of a NULL (or a non-NULL value) to a NULL results in UNKNOWN.  Your point is taken though:  The WHERE clause only returns results for rows comparisons that evaluate to TRUE.  Comparisons that evaluate to FALSE or UNKNOWN are dropped from the results."

    Well now we're splitting hairs! 

    Actually UNKNOWN represents a possible result of a logical comparison while NULL represents a missing/unknown value.  There can be some confusion with the terminology, but NULL is actually not the same thing as UNKNOWN.  Part of the confusion pops up because the word "unknown" is so often used in the definition of NULLs.

    As an example (and to add to the confusion) consider the ANSI SQL:1999 standard which added a BOOLEAN data type to SQL.  The BOOLEAN data type can hold one of three possible values:  TRUE, FALSE and UNKNOWN:  SQL three-valued logic and all, so far so good.

    However it also has to be nullable, and to be consistent with the rest of the SQL model, NULL would have to indicate that you don't know whether the BOOLEAN column or variable is TRUE, FALSE or UNKNOWN.  I'll give a quick example.  Before someone points it out, I know this won't actually run on T-SQL... this is "pseudo-SQL" to illustrate what it might look like if ANSI BOOLEANs were implemented in T-SQL:

    DECLARE @x BOOLEAN;

    DECLARE @y BOOLEAN;

    DECLARE @z BOOLEAN;

    SELECT @x = UNKNOWN;     -- UNKNOWN

    SELECT @y = (10 < NULL); -- evaluates to UNKNOWN

    SELECT @z = NULL;        -- NULL value

    According to this example @x is equal to @y (they are both UNKNOWN).  However, is @x equal to @z?  We don't know the value of @z (is it TRUE, FALSE or UNKNOWN?), but we do know that @x is UNKNOWN.  So is UNKNOWN the same as NULL?  According to the SQL logical model the result of this comparison should be UNKNOWN, since we don't know if @z is TRUE, FALSE or UNKNOWN.  According to SQL:1999 the answer could be TRUE since they say to treat UNKNOWN as equivalent to NULL.

    To implement this correctly you actually have to re-work the SQL logical model from three-valued logic to four-valued logic everywhere   The ANSI standard just says to treat UNKNOWNs the same as NULLs, which means you're starting to introduce inconsistent exceptions into the logical model rather than expanding it to accomodate four-valued logic.

    As for the ANSI_NULLS setting, I highly recommend leaving it set to the ON position and using the IS NULL and IS NOT NULL operators (or CASE and COALESCE) to check for NULLs.

  • I didn't test it but is seems this should work and is a bit simpler:

    SELECT s.* FROM Students s

    LEFT JOIN StudentExam se

    ON s.StID=se.StID

    WHERE ISNULL(se.ExamName,'') <> 'SQL Server'

    This way any NULLs would be replaced with '' and therefore would not include 'SQL Server'.

    Eva

  • Thanks Eva,

    All of students appear again! You have replaced the NULL with new value, but the logic is the same!

  • If you are running SQL 2005 add the index below and see what plan the engine picks...

     

    create

    nonclustered index IX_Exam_name on StudentExam(ExamName) include(StID)

    SELECT

    s.* FROM Students s

    JOIN StudentExam se

    ON s.StID=se.StID

    WHERE se.ExamName<>'SQL Server'

     

    Here's the plan with the index for the original != query...

    SELECT s.* FROM Students s   JOIN StudentExam se   ON s.StID=se.StID    WHERE se.ExamName<>'SQL Server'

      |--Nested Loops(Inner Join, OUTER REFERENCES[se].[StID]))

           |--Index Seek(OBJECT[TestingDB].[dbo].[StudentExam].[ix_Exam_name] AS [se]), SEEK[se].[ExamName] < 'SQL Server' OR [se].[ExamName] > 'SQL Server') ORDERED FORWARD)

           |--Clustered Index Seek(OBJECT[TestingDB].[dbo].[Students].[PK__Students__345EC57D] AS ), SEEK.[StID]=[TestingDB].[dbo].[StudentExam].[StID] as [se].[StID]) ORDERED FORWARD)

  • Can anyone explain, in detail, why the outer join solution is faster than the Subquery solution in this example?  Is the Exists method fastest of all?

  • Sorry, let me rephrase the question:

    Can anyone explain, in detail, why the outer join solution is faster than the Subquery solution in this example?  Is the Exists method fastest of all?

    I've examined all three on a limited number of rows and the performance and the plans are identical.  Why do you reckon one will be faster than the next with table growth?

    Correct me if I'm wrong, but when you do a left outer join the left part of the join has a full table scan.  How is this faster than a subquery based on an index compared to another column based upon an index?  Also, I can't see the difference in using EXISTS over a standard subquery as it still is effectively comparing values inside the EXISTS subquery?

    Also the LEFT JOIN thing looks nice, but how would you work it if (as it would seem a more reasonable requirement) you wanted the people who hadn't taken the SQL Server exam BUT HAD TAKEN another?

  • Can anyone explain, in detail, why the outer join solution is faster than the Subquery solution in this example?  Is the Exists method fastest of all?

    It depends...  One could be faster than another in different situations.  It all boils down to which query plan SQL Server decides to use for a given query and a given set of data.  As a side not, on SQL 2000 EXISTS tends to be faster than the IN operator.  The explanation I've seen for this is that the IN operator needs to serialize the entire subquery before it can perform the comparison whereas the EXISTS operator does not.  This behavior may be changed in SQL 2005, although I can't test it myself right now.

    I've examined all three on a limited number of rows and the performance and the plans are identical.  Why do you reckon one will be faster than the next with table growth?

    Basically it's all up to the optimizer.  It takes a lot of factors into consideration, including amount of data, when generating query plans.  Changing the amount of data could change the query plan it chooses.

    Correct me if I'm wrong, but when you do a left outer join the left part of the join has a full table scan.  How is this faster than a subquery based on an index compared to another column based upon an index?  Also, I can't see the difference in using EXISTS over a standard subquery as it still is effectively comparing values inside the EXISTS subquery?

    Not necessarily a "full table scan", if it's properly indexed.  You could end up with an (clustered) index scan, which would be more efficient in most cases.  The rationale against the subquery versus EXISTS might be the same as for EXISTS versus IN.  If SQL Server has to serialize the entire subquery before using it, it could be less efficient than using EXISTS.

Viewing 15 posts - 16 through 30 (of 34 total)

You must be logged in to reply to this topic. Login to reply