CONCAT 1

  • Nick Doyle

    SSC Eights!

    Points: 926

    L' Eomot Inversé (8/22/2013)


    Nick Doyle (8/22/2013)


    I don't agree that it encourages sloppy coding. You need to know how a language works and be a decent coder to not be sloppy. I don't see how providing more user-friendly functions contributes to sloppy coding. It makes non-sloppy coders' lives easier.

    This is much easier/quicker than coalesce or:

    ISNULL(FirstName + ' ', '') + ISNULL(MiddleName + ' ', '') + ISNULL(LastName + ' ', '') + CONVERT(VARCHAR(3), Age)

    I'm definitely happy I no longer have to explicitly convert int to string. 😀

    Having a concat function which requires string arguments so that explicit conversion is not needed in the cases where implicit conversion is possible is indeed a nice feature.

    However, if you want a to write code in accordance with the relational model, you will also want to ensure that NULL can not be interpreted as zero or as a zero length string or as anything else other than "this data is not available, the value is unknown" and it is, as Toreador says, sloppy coding to use NULL for anything else; if there is some data for which absence should be interpreted as a zero length string, then declare it as such: in such a case the table should be declared as

    CREATE TABLE #temp (

    emp_name nvarchar(20) NOT NULL DEFAULT(''),

    emp_middlename nvarchar(20) NOT NULL DEFAULT(''),

    emp_lastname nvarchar(20) NOT NULL DEFAULT(''),

    age int NOT NULL);

    This avoids the misuse of NULL to represent a known value. The introduction of this concat function with its stupid treatment of NULL is just a sop to those who were horrified that SQL Server is about to switch to standard treatment of NULL and either were incapable of thinking straight about how to upgrade their sloppily coded legacy software to cope or too set in their ways and wanted to continue misusing NULL in a sloppy and error prone manner for new development. I firmly believe that MS should not have thrown them that sop, we would have suffered less from buggy software in future if they hadn't.

    edit: I forgot to say "good question, Ron".

    I agree with most of what you're saying, but the world isn't ideal.

    1. Many databases are in existence with sloppy code and misuse of null

    2. People still have to maintain these

    This function makes it easier to work with such databases in many cases, such as ad-hoc querying / analysis where you wont have to take as much time concatenating fields that may be null. It takes away some of the penalty you face due to sloppy coding, even for those who aren't responsible for the sloppy coding to begin with. I don't think this function equates to encouraging sloppy coding, but mitigating it's effects.

    I do think it's good that MS is going to the standard and deprecating CONCAT_NULL_YIELDS_NULL OFF.

  • Dave62

    SSCertifiable

    Points: 6468

    I've been building databases since I started working with PFS File in the 1980's. :w00t:

    It seems like there is a literal meaning for NULL, which is "unknown", and a practical meaning, which is "data is not present in this field".

    If I declare a CreateDate field as: [CreateDate] [datetime] NULL

    When I select from that table and see NULL in some of the columns for the CreateDate field, I don't start thinking "Gee, I wonder what's in those fields because it's unknown". I know perfectly well that there is no data present because I allowed the record to be saved without requiring data in that field.

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    Dave62 (8/22/2013)


    I've been building databases since I started working with PFS File in the 1980's. :w00t:

    It seems like there is a literal meaning for NULL, which is "unknown", and a practical meaning, which is "data is not present in this field".

    This is actually not correct.

    The literal meaning of NULL is not "unknown", it is "data is not present in this field". Or, to quote the definitions section of the ANSI standard for SQL:

    "3.1.1.12 null value

    special value that is used to indicate the absence of any data value"

    (Older versions of the standard used a slightly different wording: "r) null value (null): A special value, or mark, that is used to indicate the absence of any data value". I liked that wording better, because it presents "mark" as an alternative to "value". I don't like how any version of the ANSI standard insists on using the term "null value" instead of just "null" or "null mark", since by its very own definition, NULL denotes the absence of a data value).

    This distinction is very relevant. If you have a table with product prices, including columns ValidFrom and ValidTo, a NULL mark in the ValidTo column does not mean that the end date of that price is unknown. It means it is not applicable, because the price is still valid. (Okay, you could argue that it is not yet known).

    If you have a table with customers and they can be either legal bodies or natural persons, then every row will have a NULL in either the Birthdate column or the FoundingDate column. That does not mean those values are unknown. The birthdate of Microsoft Inc. is unknown because it is not applicable - companies have no birthdates. Likewise, the FoundingDate of Hugo Kornelis is NULL, not because it's unknown but because I was born, not founded. (Ohh, this is SO going to attract witty remarks...)

    Finally, in the table that represents my friends, the birthdate for Jane is NULL - and in this case, it is because it IS indeed unknown. I am pretty sure that Jane has been born, and that must have happened on some date - sho just close not to tell me her age, so I am stuck with not knowing her birthdate,

    For each of these examples, when you look at the data in the table, you know what the NULL represents - because you know the context. But taken out of the context, just by itself, a NULL can mean any of those things, and a whole bunch of other stuff. And unless you like unwarranted speculation, you best steer clear of all those possible interpretations and stick to the one thing you do know - that NULL represents the absence of any data value.

    Note that the above argument, about different things that can be represented by NULL, has been used in a famous argument ("much ado about nothing") between Date and Codd. Date, well known for his aversion of NULLs, used this argument to convince Codd that there should be not one but at least two types of NULLs, and when Codd admitted to that Date went on to prove that this resulted in the need to upgrade three-valued logic to four-valued logic, with all the accompanying consequences. There have then also been publications that went on to expose even more types of NULLs - I believe one even went on to 17 types! All this was intended to, eventually, show that avoiding NULL is the only sensible way to do.

    But what Codd apparently never got (and I'm not sure about Date) is that the whole argument is based on a fallacy. To show this, let's replace NULL with 42. When presented without context, you can make all kind of assumptiions about this number. It can represent all kinds of stuff. Maybe it's a counter (42 children in a school bus). Or a measurement of elapsed time (42 minutes spent on typing this reply - okay, I'll admit that it's actually less). It could be just a meaningless number chosen to represent something else (product code 14 - for a small package of grey staples). Or it could be the answer to a very important question. So do we need to implement different kinds of 42 in the database, just to be able to distinguish all these possible interpretations? No, of course not! When you look at 42 in its context, you will know the meaning from that same context. If you see 42 in the ProductNumber column of your OrderLine table, you know that it represents that box of staples, not a time measurement. And when you then see that same number 42 in the NumberOrdered column of the same table, you know that it's a counter of the number of those boxes some customer is ordering. In the context, the meaning of any "42", like the meaning of any "NULL" is clear. Without context, the only safe thing is to reduce it to the minimum guaranteed meaning - for 42, that is "the decimal number exactly between 41 and 43"; for NULL, that is "the absence of any data value".

    But what if we have a table where a NULL can represent multiple things? For instance, the customers table can have NULL in the birthdate column for Microsoft Inc (because that customer was founded, not born) as well as for Ms Davies (because she didn't want to disclose her age) - surely, that would be a great example for why Date was right and we do need multiple types of NULL, right? Again, no. Wrong. And again, I'll use 42 to prove it. Let's imagine a table with data about clothing - women's blouses and sweaters to be precise. One of the columns is called "Size", and I see the value "42" in one of the rows. What does it tell me? Yup, size 42 - but there are three different systems for sizes of women's blouses and sweaters (EU, UK, and US); all of them include a size "42" - and all of them represent a different size. (See http://en.wikipedia.org/wiki/Clothing_sizes). Now there are two possibilities. Either the company using the data knows what they mean with this 42. Maybe because they are strictly UK based and hence use only UK clothing sizes. In that case, it is clear what this 42 means. Just as the NULL in the original example, where only companies could have a NULL birthdate. And then there is the other possibility - that the company does use different size systems interchangably. Now the value "42" in this column is suddenly a lot less informative. like the NULL in the birthdate column that was used for both Microsoft and Ms Davies. Problem? Well ... maybe. It depends.

    If the company using this table is a transport company, and the only reason for storing this information is to have a double check that the information on the box that is delivered matches the information on the delivery report, they are still good. They don;t care about the actual size of the blouse that's in the box, they just want to make sure thaht Ms Davies gets the correct package. So for this case, where the distinction between size 42 (EU) or size 42 (US) is irrelevant for the owner of the data, just "42" will still do. The same might be the case for the NULL (n/a) or NULL (unknown) example - neither customer will get a birthday present, and that's the only thing I use the column for.

    This leaves us with just one possible case. The company DOES care about the actual size of the blouse. They do NOT want to recommend a size 42 (EU) blouse to a customer who ordered a size 42 (US) blouse. So ... did we now finally find a case for needing several different types of 42 in the database? No ... we still didn't. We might want to feel a deep desire to store 42 (EU) or 42 (US) in the same column, and we might want to do that with two different types of 42 - but if we did, we would be violating First Normal Form. One of the requirements of this most basic normal form is that the domain of each attribute has to be atomic. A domain that includes both 42 (US) and 42 (EU) is not atomic. The attribute should be split. Or, in database terms, we should use two columns instead of one - a column to represent the size of a blouse (42), and a second column to represent the size system used for this particular blouse (EU).

    Cycling back to NULL - if the company owning the data has an interest in knowing why the birthdate for a specific customer is NULL (because, apparently, they want the response of their system to an unknown birthdate to be different from the response to a not applicable birthdate), they should indeed ensure that this is stored in the database. But not by inventing two tpes of NULL - that would violate First Normal Form. Instead, they should create a second column - one that is only populated for rows with a NULL birthdate, and that contains an atomic value representing the reason why there is no data value in the birthdate column.

    Oh boy, I guess I got sidetracked a little. Sorry for that. I'll wrap up now - not only because, by now, the time spent typing this IS indeed approaching the 42 minute mark, but also because my cats are hungry, and demanding to be fed. And when the cats demand, I obey, :hehe:


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Dave62

    SSCertifiable

    Points: 6468

    Thanks Hugo. I'm glad to see that my practical use of NULL is the same as the ANSI standard.

    Apparently, there are some bright minds out there who are in the Date camp and feel very strongly that NULL means unknown or any number of things. Even though, in reality, it is very well known that it simply means no data is present.

    I can see why this creates confusion with concatenation. People in the "unknown" camp will have difficulty concatenating an unknown value with a known value. In contrast, it should be very easy to concatenate "no data present" with any known value. The result is simply the known value.

  • TomThomson

    SSC Guru

    Points: 104772

    Dave62 (8/22/2013)


    I've been building databases since I started working with PFS File in the 1980's. :w00t:

    It seems like there is a literal meaning for NULL, which is "unknown", and a practical meaning, which is "data is not present in this field".

    If I declare a CreateDate field as: [CreateDate] [datetime] NULL

    When I select from that table and see NULL in some of the columns for the CreateDate field, I don't start thinking "Gee, I wonder what's in those fields because it's unknown". I know perfectly well that there is no data present because I allowed the record to be saved without requiring data in that field.

    I think anyone is putting a somewhat weird interpretation on "unknown" if they imagine it's possible to ask what value is in a field that is unknown; if a value is unknown, it can't be in the database; if a field doesn't have a value in it. then from the point of view of the database the value for that attribute of the entity represented by the row containing the empty field is not know, so it is unknown. So unknown and not present amount to the same thing. I can see from your comment that you understand that; I've also observed over the years than an awfully large number of people don't.

    Of course you may want to know why it's unknown: that's easy, roughly as Hugo pointed out: put the reason in a separate column. The data is still absent; the value is still not known; but now you know why it is absent and not known.

    Hugo's example with birth date and foundation date is fair enough in one sense, but not in another sense; if we look at "legal person" and "physical person" they have far more attribute differences than just "birth date" versus "foundation date", so rather than having hordes of NULLable columns you may save space and make life easier by having a person table and two extra tables, one for the legal person only attributes and another for the physical person only attributes. Of course even if you do that, you may not have birth date in one of those and foundation date in the other, but a single column (called something like start date or inception date or beginning date) in the person table. That doesn't mean that what he's saying is wrong, just that he's perhaps chosen something which isn't the best example.

    Tom

  • SQLWinther

    SSCertifiable

    Points: 5886

    +1 Thank you for this easy one. I've been educating people about the news in SQL server 2012 and I really like this function. Not becase the possibility to concat null values, but to be able to concatenate strings with numeric values. Of course, concatenating strings are often used for presentation and that should be handled in the presentation layer and not in the database.

    /Håkan Winther
    MCITP:Database Developer 2008
    MCTS: SQL Server 2008, Implementation and Maintenance
    MCSE: Data Platform

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    Dave62 (8/22/2013)


    I can see why this creates confusion with concatenation. People in the "unknown" camp will have difficulty concatenating an unknown value with a known value. In contrast, it should be very easy to concatenate "no data present" with any known value. The result is simply the known value.

    I am sorry, but I (again) have to disagree.

    Let's use an analogy. You know my first name: Hugo. You also know my last name: Kornelis. Now imagine the next wanna-be Don Corleone breaking into your house, pointing the biggest gun you've ever seen to your forehead, and daring you to utter my full name - get it wrong, he pulls the trigger. What would you do? (Apart from breaking down and crying, and probably wetting youyself, that is) I imagine you'd say that you can do that, if he is willing to first disclose my middle name (if any). His response is silence. After a while, you gather all your courage and asks again - and now Mr Wannabe Corleone tells you that he heard the question, but deliberately refused to answer. (See what I did there? His not answering your question about my middle name represents NULL - you asked for a data value (my middle name), you got nothing (the absence of a data value - NULL).

    The big guy is still standing there, gun still pointed at your head. You're trying to think of a plan. Than, just as the guy clears his throat and repeats the question about my full name, the proverbial lightbulb above your head switches on. If he can do it, then so can you, right? So you, politely (of course - he's got a gun) inform the man that you heard the question - but deliberately refuse to answer. He pockets his guns, pats you on the shoulder (you cringe - even with the gun gone, he's still at least twice as big and thrice as strong as you), smiles and compliments you: "smart move - I never demanded that you HAD to answer". And then he leaves your room.

    Congratulations! You not only escaped a life-threatening situation that is so absurd that even the Monty Pythons would have ditched the idea, you also did what any proper DBMS should do when asked to concatenate a string to a missing value. Analog to the "garbage in garbage out" principle, you used the "nothing in, nothing out" principle. Also known as NULL propagation. And the reason that this works is that in a database, NULL is a valid "value" (I still prefer "marker") for every data type. So whatever the datatype of the return expression is, the database always has the option of not supplying an answer, if an input that it needs to ensure the answer is valid is missing. The database should not risk returning a value that could be incorrect - and since any assumption about missing values can be incorrect, the database should not return a value when the expression involves a NULL input.

    So where does all the confusion with "unknown" come from? (To my surprise, I see that even Tom, a man I normally agree with on most points, is confused by this. Understandable!)

    Let's stretch the already weird analogy even further. Next day, Mr Corleone returns. Again with that impressive looking and probably extremely dangerous gun. But with a different question. "Quick", he bellows, "you need to tell me if Hugo has 'John' as his middle name. You know the drill, answer wrong and I'll pull this trigger. And don't try that trick you pulled yesterday, this time I do want an answer - no answer, I also pull the trigger". Ouch! This gets scary. While spoiling yet another perfect pair of trousers, you quickly go through your options. You can take a gamble. Statistically, the chance of me, or anyone, having 'John' as middle name is below 50% - so the answer "no" gives you a better chance of surviving than the answer "yes". But it's your life at stake. I think 10% chance of losing is too much when gambling a fistful of dollars, for my life, I need more, much more margin than that!

    You keep pondering your options until Mr Big gets impatient and reminds you that no answer means death as well. Then, you break down.You cry like a baby, and between your tears you whimper "please, please, please don't shoot me. please spare me. You're asking the wrong man. I don't know. That Dutch idiot never told me his middle name, you didn't tell me, how can I know if it's John? I don't know, okay? Please don't shoot me!".

    He grimaces, pockets his gun, and pats your shoulder (you notice that you start to get used to this part). "You did it again. You gave me an answer. I never told you that only 'yes' or 'no' are valid answers. 'I don't know' is just as valid - and absolutely right. Congratulations, you live. Now go take a shower and find some clean clothes, you stink".

    This analogy represents a database evaulating a query with a WHERE clause. The clause here being WHERE MiddleName = 'John'. When coming across my data, the database does not have the data it needs to evaluate the question. But unlike before (when the problem was in the SELECT clause), the database now does not have the option to simply not answer at all. The query processor needs to decide whether or not to include this row, and it can't simply sit there and wait for someone to come tell it my middle name. The only options appear to be yes (true) or no (false). An incorrect answer is not an option - lifes could be at stake (unlikely with a query on middle name, much more likely with a query on, e.g., allergies - but work with me, okay?). What should the query processor do? And then -bingo!- you came along and supplied (with a bit of help from my maffia friend) the solution. A third option - "I don't know" - aka the truth value "unknown". Yes, there it is, finally, the U-word!

    Instead of the traditional boolean logic (based on the two truth values true and false), we now have defined a logic system that uses a third truth value - and is therefor called three-valued logic (3VL). In 3VL, the three possible truth values are true, unknown, and false. The logic operators AND, OR, and NOT operate as one would intuitively expect (e.g. true AND unknown yields unknown, but true OR unknown yields true - if I ask you "is my first name Hugo AND my middle name John", you wouldn't know, but if I ask you "is my first name Hugo OR my middle name John", you'd say yes).

    There's only one thing left to do. The poor query processor is still facing a problem. Okay, it now did get an answer when trying to evaluate "WHERE MiddleName = 'John'" - but that answer is unknown, so should the row be returned or not? To handle that issue, the designers of the relational model simply made a judgement call. They agreed among each other that rows would only be returned if the predicate evaluates to true; both [false and unknown would result in the row being rejected. They could just as well have decided to handle unknown differently (i.e., to include it in the results_ - but they didn't. Deal with it.

    So, bottom line:

    * NULL in a column means "no data value here". It does not mean "Unknown". In a specific context, it could mean "unknown" (just as in a specific context, 42 could mean "just large enough for only the smallest women") - but that is context. NULL itself means only "no data value here".

    * When using NULL in expressions (other than those specifically designed to handle NULLs), the result has to be NULL - you don't want to utter "Hugo Kornelis" as my full name if that could result in a bullet being inserted in your brain in case it's incorrect.

    * When using NULL in predicates (comparisons) (other than those specifically designed to handle NULLs), the result of the predicate is not a traditional boolean value, but the three-valued truth value ]unknown.

    * Please, please, please do not confuse the evaluation of a condition on a NULL value to unknown with the value itself being or representing "unknown". If you do, I'll have no other option but to send that very tall (and very armed) Corleone type your way yet another time (we still have your address). That would be the third time - and as we all know: third time, you're out.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Dave62

    SSCertifiable

    Points: 6468

    That's ok Hugo. I don't mind if you disagree because that was a very entertaining story. :laugh:

    But apparently whoever added the new CONCAT function to SQL Server 2012 is in agreement with my interpretation because the result is the known values.

    Enjoy!

  • TomThomson

    SSC Guru

    Points: 104772

    Hugo Kornelis (8/22/2013)


    So where does all the confusion with "unknown" come from? (To my surprise, I see that even Tom, a man I normally agree with on most points, is confused by this. Understandable!)

    I'm not at all confused by this, but I have the impression that you are.

    Let's have a database of the truth values of certain propositions at about midday each day for which a test is scheduled for the corresponding proposition. One such proposition may be "XYZ Ordinary Shares have a higher value in the London stock market than ABC Ordinary Shares"; lets call that proposition "Prop1"; we have a table created like this:

    CREATE TABLE Props(

    Propname varchar(16) NOT NULL,

    Testdate date NOT NULL,

    tv varchar(8) NULL check(tv IS NULL or TV in ('True', 'False' ,'Unknown'),

    constraint PK_Props primary key (Testdate, Propname)

    );

    Now if we run the query

    SELECT tv FROM Props WHERE Propname = 'Prop1' and date = 20130821'';

    there are five things it can do, three of which involve returning a value from the domain varchar(0) NOT NULL check(tv IS NULL or TV in ('True', 'False' ,'Unknown'), the fourth returns the additional value in the extended domain varchar(8) NULL check(tv IS NULL or TV in ('True', 'False' ,'Unknown'), and the fifth returns nothing at all.

    1) return 'True'

    2) return 'False'

    3) return 'Unknown' because when that proposition was tested on that date one (or both) one (or both) of the values was not present in the database

    4) return NULL because the record with that primary key has NULL in the tv column, perhaps because the test was not, for some reason, carried out that day although it was scheduled.

    5) return nothing because no row has the primary key specified in the where clause (that proposition was not scheduled to be tested that day).

    In case 4, the database does not know what the truth value is; it's not known. However, it is NOT 'Unknown'.

    Your exposition further on in the message containing the above quotation claims to prove that NULL can't mean unknown because it isn't the truth value "Unknown". That is just a confusion of two very different concepts, an unjustifiable conflation of data with metadata. The truth value unknown is not a possible value of any attribute of an entity unless the domain of that attribute includes the truth values of a non-classical logic which includes unknown; most domains used in databases don't include those values, so it is very unusual for that truth value to be a possible result and even when it is possible NULL does not indicate that truth value. However, it is certainly possible that the real world value of some attribute is not known, and this is indicated by marking its database representation NULL. And if a value in the database is marked NULL, it is quite clear that so far as the database is concerned the value of the real world attribute isn't known - and "it isn't known" is just another way of saying "it is unknown".

    I suspect that your attempt to say that a NULL mark doesn't mean the value is unknown arises from a desire to stop people confusing the truth value unknown with the meaning of the NULL mark - but unless the domain of the value includes that truth value, nothing in the domain can be that truth value; the database representation of an attribute as NULL certainly does mean that so far as the database is concerned we don't know what value the attribute has, because we haven't recorded it. I believe it would be easier to stop people confusing the truth value and the significance of the mark if people stopped claiming that the mark doesn't imply a lack of knowledge (in the database) of the value of the real-world attribute that it marks, and explained instead the difference between the truth value's significance and the mark's significance without making such claims.

    Incidentally, I reckon Date's "proof" that going to 2 sorts of NULL forced you to 3, 4 and so on ad infinitum was complete hogwash, so that appears to be something else we disagree on; but I think Codd was wrong in advocating two different NULLs in some of his papers, so that is something we agree on.

    Tom

  • tilew-948340

    Hall of Fame

    Points: 3431

    Nice reading you guys!

    Hugo, you should write a book!

    Thanks Ron, that was a nice and an entertaining question!

  • whereisSQL?

    SSCarpal Tunnel

    Points: 4898

    Love the story.. I never knew database concepts could be so entertaining..

    Have you written any books yet? I would love to read it!

    edit - lol you beat me to it

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    L' Eomot Inversé (8/22/2013)


    Hugo Kornelis (8/22/2013)


    So where does all the confusion with "unknown" come from? (To my surprise, I see that even Tom, a man I normally agree with on most points, is confused by this. Understandable!)

    I'm not at all confused by this, but I have the impression that you are.

    Now why am I not surprised? 😉

    Thanks for your well thought out reply and the interesting example. I'll respond to a few points.

    (...)

    4) return NULL because the record with that primary key has NULL in the tv column, perhaps because the test was not, for some reason, carried out that day although it was scheduled.

    (...)

    In case 4, the database does not know what the truth value is; it's not known. However, it is NOT 'Unknown'.

    That's three statements in a single line, and they are true, false, and true. 🙂

    - "the database does not know what the truth value is" - true, the database does not know what it is, because it is missing. I prefer to stress the "missing" part and leave the "the database doesn't know part" to the imagination of the reader, because that tends to cause confusion.

    - "it's not known" - false, you can reword "the database does not know what the truth value is" to "it's not known to the database", but you cannot simplfy that to "it's not known". It is missing, and I have no idea why. I could speculate about the reason why it's missing - but any such speculation would include knowledge of the context where this NULL is found. I'll expand on this "context" idea later.

    - "However, it is NOT 'Unknown'." - obviously true, and I don't think anyone here needs an explanation.

    Your exposition further on in the message containing the above quotation claims to prove that NULL can't mean unknown because it isn't the truth value "Unknown".

    No, you are misunderstanding me. I never said that NULL can't mean unknown. This is where that "context" thing comes in.

    Would you say that the number 42 can't mean "just the right size for large women"? I hope not, because you would be wrong - it can mean exactly that - within a specific context (US standard confection sizes for women's blouses). But without the context, you do not know whether that specific 42 does actually mean that. It might, or it might mean something completely else. So outside of all context, you would not say "42 means 'just the right size for large women'", nor "42 can't mean 'just the right size for large women'"; you would say "42 might mean 'just the right size for large women', or it might mean something else, depending on the context where it's used".

    So I don't say that NULL can't mean unknown. I only say that NULL might mean unknown, or it might mean something else, depending on the context where it's used.

    However, it is certainly possible that the real world value of some attribute is not known, and this is indicated by marking its database representation NULL.

    Oh yeah, definitely. Just as it is also certainly possible that there is an attribute is not applicable, and this is indicated by NULL (birthdate of a company). Or that the real world value of an attribute is "indefinitely", and this in indicated by NULL (ValidTo date of a price). Or that the real world value is "no matching row found", and this is indicated by a NULL (result of an outer join). Or that the real world value is "you have no access to this data", and this is indicated by a NULL (possible result of some types of row-level security implementation). NULL can mean all those things, and many more.

    And if a value in the database is marked NULL, it is quite clear that so far as the database is concerned the value of the real world attribute isn't known - and "it isn't known" is just another way of saying "it is unknown".

    Now you're doing the same as you did in the start - simplyfying "it isn't known to the database" to "it isn't known". You are far from alone in that confusion (it is, in fact, one of the two major reasons why so many people insist that NULL means unknown; the confusion with the truth value unknown being the other major reason).

    The value is missing. There is no data value at that point in the database. So yeah, you are right that if the database has to do something with that value, it doesn't know what value to use - so I'll accept "unknown to the database" - even though that formulation is risky, because it tends to lead to confusion. But simplifying that to just "unknown" is NOT correct - that is exactly the confusion I am so wary of.

    Now one important point I do need to add is that all the queries we write operate in a context. So the context I so vehemently stripped off in this message is very much present in the queries, stored procedures, and other database code. That is true for 42, and it is true for NULL. If there are two orders for 42 units of the same article, our query logic will add them up and request the manufacturer to send us 84 units. But if there are two orders for a size 42 shirt, the query will NOT add those two numbers together and send a request for a size 84 shirt to the factory. When writing the query, we know that the 42 in one column is a quantity that can be added, and the 42 in the other quantity is a confection size unit that does not lend itself to arithmetic.

    And again, it's the same for NULL. When I write "WHERE ValidFrom >= @OrderDate AND (ValidTo <= @OrderDate OR ValidTo IS NULL)", I'm not querying for prizes with an unknown ValidTo date, but for prizes that are "indefinitely" valid. When I wrote "FROM OrderDetails AS od LEFT OUTER JOIN Orders AS o ON o.OrderID = od.OrderID WHERE o.OrderID IS NULL", I am not trying to find details of orders with an unknown OrderID value; I'm looking for orphaned OrderDetails (though I would in reality code that in a different way - just sayin'). And when I write "WHERE is_smoker IS NULL", I am indeed looking for people for whom we do not know whether they smoke or not. All because I know the context that applies to the NULL values in the rows and columns I am querying.

    The context-less meaning of NULL (or of 42) is relevant for how the database handles such values. The result of an expression should always be the same, regardless of context. When I add two values of 42, the result will always be 84 - even when I am foolish enough to do this to values from the "ShirtSize" column. The database engine does not know the context and provides a well-defined treatment of the value 42 stripped of any context - a treatment that works well in most contexts, but might need special treatment in specific contexts.

    Similarly, when I concatenate 'Hugo' and NULL, the result will always be NULL - even when I am foolish enough to do this in a context where I know that NULL means 'is a nutter'. The database engine does not know the context and provides a well-defined treatment of the NULL marker stripped of any context - a treatment that works well in most contexts, but might need special treatment in specific contexts.

    I suspect that your attempt to say that a NULL mark doesn't mean the value is unknown arises from a desire to stop people confusing the truth value unknown with the meaning of the NULL mark

    No, it's way more than just that. The false notion that NULL means unknown simply causes too much trouble. I've seen people invent their own methods to handle missing values that were not unknown but missing for another reason, because they are convinced that NULL means unknown. I have seen people expect logic on NULL to return values that would be appropriate if NULL actually means unknown, and then complain that NULLs are not handled correctly and should be avoided. And I've seen people that simply struggle with how NULLs are handled and how 3VL works. I am pretty sure that Dave will no longer suffer from that - he only needs to think back of that night when my Uncle Luigi (sorry, Italian readers - don't blame me for the stereotype, blame Hollywood) disrupted his peace twice in a single evening and he'll immediately see the logic of how NULL is handled. (Well, mostly - not all NULL handling is logical, for some situations the designers of the SQL standard simply had to make a choice and no logic could favor one choice over the other).

    Incidentally, I reckon Date's "proof" that going to 2 sorts of NULL forced you to 3, 4 and so on ad infinitum was complete hogwash, so that appears to be something else we disagree on; but I think Codd was wrong in advocating two different NULLs in some of his papers, so that is something we agree on.

    Just for the record, I may not have called Date's argument "hogwash", but I did call it a fallacy - so apart from you using a stronger word than me, we do not disagree on that. And as far as I know, the only reason Codd embraced the notion of two types of NULL is because he fell for Date's fallacy. I am convinced that if he had lived longer, he would eventually have seen the fallacy and retracted his writing that include the two NULL types. SO we agree on that too.

    (Oh, one more thing - the example af the start of your post painfully reminded me of the fact that the current version of the ANSI standard for SQL actually defines a data type for truth values. They call it boolean (shudder!), and even explicitly specify that the truth value "unknown" and the situation where a truth value is missing should both be represented by the null mark (shudder even more!). I am sure that there are situations where this can be used in a very practical way (just as NULL_CONCAT_YIELDS_NULL can be practical in some situations) - but from the point of relational theory, it just makes me want to cry).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    Dave62 (8/22/2013)


    But apparently whoever added the new CONCAT function to SQL Server 2012 is in agreement with my interpretation because the result is the known values.

    I think it's more like one Microsoft employee ripping out CONCAT_NULL_YIELDS_NULL because it violates all relational theory, and the guy at the next desk putting in CONCAT because it's easier for some cases.

    I've had to work with counters that corresponded to C# zero-based arrays. So "3" represents four elements, and if you combine two of those arrays, you have "7" (representing eight) elements. For that specific use case, it would have been practical if 3 + 3 would evaluate to 7. If someone would redefine addition to work that way, it would save me the effort of having to explicitly write "+1" to my additions. Okay, it may fly in the face of all mathematical theory - but it is SO PRACTICAL (in that specific situation).

    Does the request to redefine addition this way sound absurd to you? Well, the request to redefine concatenation of NULL values is exactly like that.

    tilew-948340 (8/22/2013)


    Hugo, you should write a book!

    Already done. Okay, not a whole book. But I did contribute a chapter for Adam Machanics "Expert SQL Server 2005 Development", two chapters for "SQL Server MVP Deep Dives, volume 1", and one chapter for "SQL Server MVP Deep Dives volume 2".

    (I guess I should invest the time to add links to these books on Amazon.com, but I should really start on my work now")


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • aahkam

    SSC Enthusiast

    Points: 107

    I was caught by thinking CONCAT_NULL_YIELDS_NULL would impact the CONCAT and cast my answer, apparently it is not impacted.

  • TomThomson

    SSC Guru

    Points: 104772

    Hugo Kornelis (8/23/2013)


    Now you're doing the same as you did in the start - simplyfying "it isn't known to the database" to "it isn't known". You are far from alone in that confusion

    ....

    ....

    And when I write "WHERE is_smoker IS NULL", I am indeed looking for people for whom we do not know whether they smoke or not.

    And there you fall into exactly the simplification that I use. So if it is indeed a confusion, I have you for company.

    It's all very well to talk about the context; but often when you come to look at some SQL you don't have that context - you just have the code, including if you are lucky the scripts that generate the schema and all its constraints, and of course you have the data. Given the general behaviour of developers, if you have any documentation of the code and/or the schema, the documentation will describe a version that never worked, in fact never even succeeded in its unit tests, not the version that finally emerged from development through QA, so any context the documentation gives you is as likely to mislead as to guide you to the correct conclusion. Yes, people like you and me will often provide decent documentation; but the people who maintain and enhance our code probably won't. So the only reasonable view of the SQL language is that when we say "this is unknown" or "that is known" we are talking about the database content, not about something that we individually know because that's the way we designed it - except in the rare case that the system we are looking at is our own design and has not been altered so that the context we believe in is no longer the actual context in which the database and the data in it exist.

    Yes, in an ideal world we would always know the context - for example table definitions with nullable columns would be commented to explain why the column is nullable, table definitions for child or auxiliary tables would be commented to say why a key in the parent or main table would not be referenced from any row in the child or auxiliary, and we know exactly the significance of every null. But in the real world all we can safely say is that NULL means the data isn't present, so ithe value of the data is unknown in the database, and lacking the non-available higher level context to say why the value is unknown in the database it's reasonable when talking about the database and the data in it, since when talking about "something" we are talking in the context of that something, to cut the excess verbiage and just say "the value is unknown".

    edit: I think I should also comment on this:

    (Oh, one more thing - the example af the start of your post painfully reminded me of the fact that the current version of the ANSI standard for SQL actually defines a data type for truth values. They call it boolean (shudder!), and even explicitly specify that the truth value "unknown" and the situation where a truth value is missing should both be represented by the null mark (shudder even more!).

    I agree with your first shudder. Boolean logic is not 3-valued, so calling this type boolean is just plain silly. Your "shudder even more" I can't agree with - what else should one do when the value of a so-called Boolean is not known in the database? Apart from the name, there's nothing wrong with this type.

    Of course the name misuse just followed a long tradition of screwing up; back in the early days, there were some Fortran compilers which provided a bool type with 64 possible values, and other Fortran compilers that provided the same type with 256 values; you probably know that these were single character values on machines with 6 bit chararacters or 8 bit characters; usually the Boolean operators operated bitwise, so that an array of bools was a viable way to represent a string that you wanted to manipulate as opposed just to have a literal string that you wanted to output (the only viable way, in those early versions of Fortran). But the logical if statement only recognised two vales, and some compilers had 0 for false and non-zero for true, at least one had 255 for true and any other value for false, in another even numbers were false and off true - a complete nightmare. ANS Fortran IV cleared up that mess, but got much else wrong.

    Tom

Viewing 15 posts - 16 through 30 (of 39 total)

You must be logged in to reply to this topic. Login to reply