String Length

  • SQL Kiwi (7/26/2012)


    L' Eomot Inversé (7/25/2012)


    Hugo Kornelis (7/11/2012)


    In my opinion, it is LEN() that has the strange behaviour here, because it returns the length of the padded version of the string instead of the length of the string itself. This almost caused me to give the wrong answer - but I recalled just in time the number of times I have already been bitten by this strange behaviour of LEN().

    I don't understand that, Hugo. LEN isn't doing anything with padding except ignoring it. It's the concatenate operation that turns the padding from padding to not padding by putting non-space characters beyond the padding.

    +1 I'm not sure what Hugo meant here either, but I agree with you Tom.

    I would have expected LEN() to return the length of the string. The fact that it ignores whitespace, but only on one end of the string, seems counter-intuitive to it's intended purpose. RTRIM() exists to remove trailing white space. Why would LEN() implicitly provide this functionality as well?

  • sestell1 (7/27/2012)


    SQL Kiwi (7/26/2012)


    L' Eomot Inversé (7/25/2012)


    Hugo Kornelis (7/11/2012)


    In my opinion, it is LEN() that has the strange behaviour here, because it returns the length of the padded version of the string instead of the length of the string itself. This almost caused me to give the wrong answer - but I recalled just in time the number of times I have already been bitten by this strange behaviour of LEN().

    I don't understand that, Hugo. LEN isn't doing anything with padding except ignoring it. It's the concatenate operation that turns the padding from padding to not padding by putting non-space characters beyond the padding.

    +1 I'm not sure what Hugo meant here either, but I agree with you Tom.

    I would have expected LEN() to return the length of the string. The fact that it ignores whitespace, but only on one end of the string, seems counter-intuitive to it's intended purpose. RTRIM() exists to remove trailing white space. Why would LEN() implicitly provide this functionality as well?

    LEN is just ensuring that two equal strings have equal length: two strings which are the same apart from the number of trailing spaces compare equal in SQL. It would be crazy for LEN to deliver unequal lengths for two equal strings. Of course you may think that the SQL version of string equality is a bit bizarre, but don't blame LEN for that.

    Tom

  • L' Eomot Inversé (7/27/2012)


    sestell1 (7/27/2012)


    SQL Kiwi (7/26/2012)


    L' Eomot Inversé (7/25/2012)


    Hugo Kornelis (7/11/2012)


    In my opinion, it is LEN() that has the strange behaviour here, because it returns the length of the padded version of the string instead of the length of the string itself. This almost caused me to give the wrong answer - but I recalled just in time the number of times I have already been bitten by this strange behaviour of LEN().

    I don't understand that, Hugo. LEN isn't doing anything with padding except ignoring it. It's the concatenate operation that turns the padding from padding to not padding by putting non-space characters beyond the padding.

    +1 I'm not sure what Hugo meant here either, but I agree with you Tom.

    I would have expected LEN() to return the length of the string. The fact that it ignores whitespace, but only on one end of the string, seems counter-intuitive to it's intended purpose. RTRIM() exists to remove trailing white space. Why would LEN() implicitly provide this functionality as well?

    LEN is just ensuring that two equal strings have equal length: two strings which are the same apart from the number of trailing spaces compare equal in SQL. It would be crazy for LEN to deliver unequal lengths for two equal strings. Of course you may think that the SQL version of string equality is a bit bizarre, but don't blame LEN for that.

    Ah, but LEN() is not comparing strings, it's supposedly returning an individual string's length... except it doesn't. It rtrims the string first, then returns the length of the result. In my opinion, the behavior of string comparitors in regard to padding shouldn't have anything to do with the LEN() function returning the actual length of a string, regardless of the characters the string is composed of.

    Interestingly when two strings are compared (except with LIKE), the strings are padded out to the same length before comparing, not trimmed.

  • sestell1 (7/27/2012)


    Ah, but LEN() is not comparing strings, it's supposedly returning an individual string's length... except it doesn't. It rtrims the string first, then returns the length of the result. In my opinion, the behavior of string comparitors in regard to padding shouldn't have anything to do with the LEN() function returning the actual length of a string, regardless of the characters the string is composed of.

    Well, you could take that view. Or you can get mathematical about it and say that if the predicate "=" is defined on a type then any function on that type must return equal values when applied to equal arguments, and LEN is a function. Or you could go back to the original concepts that the varchar type is trying to model - a written (or printed) string: if you have a sheet of paper (let's say A4 size, portrait orientation) with "ABCDEFG" printed in 12 point Roman type with the bottom left point of the "A" placed three inches from the top left corner on the line 60 degrees below the to edge and you look at that piece of paper, and another piece of paper, same size and shape, with the string "ABCDEFG " printed on it in the same typeface with the leftmost point of the A in the same position on the page, do you have two pieces of paper with same string on them or two pieces of paper with different strings on them? Most people, if shown those two pieces of paper, will say they both have the same string.

    Interestingly when two strings are compared (except with LIKE), the strings are padded out to the same length before comparing, not trimmed.

    Do you think that's a description of what actually happens in the implementation of some particular langiage (perhaps T-SQL) instead of a simple way of describing the result of something rather different that is really what actually happens and achieves the same end result (except that it performs reasonably)? I suspect that if you look at what goes on in the engine you'll find that when T-SQL wants to compare the varchar(max) value "A" with datalength 1 (datalength is a function on storage objects representing strings, not on strings) and len 1 (len is the function on strings) with a varchar(max) value "B" with datalength 1500000001 and len 1 it doesn't actually add one and a half billion characters to the first string and then compare the result with the second string. I haven't looked at the internals myself, so I could be wrong; but if so I might be tempted to offer Microsoft my services as a guide to the production of compilation/interpretation systems for semi-declarative languages! 😀

    Tom

  • L' Eomot Inversé (7/27/2012)


    Or you can get mathematical about it and say that if the predicate "=" is defined on a type then any function on that type must return equal values when applied to equal arguments, and LEN is a function.

    From a mathmatical standpoint, if <string 1> = <string 2>, then the <string 1> + <string 3> = <string 2> + <string 3> should also be true. In the case of a string with trailing spaces however, this is not the case. If you are saying all functions on a string should preserve equality, then SQL Server should also trim all strings before concatenating them.

    Other functions that don't seem to preserve equality with trailing spaces:

    Replace(<string 1>, " ","x") = Replace(<string 2>, " ","x")

    Reverse(<string 1>) = Reverse(<string 2>)

    DataLength(<string 1>) = DataLength(<string 2>)

    It's interesting you should bring up the paper example, as this is exactly why I think LEN() should return the actual length of the string... because whitespace is difficult to observe without additional context. Having LEN() implicitly RTRIM makes whitespace all that much more difficult to detect.

  • Honestly I would guess that LEN works the way it does because the vast majority of the time you don't want to have to worry about the trailing spaces. I know when I use it I usually don't want to see the extra spaces. Probably on the order of 90% of the time. Now that doesn't mean I don't get bitten every now and again on the other 10% but I would rather that than get bitten on the 90%.

    As an example if I have a field ZIPCODE varchar(10) and a row with '12345 ' and another table with '98765' and I'm trying to find any zipcode with more than 5 characters I really don't want to have to worry about the trailing spaces. I could use LTRIM, but I would be adding LTRIM far more frequently than I would be dealing with figuring out if I have an extra space at the end or not.

    Kenneth

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • L' Eomot Inversé (7/25/2012)


    Hugo Kornelis (7/11/2012)


    In my opinion, it is LEN() that has the strange behaviour here, because it returns the length of the padded version of the string instead of the length of the string itself. This almost caused me to give the wrong answer - but I recalled just in time the number of times I have already been bitten by this strange behaviour of LEN().

    I don't understand that, Hugo. LEN isn't doing anything with padding except ignoring it. It's the concatenate operation that turns the padding from padding to not padding by putting non-space characters beyond the padding.

    Sorry for the late reply; I was on holiday. (FWIW - Spain, France, and Italy were all great places to be!)

    I'll try to explain what I mean. The varchar data type (assuming standard options) is designed to preserve trailing spaces. So if I assign 'ABC ' to a varchar column or variable, the three trailing spaces are kept, assuming that they are there for a reason. I like this behaviour. I still have the flexibility to get rid of it by using RTRIM(), but I would normally also assume that if someone takes the trouble to type those three extra spacecs, there's a reason for him/her.

    I don't like some of the choices made (either by the ANSI committee or by MS) on how those trailing spaces are handled. LEN() is one of those points. If we agree that those three trailing spaces are relevant, then why would their length not be relevant? And how can anyone explain to me that LEN(@a) is 3, LEN(@b) is 4, but LEN(@a + @b-2) is 10 instead of 7?

    As others have commented, there may be a connection with the behavior of the equality operator. I dislike that behavior too. If those trailing spaces are significant enough to store them, then why would they not be significant enough to affect equality? Or, reversed, if we truly think that 'ABC ' is equal to 'ABC', why not store them both in the same way?

    To reply to your direct question: "LEN isn't doing anything with padding except ignoring it" - LEN is not ignoring padding (which is the addition of extra spaces to arrive at a certain length); it is ignoring explicitly added and stored whitespace at the end.

    I will concede that there are good arguments for ignoring trailing whitespace, the best being that a human would ignore that trailing whitespace too (because in print, it is invisible). For me personally, I think the arguments in favor of retaining whitespace are stronger. But more to the point - those in favor of ignoring whitespace should advicate not storing it at all, rather than storing it and then disregarding it in some (but not all!) cases.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Nice question!

  • Hugo Kornelis (8/4/2012)


    Sorry for the late reply; I was on holiday. (FWIW - Spain, France, and Italy were all great places to be!)

    No need to apologise for the delay- no-one should expect instant repsonses. Anyway, if everywhere was great you should be gald you were there, not sorry!

    To reply to your direct question: "LEN isn't doing anything with padding except ignoring it" - LEN is not ignoring padding (which is the addition of extra spaces to arrive at a certain length); it is ignoring explicitly added and stored whitespace at the end.

    I don't think that it's any the less padding because a human, rather than some automton, added it. But I guess it depends on what you choose to call "padding" - language tends to be a bit flexible in places.

    I will concede that there are good arguments for ignoring trailing whitespace, the best being that a human would ignore that trailing whitespace too (because in print, it is invisible). For me personally, I think the arguments in favor of retaining whitespace are stronger. But more to the point - those in favor of ignoring whitespace should advicate not storing it at all, rather than storing it and then disregarding it in some (but not all!) cases.

    I think you are arguing for something which has a simple logical description. What we have is instead something which the designers of the standard thought was practical/useful, not necessarily simple, which needs complex (or convoluted) logic to describe it.

    I don't think there is any clear right way of handling these things. On the one hand, the thoroughly simple system in which LEN counts trailing spaces (and DATALEN counts bytes) is would be painful (writing rtrim all over the place - if we are comparing two columns, that destroys our ability to use indexes; maybe everyone could write things so that they never stored trainling spaces?) to use, and switching to it would of course break a lot of existing code. On the other hand, one could fall into all sorts of horrible complexity if we tried to work out a system that made everything we might want to do with strings simple (this is a classical dilemma - trying to make every easy for the user can lead to a complex nightmare of a UI).

    We need some sort of tests that tell us whether two strings are the same under some sameness rules, and those sameness rules have to take care of collation (including collations like for example latin1_general_ci_ai, which lead to equating things which are in some ways very different). Are two strings the same if they both represent the same sentence? If they are, we should be ignoring leading spaces as well as trailing spaces when we test, and even worse, we should be treating internal multiple spaces as single spaces, and internal line feeds and carriage returns as being equivalent to space - with an awkwardness about hyphens of course. The compromise we have (use the collation and ignore trailing spaces) is not perfect, but it's good enough for us to live with.

    At the same time we need some way of handling layout which, in the days before the enormous popularity of html, which is when the SQL conventions date from, meant paying attention to all spaces and carriage returns and line feeds.

    Should the concatenation rules treat padding the same way as the compromise equality rule that has been adopted? If yes, it becomes slightly more difficult (perhaps more fiddly, really) to handle layout. But maybe we should have several different concatenation functions, instead of just one. Should we have a greater variety of comparison operations for strings? Probably yes - if we had an equality operation that didn't ignore trailing spaces it would be a useful addition to the string operations that SQL has (but it wouldn't be a replacement for the current string = operation, just an extra operation - but I would be infavour of using "=" for the new operation and maybe "=~" for the existing one. We would want a few more sameness operations too, perhaps, to create an ideal world: for example do we want operation which delivers false when testing 'a' agains N'a'?

    At least with the length function we have two of at least four required to give a decent set: number of bytes per character is an orthogonal issue to whether trailing spaces count, and SQL has conflated them, leaving us with an incoherent pair.

    Tom

  • hi,

    I am a newbie in filed of DBa. Can anyone plz explain me how the 41,80 coming while executing--

    DECLARE @nvarchar nvarchar(40)

    DECLARE @nchar nchar(40)

    SET @nvarchar = '7 chars'

    SET @nchar = '7 chars'

    SELECT len(@nvarchar), len(@nchar),

    len(@nvarchar+';'), len(@nchar+';'),

    datalength(@nvarchar), datalength(@nchar)

    Correct answer:

    7,7,8,41,14,80

    thanks a lot in advance

    _______________________________________________________________
    To get quick answer follow this link:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • kapil190588 (10/5/2012)


    hi,

    I am a newbie in filed of DBa. Can anyone plz explain me how the 41,80 coming while executing--

    DECLARE @nvarchar nvarchar(40)

    DECLARE @nchar nchar(40)

    SET @nvarchar = '7 chars'

    SET @nchar = '7 chars'

    SELECT len(@nvarchar), len(@nchar),

    len(@nvarchar+';'), len(@nchar+';'),

    datalength(@nvarchar), datalength(@nchar)

    Correct answer:

    7,7,8,41,14,80

    thanks a lot in advance

    Yep.

    @nchar has data type nchar(40). Nchar is fixed length, so after assigning '7 chars' to it, it will be padded out to 40 characters by adding 33 spaces at the end. The expression @nchar + ';' then adds a semicolon after the 33rd space (the 40th character) for a total of 41 characters. That's the explanation for the 41.

    The result 80 is returned by datalength(@nchar). Remember that @nchar was padded to use all 40 characters. But the data type is nchar(40), not char(40). The difference is that nchar is able to store the full Unicode character set, at the expense of taking two bytes for each character. So to store 40 nchar characters, SQL Server uses 80 bytes.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • thanks a lot Hugo it clears to me now!!

    _______________________________________________________________
    To get quick answer follow this link:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • Got it right 7,7,8,41,14,80

    but thought 7,7,8,8,14,80

    but that was`t in ans 😛

    Neeraj Prasad Sharma
    Sql Server Tutorials

Viewing 13 posts - 31 through 42 (of 42 total)

You must be logged in to reply to this topic. Login to reply