• Hugo Kornelis (8/4/2012)


    Sorry for the late reply; I was on holiday. (FWIW - Spain, France, and Italy were all great places to be!)

    No need to apologise for the delay- no-one should expect instant repsonses. Anyway, if everywhere was great you should be gald you were there, not sorry!

    To reply to your direct question: "LEN isn't doing anything with padding except ignoring it" - LEN is not ignoring padding (which is the addition of extra spaces to arrive at a certain length); it is ignoring explicitly added and stored whitespace at the end.

    I don't think that it's any the less padding because a human, rather than some automton, added it. But I guess it depends on what you choose to call "padding" - language tends to be a bit flexible in places.

    I will concede that there are good arguments for ignoring trailing whitespace, the best being that a human would ignore that trailing whitespace too (because in print, it is invisible). For me personally, I think the arguments in favor of retaining whitespace are stronger. But more to the point - those in favor of ignoring whitespace should advicate not storing it at all, rather than storing it and then disregarding it in some (but not all!) cases.

    I think you are arguing for something which has a simple logical description. What we have is instead something which the designers of the standard thought was practical/useful, not necessarily simple, which needs complex (or convoluted) logic to describe it.

    I don't think there is any clear right way of handling these things. On the one hand, the thoroughly simple system in which LEN counts trailing spaces (and DATALEN counts bytes) is would be painful (writing rtrim all over the place - if we are comparing two columns, that destroys our ability to use indexes; maybe everyone could write things so that they never stored trainling spaces?) to use, and switching to it would of course break a lot of existing code. On the other hand, one could fall into all sorts of horrible complexity if we tried to work out a system that made everything we might want to do with strings simple (this is a classical dilemma - trying to make every easy for the user can lead to a complex nightmare of a UI).

    We need some sort of tests that tell us whether two strings are the same under some sameness rules, and those sameness rules have to take care of collation (including collations like for example latin1_general_ci_ai, which lead to equating things which are in some ways very different). Are two strings the same if they both represent the same sentence? If they are, we should be ignoring leading spaces as well as trailing spaces when we test, and even worse, we should be treating internal multiple spaces as single spaces, and internal line feeds and carriage returns as being equivalent to space - with an awkwardness about hyphens of course. The compromise we have (use the collation and ignore trailing spaces) is not perfect, but it's good enough for us to live with.

    At the same time we need some way of handling layout which, in the days before the enormous popularity of html, which is when the SQL conventions date from, meant paying attention to all spaces and carriage returns and line feeds.

    Should the concatenation rules treat padding the same way as the compromise equality rule that has been adopted? If yes, it becomes slightly more difficult (perhaps more fiddly, really) to handle layout. But maybe we should have several different concatenation functions, instead of just one. Should we have a greater variety of comparison operations for strings? Probably yes - if we had an equality operation that didn't ignore trailing spaces it would be a useful addition to the string operations that SQL has (but it wouldn't be a replacement for the current string = operation, just an extra operation - but I would be infavour of using "=" for the new operation and maybe "=~" for the existing one. We would want a few more sameness operations too, perhaps, to create an ideal world: for example do we want operation which delivers false when testing 'a' agains N'a'?

    At least with the length function we have two of at least four required to give a decent set: number of bytes per character is an orthogonal issue to whether trailing spaces count, and SQL has conflated them, leaving us with an incoherent pair.

    Tom