Varchar or Char?

  • I thought the question was a fair one, especially as I'm in the minority of listmembers who chose the answer wisely :-), but I really wanted to say "Thank you" to Hugo et al. for a most-educational discussion!

  • Great discussion, and nice points (as always), Hugo!

  • 1. The character set has already been defined in the question so regardless, nchar and nvarchar are out.

    I see your point but I'm not really sure I agree with you. Just because he said it would be mixed letters and numbers doesn't mean that I won't need to process identifiers using letters and numbers from a different alphabet. My point is that he wasn't specific so a degree of ambiguity exists.

    2. The average length isn't defined but it was a simple calculation. When building a database and constructing tables we rarely know what the contents of the table will be in advance. We therefore must assume they will be approximately equal if there is nothing to suggest the contrary.

    He did say evenly distributed so I will retract my previous remark as that is pretty clear.

    3. I take your point that varchar is more flexible and in future changes might make char less efficient, however at that point we're thinking outside the scope of the question. At the present time with the present facts, the choice is clear, however perhaps in a real setting you might choose differently given the purpose of the column

    How is it that I'm thinking outside the scope of the question? My point is that he didn't clearly define the scope of what constituted "efficient" therefore any of the answers could be correct depending on your point of view. As you said, "perhaps in a real setting you might choose differently given the purpose of the column". Since he did specify that it was an "external identifier" I made the assumption that this identifier was the key to an external table. I don't want to have to use a TRIM function every time I refer to the column as that is decidedly NOT efficient and makes the queries less readable. It isn't that using a trim function is hard it just means that I have additional processing that must be performed with every row that is evalulated/returned. Not exactly my idea of efficient.

    For me, efficiency is more about query performance and less about storage considerations. It's not that I discount the importance of storage considerations it's just that these considerations are less significant in this era of relatively cheap storage; however, they become more significant if I'm designing a multi-terabyte data warehouse for example instead of your run-of-the-mill 50 GB OLTP database. Consequently, the definition of "efficiency" may be vastly different.

    "Beliefs" get in the way of learning.

  • Robert,

    With regard to your first point, the question says

    all English letters, both upper and lower case mixed with numeric digits

    These are all a part of the standard ASCII character set and therefore would make the need for a unicode based string moot. If this were any other language then yes, you would need to use unicode but char will cut it in this scenario

    As for the third point, yes, i agree. Coming more from the programmer side of the game rather than the administrative side, a couple of extra kb here and there and a few microseconds extra delay in reading it are worth it for the reduction in time taken to write the procedure. Anything that makes the code neater, easier to read and easier to write is fine by me, but I find these questions are rarely about the real world, rightly or wrongly. Don't forget almost all the questions on here are backed up by a link to BOL, hardly a source that takes into account the practical applications and is simply matter of fact about the information at hand. On this basis I guess you just have to ask yourself; what would Books Online say?

  • Thanks Hugo, another gold nugget of SQL "under the hood".

    Just thought I would add my thanks, and also tag this discussion as I dare say it will be invaluable in the future!

    I must start reading Hugo's blog...

    --------
    [font="Tahoma"]I love deadlines. I like the whooshing sound they make as they fly by. -Douglas Adams[/font]

  • I guess I need to learn to read a little more carefully. My apologies. However, I must take one final whack at this dead horse. The fact that the identifiers have English characters doesn't mean that the app displaying these values isn't displaying them in another language. (I know, why would anyone display identifiers, but the fact is, they often do.)

    I figured I would get the answer wrong because I do tend to think from a practical point of view and not based on a discussion of the internals in BOL. I also figured this debate would be interesting and I would learn something so I'm glad I got a chance to engage.

    I suppose that if I want questions that are less ambiguous I should submit them myself. 🙂

    It's interesting to note that roughly 75% of us answered it "wrong" so perhaps questioning the scope of what constitutes "efficiency" isn't that far off base.

    "Beliefs" get in the way of learning.

  • I'm glad you've found this conversation interesting; I have too 🙂 I tend to think these questions are like ones you get when you go for an interview for a DBA job or something, technically they're correct but their context in real life is lacking in the extreme and life is very rarely clear cut, unambiguous or pre-defined in its scope.

    Even if you didn't read the question that clearly, are you really doing anything any one of us haven't done before? I can't remember the number of times i've misunderstood a specification or a request and come out with something completely wrong. I just take those lessons and move on hoping not to repeat them 🙂

  • Robert Frasca (10/16/2009)


    I guess I need to learn to read a little more carefully. My apologies. However, I must take one final whack at this dead horse. The fact that the identifiers have English characters doesn't mean that the app displaying these values isn't displaying them in another language. (I know, why would anyone display identifiers, but the fact is, they often do.)

    That's misreading the question. English characters in the db have nothing to do with the application. The idea is not for you to pick apart the question and look for an exception. It's to learn something about the simple case.

    I would agree with you that the "efficient" work is misleading. It doesn't really define the criteria. I picked varchar, thinking that above 4-5 char, I just do that. I don't think there's been a reason in most apps to not pick it. There might be, and as Hugo mentioned, your app might require dealing with spaces, and then it would be important.

    However with 1TB disk drives, and SQL reading in pages, I don't think the storage space, or performance is a big deal. It seems like it's a case of arguing if the Ferrari or the Lambrogini is a better sports car. They're both quick.

  • I agree with cengland0.

    Amol Naik

  • If no special criteria other than length is specified I usually select varchar (yes, I'm a programmer). However given the fact that the data was to be used as "identifiers", coupled with the relatively short length, I picked char. Indexed or not, the column is likely to be frequently used in where clauses. My instincts say that for those reasons, char will be more efficient.

    I get the feeling that John derived this question from a real world application rather than just for kicks.

    Tom Garth
    Vertical Solutions[/url]

    "There are three kinds of men. The one that learns by reading. The few who learn by observation. The rest of them have to pee on the electric fence for themselves." -- Will Rogers
  • Wow. Amazing amount of information in this discussion, folks. I thank you all, especially those who pointed out alternate takes on the question and explanation.

    Yes, when formulating the question, I was thinking only of disk space used when I said "efficient", so please accept my apologies for that bit of ambiguity. I do thank Hugo for stepping in with his disection of the various issues raised -- I've learned a lot from that discussion.

    Also, as Steve alludes to in his post a couple ahead of this one, generally I've found that the QOD is focused on an idealized case to emphasize a particular point. The point I was thinking of here is that sometimes what appears to be an obvious choice may not be what was intended. In this case, if you really wanted to save every possible byte of storage, then char(10) would be better than varchar(10). Yes, in the 'real world' most of the time varchar(10) may be just as good in a practical sense, and would provide a bit more flexibity for possible future requirements changes, but this is QOD-land, not the real world.

    One more thing: I wouldn't have guessed that the 57% of respondants who chose varchar(10) were only interpreting "efficient" as having a broader meaning that included CPU usage and so on, but that most of them didn't work out the actual disk space usage to see that varchar(10) requires an average of a half byte more than char(10).

  • Tom Garth (10/16/2009)


    .....I get the feeling that John derived this question from a real world application rather than just for kicks.

    Actually, yes. I've been going through a database structure provided by a vendor that sells their software globally. We use it only in the USA with English, and we've found that we would be able to cut the size of our databases by a substantial amount by changing many of the nchar() columns to varchar(). There are a few of those columns that vary in length, but only slightly, and use most of the vendor-allocated column length, so they will be changed from nchar() to char(), and not varchar().

  • john.arnott (10/16/2009)


    Tom Garth (10/16/2009)


    .....I get the feeling that John derived this question from a real world application rather than just for kicks.

    Actually, yes. I've been going through a database structure provided by a vendor that sells their software globally. We use it only in the USA with English, and we've found that we would be able to cut the size of our databases by a substantial amount by changing many of the nchar() columns to varchar(). There are a few of those columns that vary in length, but only slightly, and use most of the vendor-allocated column length, so they will be changed from nchar() to char(), and not varchar().

    This is largely off topic but I found this post interesting.

    Just out of curiosity, do you need to cut the size of your database substantially or does it just offend you that so much space is theoretically wasted? I can't help but wonder if the cost of changing the data types can be recovered by the savings gleaned from shrinking your databases? (Since the cost of storage is significantly less than a dollar a GB you'll need some pretty serious space savings to make it worth your while.) While it's true you might be able to fit a few more rows on each page, quantifying performance gains gleaned from shrinking the database is a little harder to do but is almost certainly measured in microseconds and the loss of scalability into foreign markets is a fairly significant trade off.

    I don't know whether you've done this or not but storing your data in unicode (as English) and then displaying it in your apps using another language is really pretty easy but if your data isn't in unicode you're seriously screwed should that requirement suddenly surface. The world is flat and unicode is a fairly insignificant cost of doing business. Clearly, your vendor understands this.

    "Beliefs" get in the way of learning.

  • Yes, hard drives now cost in the range of a dollar a GB, but the line of business I support has to pay the enterprise infrastructure division quite a bit more than than on an ongoing basis as a pro-rated share of support services. I only wish I could say "go down to the nearest Best-Buy and pick up a couple of 2TB USB drives". Nope. In this (and I suppose most large corporations), managed storage is priced to the user based on the full estimated cost of ownership.

  • See I got it wrong for two reasons:

    I was a knuckle head and missed the "English only" so yes varchar or char. The choice between the two though, to me, is subjective. Yes VARCHAR's do require an additional 2 bytes as length identifiers (remember VSAM? lol). I question is "Their length will vary from seven to ten characters" ... "with an even distribution of lengths" "Since the average size of the data is 8.5". We can't store data in half bytes (unless you are working with assembler. remember 3270 ASM?). In fact we can't store data in perfect little bytes. There are 32 or 64 bit WORD boundaries. For me 8, 12, or 16 would have been more efficient, but that's just the hardware nut in me.

Viewing 15 posts - 31 through 45 (of 52 total)

You must be logged in to reply to this topic. Login to reply