Varchar or Char?

  • diamondgm (10/16/2009)


    diamondgm (10/16/2009)


    Firstly, we assume that the average is the result of an even spread (that the average length is in fact 8.5 - we only need an average of 8 for storage to be a moot point in char vs varchar).

    That's not an assumption, the even spread is explicitly mentioned in the question. Without it, I agree that the question would have been harder (if not impossible) to answer.

    I think our definition of the word 'explicit' differs dramatically. I did not find the question to indicate that the spread was even as a fact.

    Then our definition of the word 'explicit' does indeed differ dramatically.

    The question reads (direct quote; emphasis added by me):

    "Their length will vary from seven to ten characters, all English letters, both upper and lower case mixed with numeric digits, with an even distribution of lengths."

    What would it need to satisfy your definition of 'explicit'?

    Secondly, within SQL Server, the efficiency may be best to use char in that instance, but take in to account the code operations of trimming and I think it becomes debatable.

    I doubt it. Almost all business applications I have seen have their performance tied to an I/O bottleneck, and have the CPU twiddling its virtual thumbs while waiting for more data to be read or written. Those few cycles the CPU spends to trim trailing spaces won't affect performance at all; it just means one less thumb to twiddle.

    In the cases where this is true (which is most of the time), white space would adversely affect I/O, no? (Maybe I am very wrong here.)

    I/O is affected by the number of bytes. Or rather, for SQL Server, I/O is affected by the number of rows per 8K page. If the row length is such that saving a few bytes per row only increases the unused space on each page without increasing the number of rows (e.g. when the row length goes from 2,200 to 2,100 bytes, you still get only 3 rows per page), I/O will not be affected. When the number of rows per page increases, less I/O will be required to do the same amount of work.

    Whether the bytes contain whitespace or character data is not relevant at all. What matters is just the number of bytes - and most of all the integer part of the outcome of dividing 8,060 by that number.
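    To make that arithmetic concrete, here is a rough sketch of the rows-per-page calculation (the row sizes below are made up, and the real fit also depends on row and page overhead):

    SELECT row_bytes,
           FLOOR(8060.0 / row_bytes) AS rows_per_page   -- roughly 8,060 usable bytes per 8 KB page
    FROM (VALUES (2200), (2100), (2000)) AS r(row_bytes);
    -- 2200 -> 3 and 2100 -> 3 (no I/O gain), but 2000 -> 4 (more rows per page, less I/O)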

    I agree with you on a purist level, but practically, I would go with varchar still simply for the programmers' ease.

    How, exactly, is varchar easier for a programmer?

    Have you never had a programmer complain about having to trim on certain fields and not on others and wishing for a "standard" approach to text data?

    😀 Programmers will complain about everything. Even about the requirement to test.

    If a programmer is too lazy to type RTRIM(...) around a column name when coding a report and is prepared to let user performance suffer to satisfy his laziness, I'm done with him or her very quickly.

    (And if push comes to shove, I can always create a view to cast the column to varchar and trim the trailing spaces)
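    For what it's worth, a minimal sketch of such a view (table and column names are hypothetical):

    CREATE VIEW dbo.CustomersTrimmed
    AS
    SELECT CustomerID,
           CAST(RTRIM(CustomerCode) AS varchar(10)) AS CustomerCode  -- char(10) presented as a trimmed varchar(10)
    FROM dbo.Customers;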


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Damian Widera-396333 (10/16/2009)


    In terms of I/O, we could assume fixed length data types are better than varying ones.

    If, and only if,

    a) the average length of data is only 2 characters or less below the maximum length. I'd never dream of storing address information in fixed length columns!

    b) there is no need to preserve trailing spaces. If there is, varchar is the only option that won't corrupt your data.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • diamondgm (10/16/2009)

    diamondgm (10/16/2009)

    Firstly, we assume that the average is the result of an even spread (that the average length is in fact 8.5 - we only need an average of 8 for storage to be a moot point in char vs varchar).

    That's not an assumption, the even spread is explicitly mentioned in the question. Without it, I agree that the question would have been harder (if not impossible) to answer.

    I think our definition of the word 'explicit' differs dramatically. I did not find the question to indicate that the spread was even as a fact.

    Then our definition of the word 'explicit' does indeed differ dramatically.

    The question reads (direct quote; emphasis added by me):

    "Their length will vary from seven to ten characters, all English letters, both upper and lower case mixed with numeric digits, with an even distribution of lengths."

    What would it need to satisfy your definition of 'explicit'?

    OMG, I didn't see that, my humblest apologies!

    Uncle!

  • Hugo Kornelis (10/16/2009)


    That's really a shame, as the only good reasons to use varchar instead of char are a difference of more than 2 between average and maximum length, or a need to preserve trailing spaces.

    Of course, if I need the trailing spaces I would go for char, but I have seen statements where the programmer doesn't want the trailing spaces and uses RTRIM to remove them, and also uses RTRIM in the WHERE clause to fetch the records regardless of the spaces:

    SELECT somecol FROM sometable WHERE RTRIM(anothercol)='somevalue'

    And this causes SQL Server to do a table or index scan (lots of I/O), with terrible performance as a result. This issue is not solved completely by using varchar, but I can reduce some of the cases where the programmers are using functions on the left side of the equals sign. The correct way to solve this issue is to teach everyone to be careful about how they use functions in the WHERE clause, and I do that all the time, but... new project, new programmers.
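    To make the pattern concrete, a minimal sketch reusing the names from the example above - the first form prevents an index seek on anothercol, the second allows one:

    -- Function on the column: table or index scan
    SELECT somecol FROM sometable WHERE RTRIM(anothercol) = 'somevalue'

    -- Plain comparison on the column: an index on anothercol can be used for a seek
    SELECT somecol FROM sometable WHERE anothercol = 'somevalue'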

    /Håkan Winther
    MCITP:Database Developer 2008
    MCTS: SQL Server 2008, Implementation and Maintenance
    MCSE: Data Platform

  • hakan.winther (10/16/2009)


    Hugo Kornelis (10/16/2009)


    That's really a shame, as the only good reasons to use varchar instead of char are a difference of more than 2 between average and maximum length, or a need to preserve trailing spaces.

    Of course, if I need the trailing spaces I would go for char

    OOPS - misunderstanding alert!

    If you need to preserve trailing spaces, you MUST use varchar. Varchar stores the data with the trailing spaces exactly as they were on input (that is, with default settings). Char pads data with spaces to the specified length, and there is no way to reconstruct which spaces were already there and which were added by SQL Server to get the string to the proper length.
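    A quick sketch of the difference, with made-up values and default settings assumed:

    DECLARE @c char(10) = 'abc'        -- padded with spaces to 10 characters
    DECLARE @v varchar(10) = 'abc   '  -- the three trailing spaces entered are kept

    SELECT DATALENGTH(@c) AS char_bytes,    -- 10: padding indistinguishable from real trailing spaces
           DATALENGTH(@v) AS varchar_bytes  -- 6: exactly what was entered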

    but I have seen statements where the programmer doesn't want the trailing spaces and uses RTRIM to remove them, and also uses RTRIM in the WHERE clause to fetch the records regardless of the spaces:

    SELECT somecol FROM sometable WHERE RTRIM(anothercol)='somevalue'

    The real problem of these programmers (apart from not understanding how SQL Server indexes work) is a lack of understanding of string comparisons. The RTRIM is completely superfluous, both for char and varchar datatypes. Trailing spaces are ignored in these comparisons anyway. (Well, technically the shorter string is padded with spaces to match the length of the longer string before starting a character-by-character comparison, but that has the exact same effect as ignoring trailing spaces.)
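    A one-line demonstration of that padding rule:

    SELECT CASE WHEN 'abc' = 'abc   ' THEN 'equal' ELSE 'not equal' END AS result
    -- returns 'equal', for char and varchar operands alike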


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • The question asks which is more efficient.

    The explanation describes which uses least space, which is not the same thing as most efficient.

    I suspect that the answer is correct, but would be interested to see a full explanation as to why. How do char and varchar differ in their treatment by the optimiser, for instance?

    And what about storage/performance differences if the column is indexed? The question doesn't say that it is, but I'd expect that it would be.

    This was a good question.

  • Hugo Kornelis (10/16/2009)


    Damian Widera-396333 (10/16/2009)


    In terms of I/O, we could assume fixed length data types are better than varying ones.

    If, and only if,

    a) the average length of data is only 2 characters or less below the maximum length. I'd never dream of storing address information in fixed length columns!

    b) there is no need to preserve trailing spaces. If there is, varchar is the only option that won't corrupt your data.

    Hugo

    Personally, I would use the varchar() data type in the mentioned case, but I just tried to explain why the correct answer is char(). I agree with your remarks 🙂

    Kindest Regards,

    Damian Widera
    SQL Server MVP,
    MCT, MCSE Data Platform, MCSD.NET

  • Toreador (10/16/2009)


    The question asks which is more efficient.

    The explanation describes which uses least space, which is not the same thing as most efficient.

    I suspect that the answer is correct, but would be interested to see a full explanation as to why. How do char and varchar differ in their treatment by the optimiser, for instance?

    And what about storage/performance differences if the column is indexed? The question doesn't say that it is, but I'd expect that it would be.

    Hi Toreador,

    Good questions. Let me try to answer them.

    In general, saving space does equate to saving performance and increasing efficiency when dealing with SQL Server. Most of the system's performance is determined by the amount of data that has to be read from and written to disk, and the amount of data that can be held in cache. There may be borderline cases where CPU cycles determine the performance of an app, but these will be the exceptions.

    Since you ask specifically about the optimizer - I don't think it makes much difference between these two datatypes.

    For indexes, the same I/O reduction considerations apply that hold for all data. Since indexes effectively duplicate some data, I/O reductions in indexed columns count double. There's also another consideration when dealing with indexes, and that is fragmentation. If values added to the index always go at the "end" (as is the case with untampered-with IDENTITY columns or with NEWSEQUENTIALID() values), all index pages will neatly fill up and then be complete. If they are scattered, pages that get full will have to be split, which slows down both the insert (a page split is relatively expensive) and all subsequent operations (as you now have two half-full pages, thus requiring more I/O for the same amount of data). But char or varchar does not affect this, as they both sort the same.

    One aspect that can reduce performance of varchar columns is how updates are treated when the new value takes more space than the old value. This is not a problem if there are still enough unused bytes on the page - but otherwise, you'll once more run into a page split, with the disadvantages described above.
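    If you want to see whether page splits are actually hurting a given table, a query along these lines (the table name is hypothetical) reports fragmentation and page fullness:

    SELECT index_id,
           avg_fragmentation_in_percent,   -- high values often indicate page splits
           avg_page_space_used_in_percent  -- low values mean half-empty pages, i.e. more I/O
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.SomeTable'), NULL, NULL, 'SAMPLED');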

    Everything considered, I'd say that varchar has a few disadvantages that char doesn't have: extra CPU cycles are needed to find the start and end position of the value in the row, and there is an increased chance of page splits as rows are updated with new, longer data.

    I'd say that as a generic rule of thumb, one should:

    * use varchar if trailing spaces need to be preserved, as char can't do that;

    * use char if all strings are of the same length;

    * use char if the average string length differs by up to 2 bytes from the maximum string length;

    * use varchar if the average string length differs by more than 3 bytes from the maximum string length;

    * use either char or varchar, depending on the actual situation, if the average length and maximum length differ by more than 2 but not more than 3 bytes - but without additional info, I'd probably choose char in this case.
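    To apply this rule of thumb to an existing column, a quick check of the sort below gives the numbers to compare (hypothetical table and column names; DATALENGTH rather than LEN, because LEN ignores trailing spaces):

    SELECT AVG(CAST(DATALENGTH(SomeColumn) AS decimal(10,2))) AS avg_bytes,
           MAX(DATALENGTH(SomeColumn)) AS max_bytes
    FROM dbo.SomeTable;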


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • I based my answer on reality. In reality data is not immutable. What is true today probably won't be true six months down the road. If this was an interview question how would you answer?

  • Fatal Exception Error (10/16/2009)


    I based my answer on reality. In reality data is not immutable. What is true today probably won't be true six months down the road. If this was an interview question how would you answer?

    I agree, it is hard to predict the future, and decisions made on the assumptions that the data will be distributed in the same way in the future may in the end cause bottlenecks.

    /Håkan Winther
    MCITP:Database Developer 2008
    MCTS: SQL Server 2008, Implementation and Maintenance
    MCSE: Data Platform

    While I like this type of question in theory, in this example "efficient" is apparently defined only in the context of how many bytes are being stored. When asking this kind of question, it should be clear what the metric is for defining what constitutes "efficient". If you don't define it, the choice of answers is subjective. For example:

    1. Using nchar or nvarchar might be more "efficient" if I'm working on an application that requires multi-language support.

    2. The assumption that the average length would be 8.5 characters wasn't stated. Suppose the average is less than 8? What about padding for lengths less than 8? Is it more efficient to have to write apps that require the use of trim functions?

    3. The whole point of using varchar and nvarchar is to allow flexibility. It might be argued that flexibility is more efficient.

    4. Is the use of a variable length "identifier" even a rational choice, let alone an efficient one, in the first place? I probably wouldn't use a variable length field as an identifier. I would use a fixed length, preferably integer, surrogate key as an identifier and store the native key (the external identifier) separately. It would be less efficient from a storage point of view but more efficient from several other points of view, including as a primary key and/or clustered index.

    We could argue endlessly about what is the most efficient datatype in a host of scenarios so, from a QOD point of view, it's important to define the scenario to avoid ambiguity.

    "Beliefs" get in the way of learning.

  • Thanks Hugo 🙂

    I was all smug in the fact that I realized NVarchar would be less efficient than Varchar in this instance and checked varchar. As I sipped my coffee and waited for my CORRECT response, I was taken aback by the "Sorry, you are wrong" answer. Once I read the explanation and realized my mistake, I again learned something about how to read and answer these questions and a little more about the data types.

    Keep these questions coming... you always learn from your mistakes.

    Thanks

    Steve

    I thought he was pretty clear on the details. There was a similar question a few months ago that simply asked which was more efficient: char, varchar, nchar or nvarchar? That question was rightly attacked for being unclear and vague about what "efficient" meant, and in what context.

    In this case, details included the length, the distribution of that length and the language used, which is all the information needed to come to the correct answer.

    With regard to your points

    1. The character set has already been defined in the question so regardless, nchar and nvarchar are out.

    2. The average length isn't defined, but it was a simple calculation (the arithmetic is sketched after this list). When building a database and constructing tables, we rarely know what the contents of the table will be in advance. We therefore must assume the lengths will be approximately evenly distributed if there is nothing to suggest the contrary.

    3. I take your point that varchar is more flexible and that future changes might make char less efficient; however, at that point we're thinking outside the scope of the question. At the present time, with the present facts, the choice is clear, though in a real setting you might perhaps choose differently given the purpose of the column.

    4. I don't really have any problem here, I would agree with you that an integer based key would make more sense.
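    For completeness, the calculation referred to in point 2 works out as follows; this just restates the arithmetic given in the question, with the extra 2 bytes being the length information varchar stores per value:

    SELECT AVG(CAST(len_chars AS decimal(4,1))) AS avg_length,            -- 8.5
           AVG(CAST(len_chars AS decimal(4,1))) + 2 AS avg_varchar_bytes, -- 10.5
           10 AS char10_bytes                                             -- always 10
    FROM (VALUES (7), (8), (9), (10)) AS v(len_chars);
    -- so char(10) averages half a byte less per value than varchar(10) here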
