Do You Deal with UTF-8?

  • Comments posted to this topic are about the item Do You Deal with UTF-8?

  • Seems like this will just add more stuff to be concerned with. Typically I just always use nvarchar thus internationalised text data is not a problem and since we primarily use .NET nor is it a problem using that text in code.

    A quick double check shows that UTF8 uses 1-4 bytes per character and UTF16 (nvarchar) uses 2-4 bytes - thus typically you are saving some space but is the extra space you'd save worth the hassle of more collations and type concerns? IMHO - it's not worth it.

  • it's not so easy as you might think. Greg Low had a note on it few days ago

    https://blog.greglow.com/2019/07/25/sql-think-that-varchar-characters-if-so-think-again/

     

  • I don't expect to bump into a database with UTF-8, unless it's something like a 3rd party ISV application that enables it by default. What is the use case for this?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • It's used a lot in code, here's a few examples.

    https://annevankesteren.nl/2009/09/utf-8-reasons

    I've never run into it in my career, but I've primarily been a US English data pro, so not a lot of other data to worry about.

  • I have been looking forward to UTF-8 support.  Our ERP and website use UTF-8 data, and our replications to SQL Server sometimes contain weird data in place of a ™ or other symbol.  Our database was originally set up as varchar, not nvarchar.  We are starting to work more with foreign languages, too.  I have had success working with UTF-8 in a different database at my last job, and was surprised it was not included in SQL Server yet.  I am hoping to get some of my databases switched over.

    When working with UTF-8 and defining varchar fields, I had to specify whether the length was 20 bytes or 20 char.  I think that bytes was the default if I didn't specify, but char was always what I needed.  I will look to see if there is something similar in SQL Server.

    One thing that I am concerned about is that we are currently using a case insensitive collation.  If there is not a CI collation for UTF-8, it may be more difficult to convert, because all the queries will need to be updated.


    Update:  I just read the link in Steve's editorial:  https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/  There is a lot of good information in there.  It looks like the UTF-8 collation is case insensitive, so I shouldn't have too much rework.  I didn't see any mention of specifying the number of characters.  It seems that everything is in byte lengths, so I may need to make a few fields wider.  Since most of my data is in the 0-127 range, it shouldn't be too bad.  Most of them have already been lengthened to accommodate the extra bytes already.

    • This reply was modified 4 years, 8 months ago by  nguntert.
    • This reply was modified 4 years, 8 months ago by  nguntert.
  • rgloeckner wrote:

    it's not so easy as you might think. Greg Low had a note on it few days ago

    https://blog.greglow.com/2019/07/25/sql-think-that-varchar-characters-if-so-think-again/ 

     

    Hi there. In its current form, that post contains several technical inaccuracies and should not be used as a reference. I have just posted a reply to correct 6 - 7 statements but the reply is currently under moderation so you can't see it. Hopefully he either approves it (my reply) or at the very least makes the corrections. I don't want to come off as being a jerk or snarky or arrogant or whatever is conveyed by short, plain text messages (i.e. very conducive to being interpreted as more harsh than intended), but that post is (currently) mostly misinformation that only serves to further confuse people. For a complete and accurate guide to how SQL Server stores "characters", please see:

    How Many Bytes Per Character in SQL Server: a Completely Complete Guide

     

    Take care, Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • nguntert wrote:

    Update:  I just read the link in Steve's editorial:  https://sqlquantumleap.com/2018/09/28/native-utf-8-support-in-sql-server-2019-savior-false-prophet-or-both/  There is a lot of good information in there.  It looks like the UTF-8 collation is case insensitive, so I shouldn't have too much rework.  I didn't see any mention of specifying the number of characters.  It seems that everything is in byte lengths, so I may need to make a few fields wider.  Since most of my data is in the 0-127 range, it shouldn't be too bad.  Most of them have already been lengthened to accommodate the extra bytes already.

    Hi there. Glad you found the info useful. I have since updated the article to remove some obsolete stuff, rearrange some things, add some new info, etc. I can also say that there is no singular UTF-8 collation (i.e. it's not "the UTF-8 collation"). The UTF-8 collations are mostly just a duplicate of all of the "_SC" and "_140_" collations with "_UTF8" added to the name. The only difference for the "_UTF8" version is that the encoding used for VARCHAR data (columns, variables, and literals) is UTF-8 / code page 65001 instead of one of the 8-bit code page normally associated with that locale (i.e. "Latin1_General" = code page 1252, "Hebrew" = code page 1255, etc). Hence, you can still tailor the sorting and comparison sensitivities for any non-binary UTF8 collation you use (i.e. case, accent, width, kana type, and variation selector).

    And yes, when specifying max size/length for the string types (anything that will accept a number), it is always in terms of "code units". Yes, most people, and maybe even most documentation/books, has characterized this value as being "characters", but that is oversimplifying it. To be fair, there was a time when it was accurate to say that the n in VARCHAR(n)  was "characters" and it was never incorrect, but that ended 20 years ago when SQL Server 2000 was released. But, "code units" are 1 byte in 8-bit datatypes (i.e. VARCHAR and the deprecated TEXT ) and 2 bytes in 16-bit datatypes (i.e. NVARCHAR and the deprecated NTEXT ). In UTF-16, "characters" are either 1 or 2 code units (i.e. 2 or 4 bytes). In UTF-8, "characters" are 1 - 4 code units (i.e. 1 - 4 bytes).

    For more details regarding characters vs bytes, please see:

    How Many Bytes Per Character in SQL Server: a Completely Complete Guide

     

    Take care,

    Solomon...

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply