UTF-8 and varchar lengths

  • Regarding the explanation for the correct answer, the reason why it couldn't be 10,000 is correct. But the first part regarding VARCHAR, NVARCHAR, and UTF-8 could use some clarification.


    When declaring the length of a varchar (or nvarchar), the length parameter does not mean the total number of characters that can be stored, but the number of bytes. Both datatypes, without using MAX, can store up to 8,000 bytes; varchar stores characters using single bytes while nvarchar uses double bytes (hence why it is capped at 4,000).

    The length parameter for CHAR / NCHAR / VARCHAR / NVARCHAR is actually neither characters nor bytes. While for CHAR and VARCHAR the value does always equate to bytes, for "typical" usage it is more closely aligned with characters across all 4 of those datatypes. The length parameter is actually the number of "code units" the type can hold, a code unit being the unit in which the data is actually read. For CHAR and VARCHAR a code unit is always 1 byte; for NCHAR and NVARCHAR it is always 2 bytes. Keep in mind that a single code unit "typically" equates to a single character, but even prior to SQL Server 2019 introducing UTF-8, CHAR and VARCHAR already supported characters that require two code units (when using any of the 4 double-byte code pages). With UTF-8, characters can be represented by 1, 2, 3, or even 4 code units. For NCHAR and NVARCHAR (always encoded as UTF-16), the first 65,536 characters (the BMP range) are represented by a single code unit (2 bytes), while the remaining 1,048,576 characters (the supplementary range; not all mapped yet) are all represented by two code units (4 bytes).
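
    To make the code-unit distinction concrete, here is a minimal sketch of mine (not from the original post): a supplementary character stored in NVARCHAR occupies two UTF-16 code units.

    -- U+1F600 (the grinning-face emoji) is a supplementary character, so in
    -- UTF-16 it takes two code units (a surrogate pair) = 4 bytes. It is built
    -- here from its surrogate halves so this works even without an "_SC" collation.
    DECLARE @Emoji NVARCHAR(10) = NCHAR(0xD83D) + NCHAR(0xDE00);

    SELECT DATALENGTH(@Emoji) AS [Bytes],  -- 4 (two 2-byte code units)
           LEN(@Emoji)        AS [Length]; -- 2 under a non-"_SC" collation;
                                           -- 1 under an "_SC" collation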


    When using a collation such as a UTF-8 collation, each character stored in the string can use a different number of bytes. A character like "A" uses a single byte, "æ" uses 2 bytes, and "◘" uses 3 (there is a quick sketch of this after the numbered list below).

    1. Along with UTF-8, the 4 double-byte code pages (for CHAR and VARCHAR), as well as NCHAR and NVARCHAR (always UTF-16 Little Endian), are all variable-width encodings (as explained above with regard to code units).
    2. Also, UTF-8 represents all supplementary characters using 4 bytes (same for UTF-16, and even UTF-32, though that encoding is not supported in SQL Server).
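
    As a quick illustration of those byte counts, here is a minimal sketch of mine (assuming SQL Server 2019 or newer, since that is where UTF-8 collations were introduced; the specific UTF-8 collation chosen is just an example):

    -- Byte widths of the three example characters under a UTF-8 collation:
    DECLARE @Utf8 VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8 = N'Aæ◘';

    SELECT LEN(@Utf8)        AS [Characters], -- 3
           DATALENGTH(@Utf8) AS [Bytes];      -- 1 + 2 + 3 = 6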


    For a more detailed examination of how the string datatypes store characters, please see the following post of mine:

    How Many Bytes Per Character in SQL Server: a Completely Complete Guide


    Take care,

    Solomon..

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky wrote:

    Eirikur Eiriksson wrote:

    I realized the SSMS settings were affecting the results as soon as I posted. I have a fresh ("vanilla") install of SQL Server Management Studio 15.0.18338.0 on this laptop.

    😎

    The SQL Server version is:

    Microsoft SQL Server 2019 (RTM-GDR) (KB4517790) - 15.0.2070.41 (X64) Oct 28 2019 19:56:59 Copyright (C) 2019 Microsoft Corporation Developer Edition (64-bit) on Windows 10 Pro 10.0 <X64> (Build 19041: )

    OK. I'm using nearly the same version of SSMS (18.7.1 == 15.0.18358.0). And I downloaded a container to get the same GDR patch level:

    Microsoft SQL Server 2019 (RTM-GDR) (KB4517790) - 15.0.2070.41 (X64)

    Oct 28 2019 19:56:59

    Copyright (C) 2019 Microsoft Corporation

    Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) <X64>

    So, the SQL Server version is the same; the only difference is that it's running on Linux instead of Windows. I ran your test code and got back the expected 100,000. Not sure what is causing you to get back 8,000. What is the DB's default collation?
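
    For reference, a quick way to check both the instance-level and database-level collations (just a convenience snippet; it wasn't part of the original exchange):

    SELECT SERVERPROPERTY('Collation')                AS [InstanceCollation],
           DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DatabaseCollation];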

    Also, regarding changing the query result option to 1,000,000 but still getting back 65,536: that setting most likely only takes effect for new query windows, not existing ones.

    Take care,

    Solomon..

    You are right, Solomon: restarting the SSMS instance brings back 100,000 characters.

    😎

    The default server collation is Latin1_General_CI_AS

  • Eirikur Eiriksson wrote:

    You are right, Solomon: restarting the SSMS instance brings back 100,000 characters.

    😎

    The default server collation is Latin1_General_CI_AS


    Sorry, I realize now that I said "query window", but I meant to say "query tab". You don't need to shut down SSMS for that type of change; just close that tab and open another. If you're changing a font, then you need to fully exit SSMS and open it again (if I remember correctly).

    Also, I just re-tested using "Latin1_General_CI_AS" as both the instance-level and DB-level collation and am still getting the expected 100,000 value returned from LEN(). Something must be different with your setup (though I can't think of what would cause this particular issue), or perhaps there is something going on with the query that is not visible. Is that the exact query that's running? Can you reproduce this on any other instance of SQL Server? Can you create the table in [tempdb] on that instance and run that same test? Something along the lines of the sketch below.
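
    A sketch of the general shape of such a test (the exact original query isn't visible in this thread, so the table name, column name, and collation here are only placeholders):

    USE [tempdb];

    CREATE TABLE dbo.Utf8Test
    (
        Col1 VARCHAR(MAX) COLLATE Latin1_General_100_CI_AS_SC_UTF8
    );

    -- NOTE: the CONVERT to VARCHAR(MAX) matters: REPLICATE truncates its
    -- return value at 8,000 bytes when the input is not a MAX type, which
    -- would itself produce the 8000 being reported.
    INSERT INTO dbo.Utf8Test (Col1)
    VALUES (REPLICATE(CONVERT(VARCHAR(MAX), 'a'), 100000));

    SELECT LEN(Col1) AS [CharCount] FROM dbo.Utf8Test; -- expecting 100000

    DROP TABLE dbo.Utf8Test;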


    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations     •     Module Signing     •     SQLCLR
