Other methods of getting the true length of UTF8 String?

  • Something I've been playing around with lately is the UTF8 functionality in 2019, and it dawned on me that getting the "true" length a string in a UTF8 string is actually not as easy as for a "normal" varchar (and nvarchar).

    Let's take the following table:

    CREATE TABLE dbo.TestTable (VarcharString varchar(30) COLLATE Latin1_General_100_CI_AS,
    NvarcharString nvarchar(30) COLLATE Latin1_General_100_CI_AS,
    UTF8String varchar(30) COLLATE Latin1_General_100_CI_AS_SC_UTF8);
    GO
    INSERT INTO dbo.TestTable (VarcharString,NvarcharString, UTF8String)
    SELECT V.String,V.String,V.String
    FROM (VALUES(N' Simple string '), --19 characters
    (N' Weißbier '), --14 characters
    (N' Smile!?? '))V(String); --14 characters (?? counts as 2)

    So here we have 3 strings, one 19 characters, and the other 2 are 14 characters in length.  Getting the length of the 2 strings for the varcharand nvarcharvalues is easy DATALENGTH() and DATALENGTH()/2 respectively, however, that doesn't work for the UTF8 string. If you try the below you'll see this quickly:

    SELECT DATALENGTH(VarcharString) AS VarcharLen,
    DATALENGTH(NvarcharString) / 2 AS NvarcharLen,
    DATALENGTH(UTF8String) AS UTF8Attempt1,
    DATALENGTH(UTF8String) / 2 AS UTF8Attempt2
    FROM dbo.TestTable;

    The varchar and nvarchar columns return the "correct"  lengths for their values, however, not for the UTF Column. This is expected and in a UTF8 column, the data length of the character can vary. In this case the ß character has a datalength of 2, and the ?? a length of 3.

    The method I instead used was the expression for UTF8Length, however, the emoji does return a different value, due to that it's counted as 2 characters in  both an nvarchar and varchar but 1 for a UTF8. Hosnetly, that's expected as the emoji's value doesn't fit in a 2 byte character.

    SELECT UTF8String,
    LEN(UTF8String) AS Length,
    DATALENGTH(UTF8String) AS Datalength,
    LEN(TRIM(UTF8String)) + (DATALENGTH(UTF8String) - DATALENGTH(TRIM(UTF8String))) AS UTF8Length
    FROM dbo.TestTable;

    It's a bit more long winded, however than normal methods, and wondered if I'm missing something here. I *suppose* you could change the collation of the column DATALENGTH(UTF8String COLLATE Latin1_General_100_CI_AS) but I kind of wanted to avoid doing that (don't know why).

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • ... sigh... The ?? in the above is not actually ?? it's a smiley face emoji; seems that SSC stores posts in a varchar column and the value is lost. Interesting...

    Here's a non-malformed DB<>Fiddle for reference.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • .

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Steve Collins wrote:

    Have I stopped making sense?  Really?

    I'm not sure of your point here.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • .

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Steve Collins wrote:

    Thom A wrote:

    Steve Collins wrote:

    Have I stopped making sense?  Really?

    I'm not sure of your point here.

    My point is that those 2 business partners are running a complete and utter all out fraud that's harmful to the community.  My (in)sanity is not the point.

    What 2 business partners? SQL Server and UTF8?

    I suppose, to answer your original comment "Have I stopped making sense?  Really?" the answer is "Yes, you have stopped making sense." as you aren't making any sense here.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • .

    Aus dem Paradies, das Cantor uns geschaffen, soll uns niemand vertreiben können

  • Thom A wrote:

    Steve Collins wrote:

    Thom A wrote:

    Steve Collins wrote:

    Have I stopped making sense?  Really?

    I'm not sure of your point here.

    My point is that those 2 business partners are running a complete and utter all out fraud that's harmful to the community.  My (in)sanity is not the point.

    What 2 business partners? SQL Server and UTF8?

    I suppose, to answer your original comment "Have I stopped making sense?  Really?" the answer is "Yes, you have stopped making sense." as you aren't making any sense here.

     

    Yeah, this entire stream of messages is utterly confounding, as making much sense of Mr. Collins' comments isn't possible without assuming that an entirely deleted post or two or three, simply no longer exist in this thread.   Even his final response of "Fine..." etc, includes an italicized word "here" at the end, which also draws the eye in and yet offers no particular reason to suggest why it was italicized, without once again assuming at least two posts got deleted in their entirety, and perhaps due to malicious links?

    As Thom asks, I must also ask:  What two business partners?   Even assuming SQL Server and UTF-8 is making a rather sizable leap...  but I certainly have to commend him for at least being able to think enough to offer up that suggestion.  It's a lot more than I came up with...   Anyone?   Anything?   Beuhler ?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I wonder if Steve Collins got "lost" and thought they were in a different thread? Either way, it's completely derailed my original question. 🙁

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A wrote:

    I wonder if Steve Collins got "lost" and thought they were in a different thread? Either way, it's completely derailed my original question. 🙁

    Yep, afraid so...   Anyway, it was certainly interesting to hear about SQL 2019 supporting UTF-8, which I didn't even know was something different yet again from NVARCHAR().   Given the "strangeness" of that, I'm pretty sure I would NOT want to use it with any application that required knowledge of the length of any given UTF-8 character string, if for no other reason than having a user type of person needing to understand what "string length" even means with that kind of string data type.   Sounds like just too darned much trouble to mess with to me...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Is your db at compatibility 150 level?  I know it almost certainly is, but I'm trying to think of anything that might cause DATALENGTH() to not work correctly (and I don't have SQL 2019 installed yet to try it myself).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    Is your db at compatibility 150 level?  I know it almost certainly is, but I'm trying to think of anything that might cause DATALENGTH() to not work correctly (and I don't have SQL 2019 installed yet to try it myself).

    DATALENGTH will work correctly every time; it will return the length of the data of the string. The important note, however, is that for UTF8 each character can have a different value for it's data length. Take the below example, where each cahracter has a different value for DATALENGTH:

    CREATE TABLE dbo.TestTable (UTF8Char varchar(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8);

    INSERT INTO dbo.TestTable (UTF8Char)
    VALUES (N'i'),(N'ß'),(N'?'),(N'');

    SELECT UTF8Char, DATALENGTH(UTF8Char)
    FROM dbo.TestTable;
    GO

    DROP TABLE dbo.TestTable;

    DB<>Fiddle (DB Fiddle, again, because SSC is corrupting the values by using a varchar not an nvarchar; do not trust the SQL above on it's own...)

    • This reply was modified 3 years, 6 months ago by  Thom A.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • DATALENGTH() is supposed to return the byte length of the entire value.  Is is not doing that?  MS's doc seem to imply that it should (then again, we've all seen MS docs be way off before).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher wrote:

    DATALENGTH() is supposed to return the byte length of the entire value.  Is is not doing that?  MS's doc seem to imply that it should (then again, we've all seen MS docs be way off before).

    And that is what it's doing, I haven't disputed that. The question is about the length (including trailing spaces), not the datalength; length and datalength are 2 different things.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • I admit I'm confused.

    LEN() has always not counted trailing spaces.

    When I need to count trailing spaces, I usually just add a non-space and then subtract 1:

    SELECT LEN('ABC' + SPACE(3)), LEN('ABC' + SPACE(3) + '.') - 1

     

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply