Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

The Scary DBA

I have twenty+ years experience in IT. That time was spent in technical support, development and database administration. I work forRed Gate Software as a Product Evangelist. I write articles for publication at SQL Server Central, Simple-Talk, PASS Book Reviews and SQL Server Standard. I have published two books, ”Understanding SQL Server Execution Plans” and “SQL Server 2008 Query Performance Tuning Distilled.” I’m one of the founding officers of the Southern New England SQL Server Users Group and its current president. I also work on part-time, short-term, off-site consulting contracts. In 2009 and 2010 I was awarded as a Microsoft SQL Server MVP. In the past I’ve been called rough, intimidating and scary. To which I usually reply, “Good.” You can contact me through grant -at- scarydba dot kom (unobfuscate as necessary).

Characters


No, I’m not talking about a Dickens novel. I’m talking about the number of characters in a string. I had a painful time recently because of the word “characters.” 

If you take a look at the dynamic management view sys.dm_exec_sql_text you can get the queries that have been run on your system that are still in the cache. It’s a great utility. Better still, you can get specific statements from the code that are actively running through sys.dm_exec_requests or ones that have run through sys.dm_exec_query_stats. To do this is very simple. Each of these DMV’s has a pair of columns, statement_start_offset and statement_end_offset. These columns, and I’m quoting directly from books online measure the “number of character” offset from the beginning of the SQL string and from the end of the SQL string. Using these values you can retrieve an individual statement out of a stored procedure that has multiple statements.

But… Here’s where things get tricky. Try this on your machine:

SELECT SUBSTRING(dest.text, (der.statement_start_offset ) + 1,
(der.statement_end_offset - der.statement_start_offset) + 1)
,LEN(dest.text) AS CharLength,
der.statement_start_offset,
der.statement_end_offset
FROM sys.dm_exec_query_stats AS der
CROSS APPLY sys.dm_exec_sql_text(der.sql_handle) AS dest
WHERE der.statement_end_offset > -1

You might get an error or you might get a bunch of really odd looking statements in the first column, starting part way into TSQL and cutting off after they’re done or before they’re over. It’ll look odd. But what’s the deal? The SUBSTRING function should work. Logically it’s configured correctly. Here’s the problem.

The [text] column in sys.dm_exec_sql_text is of the datatype NVARCHAR(MAX). Unicode. If you look at the length of the text, it’ll tell you exactly how many characters you see in the string that called to your server. But, the statement_start_offset and statement_end_offset are measuring something different. They’re not measuring characters, they’re measuring unicode characters. Try this query instead:

SELECT SUBSTRING(dest.text, (der.statement_start_offset / 2) + 1,
(der.statement_end_offset - der.statement_start_offset) / 2+ 1),
LEN(dest.text) AS CharLength,
DATALENGTH(dest.text) AS DLength,
DATALENGTH(dest.text) / 2 AS HalfDLength,
der.statement_start_offset,
der.statement_end_offset
FROM sys.dm_exec_query_stats AS der
CROSS APPLY sys.dm_exec_sql_text(der.sql_handle) AS dest
WHERE der.statement_end_offset > -1

You can see that the character length is, whatever it’s supposed to be, but the DATALENGTH is twice that much. Unicode, as we all know, includes a byte to identify the character set. That’s included in the character count in statement_start_offset and statement_end_offset.  You need to take that into account when dealing with these “characters.”

Comments

Posted by DonaldW on 25 October 2009

Good article, Grant.  I knew that Unicode takes up two bytes per character but it probably would not have ocurred to me that it would impact a function that uses a character count.  You have saved a lot of people a lot of foundering with this one.  Thanks for the heads up!

Posted by Tao Klerks on 26 October 2009

I think your intent is clear, but would be clearer if you replaced "They’re not measuring characters, they’re measuring unicode characters" with "They’re not measuring characters, they’re measuring BYTES of an NVarChar string".

Basically BOL has it wrong, the offset is in Bytes and not Characters

Posted by Grant Fritchey on 26 October 2009

I did want to put it that way, that it was measuring bytes, but I'm not entirely sure that it's 100% accurate.

Posted by richardd on 26 October 2009

I agree that "measuring bytes" would be clearer than "measuring unicode characters". A unicode character is two bytes, so if they are measuring unicode characters, the first query would work as expected.

Posted by roger.plowman on 26 October 2009

Be careful that "unicode is 2 bytes", because there are cases where it isn't.

Certain unicode characters can be compressed into 1 byte, others can actually take more than 2. This may be an MS convention, and may not apply everywhere. It was done to compress unicode text into a smaller size.

As always, efficiency trumps logic. :)

Posted by vliet on 26 October 2009

Roger, you might get a bit confused by the different ways to store Unicode characters: UCS-2, UCS-4, UTF-8, UTF-16 and UTF-32. Microsoft uses UTF-16 within their SQL Server NCHAR and NVARCHAR values: two bytes represent a code point, and one or two code points represent a character. Many web pages use UTF-8, where a single Unicode character is encoded in one to four bytes. These encodings are not conventions of Microsoft, but a more or less formal standard on encoding Unicode characters.

I agree that the BOL not always states clearly wether bytes, code points or characters are involved. The usual precautions apply: assume nothing, seeing is believing! Working with NCHAR and NVARCHAR values has many benefits (especialy here in Europe) once you get used to it.

Leave a Comment

Please register or log in to leave a comment.