What Does the Undocumented UNCOMPRESS Function Do?


(last updated: 2019-03-18 @ 14:55 EST / 2019-03-18 @ 18:55 UTC)

IntelliSense in SQL Server Management Studio (SSMS) can be quite helpful. It can save time typing by presenting possible object or function names. And, it can even show you the signatures (i.e. input and output parameters / return types) of stored procedures and functions.

Another thing that IntelliSense does, even though it was certainly not intentional, is show you the names of (some) undocumented functions. Many of these undocumented functions won't let you execute them, so not much to investigate there [1]. But, several of these undocumented built-in functions can be executed. One of them is UNCOMPRESS. This function is not to be confused with DECOMPRESS, the companion to COMPRESS, which are actually GUnzip and GZip, respectively (and were introduced in SQL Server 2016, if you haven't seen them before).
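
For reference, a quick round-trip with that documented pair (SQL Server 2016 and newer) looks like this; DECOMPRESS returns VARBINARY(MAX), so the result has to be converted back to the original string type:

SELECT CONVERT(NVARCHAR(MAX), DECOMPRESS(COMPRESS(N'Hello, World!')))
           AS [GZipRoundTrip];
-- Hello, World!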

No, UNCOMPRESS does something entirely different, but the only documentation is the function signature that appears in SSMS:

built-in function UNCOMPRESS(Param1 varbinary(1)) RETURNS nvarchar(1)

I did a bit of searching around, and all I could find were a couple of references to using it on the ctext column in sys.syscomments, but only if the status column (a bit-masked value) had the "2" bit set (i.e. status & 2 = 2). The first two entries in the "Mentions of UNCOMPRESS()" section at the end of this post are books that contain this same info. With only that one clue to go on, I used the following query to find some data that was meant to be passed into this function:

SELECT *
FROM   sys.syscomments
WHERE  [compressed] = 1
OR     [status] & 2 = 2;

I tried in [master], [msdb], and even [MSSQLSystemResource] (in single-user mode) but no rows were ever returned (I tested in SQL Server 2017). Between being used for a deprecated (as of SQL Server 2005) system compatibility view, and that view not even returning rows that would make use of this function, it seems safe to conclude that this function is obsolete in addition to being undocumented.

Why Document?

Why document an undocumented, unsupported, and possibly obsolete function? Well:

  1. because it was a puzzle to solve, and
  2. because it's undocumented, we don't have much (or any) info on it; AND because it shows up in SSMS IntelliSense, people can find it; AND because it can be executed, people might attempt to use it in their code. Therefore, it's important to understand how it works and why we shouldn't use it (aside from it being "undocumented", which means unsupported; that alone is enough to convince some folks, but not everyone).

What Now?

With no clear indications of what the UNCOMPRESS function does, we can at least pass in some simple values to see what comes back, and see if we can make sense of the output. For the following tests, please keep in mind that "8-bit" refers to the VARCHAR, CHAR, and TEXT (deprecated) datatypes. And, "16-bit" refers to the NVARCHAR, NCHAR, NTEXT (deprecated), and XML datatypes.

Single Character Tests

We will start with the easiest test by passing in one character.

The first query passes a VARCHAR upper-case "A" (having a value of 0x41) into UNCOMPRESS, and gets back the same character, but with an extra byte of 0x00 added on. This should make sense since this function returns NVARCHAR, which is UTF-16 (characters are either 2 bytes or 4 bytes). The Unicode Code Point is actually U+0041, but SQL Server / Windows / .NET use Little Endian, so the bytes are in reverse order, hence 0x4100 [2]. At this point, the UNCOMPRESS function is doing just what the CONVERT function does, so it seems a little redundant.

The second query passes an NVARCHAR upper-case "B" (having a value of 0x4200) into UNCOMPRESS, and gets back the same character, but this time with an extra two bytes of 0x00 added on. This is not what the CONVERT function does!

-- 8-bit test 1:
SELECT '8-bit' AS [Encoding],
       CONVERT(VARBINARY(10), 'A') AS [Hex"A"],
       UNCOMPRESS(CONVERT(VARBINARY(10), 'A')) AS [Uncompressed"A"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'A')))
           AS [HexUncompressed"A"];
/*
Encoding    Hex"A"    Uncompressed"A"    HexUncompressed"A"
8-bit       0x41      A                  0x4100
*/

-- 16-bit test 1:
SELECT '16-bit' AS [Encoding],
       CONVERT(VARBINARY(10), N'B') AS [Hex"B"],
       UNCOMPRESS(CONVERT(VARBINARY(10), N'B')) AS [Uncompressed"B"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'B')))
           AS [HexUncompressed"B"],
       CONVERT(VARBINARY(10), CONVERT(NVARCHAR(5), CONVERT(VARBINARY(10),
           N'B'))) AS [HexConverted"B"];
/*
Encoding   Hex"B"   Uncompressed"B"   HexUncompressed"B"   HexConverted"B"
16-bit     0x4200   B                 0x42000000           0x4200
*/

Two Character Tests

The next set of tests will pass in two characters.

The first query again shows that each VARCHAR byte comes back as the same character, but in the expected UTF-16 LE (Little Endian) encoding.

The second query, passing in two NVARCHAR characters, helps to clarify two things:

  1. Every single byte going into UNCOMPRESS comes back as UTF-16 LE (with the extra 0x00 byte added on). Hence, passing in a character that is already in UTF-16 LE encoding (e.g. “D” being the two bytes 0x44 and 0x00), will have each of its two bytes converted into UTF-16 LE, leaving us with 0x4400 and 0x0000, or 0x44000000 (as you can see in the “HexUncompressed"DD"” field).
  2. Even though strings can store character "0" (i.e. U+0000, (null), \0, 0x00, CHAR(0), NCHAR(0), \x00, and so on), it cannot be displayed, nor can any character after it be displayed (as you can see in the Uncompressed"DD" field). This is most likely due to character "0" being the string terminator in many languages (a stand-alone check follows the test queries below).

-- 8-bit test 2:
SELECT '8-bit' AS [Encoding],
       CONVERT(VARBINARY(10), 'CC') AS [Hex"CC"],
       UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')) AS [Uncompressed"CC"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), 'CC')))
           AS [HexUncompressed"CC"];
/*
Encoding    Hex"CC"    Uncompressed"CC"    HexUncompressed"CC"
8-bit       0x4343     CC                  0x43004300
*/

-- 16-bit test 2:
SELECT '16-bit' AS [Encoding],
       CONVERT(VARBINARY(10), N'DD') AS [Hex"DD"],
       UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')) AS [Uncompressed"DD"],
       CONVERT(VARBINARY(10), UNCOMPRESS(CONVERT(VARBINARY(10), N'DD')))
           AS [HexUncompressed"DD"];
/*
Encoding    Hex"DD"       Uncompressed"DD"    HexUncompressed"DD"
16-bit      0x44004400    D                   0x4400000044000000
*/
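
To isolate the "character 0" display behavior noted in point 2 above, here is a minimal stand-alone check (exactly how the truncated display shows up can vary by client tool; DATALENGTH at least confirms that all of the bytes really are there):

-- character "0" display test:
SELECT N'a' + NCHAR(0) + N'b' AS [EmbeddedNul],
       DATALENGTH(N'a' + NCHAR(0) + N'b') AS [ByteCount];
/*
EmbeddedNul    ByteCount
a              6
*/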

Confirmation

Just to make sure that UNCOMPRESS only works with single-byte characters, let’s pass in something besides two “C”s or two “D”s: Unicode Code Point U+4234 (i.e. 䈴).

SELECT NCHAR(0x4234),      -- 䈴
       UNCOMPRESS(0x4234), -- B4
       NCHAR(0x42),        -- B
       NCHAR(0x34);        -- 4

 

Which Bytes Are Converted to UTF-16?

Now that we know that we are dealing with single-byte characters, which single-byte characters specifically are they? Are they VARCHAR characters of various code pages? Are they VARCHAR characters from one particular code page? Are they NVARCHAR / UTF-16 characters in the U+0000 through U+00FF range that all have a trailing byte of 0x00? Something else perhaps?

Given that characters with values in the range of 0 – 127 (decimal) / 0x00 – 0x7F (hex) are the same across all code pages / encodings that can be represented in SQL Server, only testing with those (i.e. US English, digits 0 – 9, and some punctuation) often hides / obscures important functional differences. So, we need to test values 128 – 255 / 0x80 – 0xFF across several different code pages / encodings.

Create and Populate Table

The following queries will set up the test data that we need to see (or at least confirm) what is actually happening. Code page 1252 is Latin1 (we are looking at this because it's used in several collations: anything with "Latin1_General" in the name, French, etc.), and code page 1255 is Hebrew (which is distinctly different from 1252, so it will be easy to see differences). Finally, UTF-16 is the encoding used by NVARCHAR data. For each row, we are inserting a single byte in the range of 0x00 – 0xFF into each column. We can then easily compare the resulting character of each byte with the output of UNCOMPRESS.

-- DROP TABLE #Uncompress;
CREATE TABLE #Uncompress
(
    [ValueHex] BINARY(1) NOT NULL PRIMARY KEY,
    [Value] AS (CONVERT(TINYINT, [ValueHex])),
    [CP1252] VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC,
    [CP1255] VARCHAR(10) COLLATE Hebrew_100_CI_AS_SC,
    [UTF16] NVARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC
);
SET NOCOUNT ON;
;WITH cte AS
(
    SELECT TOP (256) CONVERT(BINARY(1), ROW_NUMBER() OVER
                              (ORDER BY (SELECT 0)) - 1) AS [num]
    FROM   master.sys.all_columns
)
INSERT INTO #Uncompress ([ValueHex], [CP1252], [CP1255], [UTF16])
    SELECT num, num, num, num
    FROM   cte;

View the Evidence

The query below will show us the character that each byte represents in each of the three encodings. It also feeds that same byte to the UNCOMPRESS function, and shows the underlying byte representation of each character after that byte is stored in the NVARCHAR column and passed into the UNCOMPRESS function. And, because the characters for each byte in the range of 0x00 – 0x7F are the same across the encodings, the query only returns the 0x80 – 0xFF range (you can easily comment out the WHERE clause to see the boring 0x00 – 0x7F range).

There are additional notes below the query.

SELECT chr.[Value],
       chr.[ValueHex],
       chr.[CP1252],
       CONVERT(BINARY(2), UNICODE(chr.[CP1252])) AS [CP1252],
       chr.[CP1255],
       CONVERT(BINARY(2), UNICODE(chr.[CP1255])) AS [CP1255],
       chr.[UTF16],
       CONVERT(BINARY(2), UNICODE(chr.[UTF16])) AS [UTF16],
       UNCOMPRESS(chr.[ValueHex]) AS [UNCOMPRESS],
       CONVERT(BINARY(2), chr.[UTF16]) AS [UTF16bytes],
       CONVERT(BINARY(2), UNCOMPRESS(chr.[ValueHex])) AS [UNCOMPRESSbytes]
FROM   #Uncompress chr
WHERE  chr.[Value] > 127
ORDER BY chr.[Value];

In the table shown below, please note:

  • The “CP 1252”, “CP 1255”, and “UTF-16” columns are in the form of “Character (Unicode Code Point)”.
  • The right-most two columns — “UTF-16 bytes” and “UNCOMPRESS bytes” — show the actual bytes as they exist in the system, in Little Endian. So, Code Point U+0081 would be represented by the bytes 0x8100.
  • The Unicode Code Point is shown for each character so it will be easier to see which characters are exactly the same as the “UNCOMPRESS” column, especially for the columns that have non-printable characters.
  • Code Points are shown instead of underlying byte value since the underlying byte value is already shown in the two left-most columns. Meaning, the byte 0xE0 is the character "à" on code page 1252, while that same byte is the character "א" on code page 1255.

Value (Hex) | CP 1252 | CP 1255 | UTF-16 | UNCOMPRESS | UTF-16 bytes | UNCOMPRESS bytes
128 (0x80) | € (20AC) | € (20AC) | (0080) | | 8000 | 8000
129 (0x81) | (0081) | (0081) | (0081) | | 8100 | 8100
130 (0x82) | ‚ (201A) | ‚ (201A) | (0082) | | 8200 | 8200
131 (0x83) | ƒ (0192) | ƒ (0192) | (0083) | | 8300 | 8300
132 (0x84) | „ (201E) | „ (201E) | (0084) | | 8400 | 8400
133 (0x85) | … (2026) | … (2026) | (0085) | | 8500 | 8500
134 (0x86) | † (2020) | † (2020) | (0086) | | 8600 | 8600
135 (0x87) | ‡ (2021) | ‡ (2021) | (0087) | | 8700 | 8700
136 (0x88) | ˆ (02C6) | ˆ (02C6) | (0088) | | 8800 | 8800
137 (0x89) | ‰ (2030) | ‰ (2030) | (0089) | | 8900 | 8900
138 (0x8A) | Š (0160) | (008A) | (008A) | | 8A00 | 8A00
139 (0x8B) | ‹ (2039) | ‹ (2039) | (008B) | | 8B00 | 8B00
140 (0x8C) | Œ (0152) | (008C) | (008C) | | 8C00 | 8C00
141 (0x8D) | (008D) | (008D) | (008D) | | 8D00 | 8D00
142 (0x8E) | Ž (017D) | (008E) | (008E) | | 8E00 | 8E00
143 (0x8F) | (008F) | (008F) | (008F) | | 8F00 | 8F00
144 (0x90) | (0090) | (0090) | (0090) | | 9000 | 9000
145 (0x91) | ‘ (2018) | ‘ (2018) | (0091) | | 9100 | 9100
146 (0x92) | ’ (2019) | ’ (2019) | (0092) | | 9200 | 9200
147 (0x93) | “ (201C) | “ (201C) | (0093) | | 9300 | 9300
148 (0x94) | ” (201D) | ” (201D) | (0094) | | 9400 | 9400
149 (0x95) | • (2022) | • (2022) | (0095) | | 9500 | 9500
150 (0x96) | – (2013) | – (2013) | (0096) | | 9600 | 9600
151 (0x97) | — (2014) | — (2014) | (0097) | | 9700 | 9700
152 (0x98) | ˜ (02DC) | ˜ (02DC) | (0098) | | 9800 | 9800
153 (0x99) | ™ (2122) | ™ (2122) | (0099) | | 9900 | 9900
154 (0x9A) | š (0161) | (009A) | (009A) | | 9A00 | 9A00
155 (0x9B) | › (203A) | › (203A) | (009B) | | 9B00 | 9B00
156 (0x9C) | œ (0153) | (009C) | (009C) | | 9C00 | 9C00
157 (0x9D) | (009D) | (009D) | (009D) | | 9D00 | 9D00
158 (0x9E) | ž (017E) | (009E) | (009E) | | 9E00 | 9E00
159 (0x9F) | Ÿ (0178) | (009F) | (009F) | | 9F00 | 9F00
160 (0xA0) | (00A0) | (00A0) | (00A0) | | A000 | A000
161 (0xA1) | ¡ (00A1) | ¡ (00A1) | ¡ (00A1) | ¡ | A100 | A100
162 (0xA2) | ¢ (00A2) | ¢ (00A2) | ¢ (00A2) | ¢ | A200 | A200
163 (0xA3) | £ (00A3) | £ (00A3) | £ (00A3) | £ | A300 | A300
164 (0xA4) | ¤ (00A4) | ₪ (20AA) | ¤ (00A4) | ¤ | A400 | A400
165 (0xA5) | ¥ (00A5) | ¥ (00A5) | ¥ (00A5) | ¥ | A500 | A500
166 (0xA6) | ¦ (00A6) | ¦ (00A6) | ¦ (00A6) | ¦ | A600 | A600
167 (0xA7) | § (00A7) | § (00A7) | § (00A7) | § | A700 | A700
168 (0xA8) | ¨ (00A8) | ¨ (00A8) | ¨ (00A8) | ¨ | A800 | A800
169 (0xA9) | © (00A9) | © (00A9) | © (00A9) | © | A900 | A900
170 (0xAA) | ª (00AA) | × (00D7) | ª (00AA) | ª | AA00 | AA00
171 (0xAB) | « (00AB) | « (00AB) | « (00AB) | « | AB00 | AB00
172 (0xAC) | ¬ (00AC) | ¬ (00AC) | ¬ (00AC) | ¬ | AC00 | AC00
173 (0xAD) | ­ (00AD) | ­ (00AD) | ­ (00AD) | ­ | AD00 | AD00
174 (0xAE) | ® (00AE) | ® (00AE) | ® (00AE) | ® | AE00 | AE00
175 (0xAF) | ¯ (00AF) | ¯ (00AF) | ¯ (00AF) | ¯ | AF00 | AF00
176 (0xB0) | ° (00B0) | ° (00B0) | ° (00B0) | ° | B000 | B000
177 (0xB1) | ± (00B1) | ± (00B1) | ± (00B1) | ± | B100 | B100
178 (0xB2) | ² (00B2) | ² (00B2) | ² (00B2) | ² | B200 | B200
179 (0xB3) | ³ (00B3) | ³ (00B3) | ³ (00B3) | ³ | B300 | B300
180 (0xB4) | ´ (00B4) | ´ (00B4) | ´ (00B4) | ´ | B400 | B400
181 (0xB5) | µ (00B5) | µ (00B5) | µ (00B5) | µ | B500 | B500
182 (0xB6) | ¶ (00B6) | ¶ (00B6) | ¶ (00B6) | ¶ | B600 | B600
183 (0xB7) | · (00B7) | · (00B7) | · (00B7) | · | B700 | B700
184 (0xB8) | ¸ (00B8) | ¸ (00B8) | ¸ (00B8) | ¸ | B800 | B800
185 (0xB9) | ¹ (00B9) | ¹ (00B9) | ¹ (00B9) | ¹ | B900 | B900
186 (0xBA) | º (00BA) | ÷ (00F7) | º (00BA) | º | BA00 | BA00
187 (0xBB) | » (00BB) | » (00BB) | » (00BB) | » | BB00 | BB00
188 (0xBC) | ¼ (00BC) | ¼ (00BC) | ¼ (00BC) | ¼ | BC00 | BC00
189 (0xBD) | ½ (00BD) | ½ (00BD) | ½ (00BD) | ½ | BD00 | BD00
190 (0xBE) | ¾ (00BE) | ¾ (00BE) | ¾ (00BE) | ¾ | BE00 | BE00
191 (0xBF) | ¿ (00BF) | ¿ (00BF) | ¿ (00BF) | ¿ | BF00 | BF00
192 (0xC0) | À (00C0) | ְ (05B0) | À (00C0) | À | C000 | C000
193 (0xC1) | Á (00C1) | ֱ (05B1) | Á (00C1) | Á | C100 | C100
194 (0xC2) | Â (00C2) | ֲ (05B2) | Â (00C2) | Â | C200 | C200
195 (0xC3) | Ã (00C3) | ֳ (05B3) | Ã (00C3) | Ã | C300 | C300
196 (0xC4) | Ä (00C4) | ִ (05B4) | Ä (00C4) | Ä | C400 | C400
197 (0xC5) | Å (00C5) | ֵ (05B5) | Å (00C5) | Å | C500 | C500
198 (0xC6) | Æ (00C6) | ֶ (05B6) | Æ (00C6) | Æ | C600 | C600
199 (0xC7) | Ç (00C7) | ַ (05B7) | Ç (00C7) | Ç | C700 | C700
200 (0xC8) | È (00C8) | ָ (05B8) | È (00C8) | È | C800 | C800
201 (0xC9) | É (00C9) | ֹ (05B9) | É (00C9) | É | C900 | C900
202 (0xCA) | Ê (00CA) | ֺ (05BA) | Ê (00CA) | Ê | CA00 | CA00
203 (0xCB) | Ë (00CB) | ֻ (05BB) | Ë (00CB) | Ë | CB00 | CB00
204 (0xCC) | Ì (00CC) | ּ (05BC) | Ì (00CC) | Ì | CC00 | CC00
205 (0xCD) | Í (00CD) | ֽ (05BD) | Í (00CD) | Í | CD00 | CD00
206 (0xCE) | Î (00CE) | ־ (05BE) | Î (00CE) | Î | CE00 | CE00
207 (0xCF) | Ï (00CF) | ֿ (05BF) | Ï (00CF) | Ï | CF00 | CF00
208 (0xD0) | Ð (00D0) | ׀ (05C0) | Ð (00D0) | Ð | D000 | D000
209 (0xD1) | Ñ (00D1) | ׁ (05C1) | Ñ (00D1) | Ñ | D100 | D100
210 (0xD2) | Ò (00D2) | ׂ (05C2) | Ò (00D2) | Ò | D200 | D200
211 (0xD3) | Ó (00D3) | ׃ (05C3) | Ó (00D3) | Ó | D300 | D300
212 (0xD4) | Ô (00D4) | װ (05F0) | Ô (00D4) | Ô | D400 | D400
213 (0xD5) | Õ (00D5) | ױ (05F1) | Õ (00D5) | Õ | D500 | D500
214 (0xD6) | Ö (00D6) | ײ (05F2) | Ö (00D6) | Ö | D600 | D600
215 (0xD7) | × (00D7) | ׳ (05F3) | × (00D7) | × | D700 | D700
216 (0xD8) | Ø (00D8) | ״ (05F4) | Ø (00D8) | Ø | D800 | D800
217 (0xD9) | Ù (00D9) | (F88D) | Ù (00D9) | Ù | D900 | D900
218 (0xDA) | Ú (00DA) | (F88E) | Ú (00DA) | Ú | DA00 | DA00
219 (0xDB) | Û (00DB) | (F88F) | Û (00DB) | Û | DB00 | DB00
220 (0xDC) | Ü (00DC) | (F890) | Ü (00DC) | Ü | DC00 | DC00
221 (0xDD) | Ý (00DD) | (F891) | Ý (00DD) | Ý | DD00 | DD00
222 (0xDE) | Þ (00DE) | (F892) | Þ (00DE) | Þ | DE00 | DE00
223 (0xDF) | ß (00DF) | (F893) | ß (00DF) | ß | DF00 | DF00
224 (0xE0) | à (00E0) | א (05D0) | à (00E0) | à | E000 | E000
225 (0xE1) | á (00E1) | ב (05D1) | á (00E1) | á | E100 | E100
226 (0xE2) | â (00E2) | ג (05D2) | â (00E2) | â | E200 | E200
227 (0xE3) | ã (00E3) | ד (05D3) | ã (00E3) | ã | E300 | E300
228 (0xE4) | ä (00E4) | ה (05D4) | ä (00E4) | ä | E400 | E400
229 (0xE5) | å (00E5) | ו (05D5) | å (00E5) | å | E500 | E500
230 (0xE6) | æ (00E6) | ז (05D6) | æ (00E6) | æ | E600 | E600
231 (0xE7) | ç (00E7) | ח (05D7) | ç (00E7) | ç | E700 | E700
232 (0xE8) | è (00E8) | ט (05D8) | è (00E8) | è | E800 | E800
233 (0xE9) | é (00E9) | י (05D9) | é (00E9) | é | E900 | E900
234 (0xEA) | ê (00EA) | ך (05DA) | ê (00EA) | ê | EA00 | EA00
235 (0xEB) | ë (00EB) | כ (05DB) | ë (00EB) | ë | EB00 | EB00
236 (0xEC) | ì (00EC) | ל (05DC) | ì (00EC) | ì | EC00 | EC00
237 (0xED) | í (00ED) | ם (05DD) | í (00ED) | í | ED00 | ED00
238 (0xEE) | î (00EE) | מ (05DE) | î (00EE) | î | EE00 | EE00
239 (0xEF) | ï (00EF) | ן (05DF) | ï (00EF) | ï | EF00 | EF00
240 (0xF0) | ð (00F0) | נ (05E0) | ð (00F0) | ð | F000 | F000
241 (0xF1) | ñ (00F1) | ס (05E1) | ñ (00F1) | ñ | F100 | F100
242 (0xF2) | ò (00F2) | ע (05E2) | ò (00F2) | ò | F200 | F200
243 (0xF3) | ó (00F3) | ף (05E3) | ó (00F3) | ó | F300 | F300
244 (0xF4) | ô (00F4) | פ (05E4) | ô (00F4) | ô | F400 | F400
245 (0xF5) | õ (00F5) | ץ (05E5) | õ (00F5) | õ | F500 | F500
246 (0xF6) | ö (00F6) | צ (05E6) | ö (00F6) | ö | F600 | F600
247 (0xF7) | ÷ (00F7) | ק (05E7) | ÷ (00F7) | ÷ | F700 | F700
248 (0xF8) | ø (00F8) | ר (05E8) | ø (00F8) | ø | F800 | F800
249 (0xF9) | ù (00F9) | ש (05E9) | ù (00F9) | ù | F900 | F900
250 (0xFA) | ú (00FA) | ת (05EA) | ú (00FA) | ú | FA00 | FA00
251 (0xFB) | û (00FB) | (F894) | û (00FB) | û | FB00 | FB00
252 (0xFC) | ü (00FC) | (F895) | ü (00FC) | ü | FC00 | FC00
253 (0xFD) | ý (00FD) | (200E) | ý (00FD) | ý | FD00 | FD00
254 (0xFE) | þ (00FE) | (200F) | þ (00FE) | þ | FE00 | FE00
255 (0xFF) | ÿ (00FF) | (F896) | ÿ (00FF) | ÿ | FF00 | FF00
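
And, to spot-check a single row of the table above without the temp table, the byte 0xE0 can be tagged with each collation directly (using the same two collations as the test table):

SELECT CONVERT(NVARCHAR(1),
           CONVERT(VARCHAR(1), 0xE0) COLLATE Latin1_General_100_CI_AS_SC)
           AS [CP1252],
       CONVERT(NVARCHAR(1),
           CONVERT(VARCHAR(1), 0xE0) COLLATE Hebrew_100_CI_AS_SC)
           AS [CP1255],
       UNCOMPRESS(0xE0) AS [UNCOMPRESS];
/*
CP1252    CP1255    UNCOMPRESS
à         א         à
*/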

 

Do Large Values / LOBs Work?

We now know what this function does to the bytes that are passed in. But how many bytes are allowed to be passed in? Some functions do not work with the MAX types. Could this be one of them? That should be easy to find out:

DECLARE @LargeValue VARBINARY(8000);
SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4000));
SELECT DATALENGTH(@LargeValue) AS [InputSize],
       DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize];
-- InputSize    OutputSize
--      4000          8000
SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4001));
SELECT DATALENGTH(@LargeValue) AS [InputSize],
       DATALENGTH(UNCOMPRESS(@LargeValue)) AS [OutputSize];
-- InputSize    OutputSize
--      4001          NULL

Nope. While the input type is not a MAX type, the output type would need to be NVARCHAR(MAX) in order to hold the 8002 bytes taken up by the 4001 characters.
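
If you truly needed to feed more than 4,000 bytes into this function, you could (in theory) chunk the input and concatenate the pieces, as in the following sketch. This is purely illustrative and does not change the recommendation at the end of this post:

-- chunking sketch (illustrative only):
DECLARE @LargeValue VARBINARY(8000);
SET @LargeValue = CONVERT(VARBINARY(8000), REPLICATE('a', 4001));
SELECT DATALENGTH(
           CONVERT(NVARCHAR(MAX), UNCOMPRESS(SUBSTRING(@LargeValue, 1, 4000)))
           + CONVERT(NVARCHAR(MAX), UNCOMPRESS(SUBSTRING(@LargeValue, 4001, 4000)))
       ) AS [OutputSize];
-- OutputSize
--       8002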

 

Microsoft Access 2000 “Unicode Compression”

Mitch Schroeter suggested to me that perhaps the UNCOMPRESS function was intended to work on data coming directly from Access 2000 (or newer) and compressed via the WITH COMPRESSION option of the CREATE TABLE statement. The documentation for CREATE TABLE (for Microsoft Access, not SQL Server) states the following towards the end of the Remarks section:

The WITH COMPRESSION attribute was added for CHARACTER columns because of the change to the Unicode character representation format. Unicode characters uniformly require two bytes for each character. For existing Microsoft Jet databases that contain predominately character data, this could mean that the database file would nearly double in size when converted to the Microsoft Access database engine format. However, Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte. If you define a CHARACTER column with this attribute, data will automatically be compressed as it is stored and uncompressed when retrieved from the column.

While this does sound similar, it is not the exact same compression that the UNCOMPRESS function expects. There is some overlap in the behavior, but the UNCOMPRESS function is more simplistic than Access’s “Unicode Compression” (that term is in quotes because it is not true Unicode Compression). If Access was doing nothing more than removing the “0x00” bytes, then there would be no way to determine when to add them back in upon uncompressing; very few of the 65,536 two-byte code points have 0x00 bytes, so any algorithm will need to deal with non-compressible code points. There needs to be an indicator of some sort to tell the parser when a byte should be prefixed with a 0x00, appended with 0x00, or left alone. For example, if it encounters two bytes — 0xD5E2 — should the next two bytes of output be: 0x00D5, 0xD500, or 0xD5E2? We need more info to figure this out.

I found a technical description of the algorithm used to do the Microsoft Access style “Unicode Compression” and the resulting format. It’s described in this ticket for a Java-based MS Access library:

https://sourceforge.net/p/jackcess/bugs/111/

But that code at the top is from another library, libmdb, whereas the ticket is for Jackcess. The Jackcess implementation can be found here:

https://sourceforge.net/p/jackcess/code/HEAD/tree/jackcess/trunk/src/main/java/com/healthmarketscience/jackcess/impl/ColumnImpl.java#l1483

Between those two algorithms, it is clear that the 0x00 byte is used to toggle compression on and off. Here are the relevant lines of code in Jackcess (the toggling is handled by the 0x00 check and the inCompressedMode = !inCompressedMode; line):

      boolean inCompressedMode = true;
      while(dataEnd < data.length) {
        if(data[dataEnd] == (byte)0x00) {
          // handle current segment
          decodeTextSegment(data, dataStart, dataEnd, inCompressedMode,
                            textBuf);
          inCompressedMode = !inCompressedMode;
          ++dataEnd;
          dataStart = dataEnd;
        } else {
          ++dataEnd;
        }
      } // end of the while loop (the full source then decodes the final segment)

That is the indicator I was referring to. Knowing how the algorithm works, we can create a test to see if UNCOMPRESS can handle that format:

SELECT UNCOMPRESS(0x31323334616263313233);
-- 1234abc123
SELECT UNCOMPRESS(0x313233340061626300313233);
-- 1234
SELECT UNCOMPRESS(0xFFFE313233340061626300313233);
-- ÿþ1234

As you can see, it cannot. With or without the "is the string compressed or not" indicator set (the 0xFFFE as the first two bytes in the third test, which is how Access knows if the data is compressed or not, and just happens to be the Byte Order Mark for UTF-16 LE), it interprets the 0x00 byte as the "null" string terminator (\0 in many languages) instead of turning off the compression and interpreting the next 2 bytes as a single UTF-16 code unit.
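
For completeness, here is a rough T-SQL sketch of that toggle algorithm (my own illustrative code, not anything built in to SQL Server; it skips the 0xFFFE header check and assumes well-formed input). Fittingly, UNCOMPRESS itself can decode the compressed segments, since they are exactly the "add a 0x00 byte to each byte" format:

-- illustrative decoder for the Access-style format (NOT built in to SQL Server):
DECLARE @Input VARBINARY(100) = 0x31323334007401003536,
        -- "1234" (compressed) + 0x00 + Ŵ as UTF-16 LE (0x7401) + 0x00 + "56" (compressed)
        @Output NVARCHAR(4000) = N'',
        @SegStart INT = 1,
        @Pos INT = 1,
        @Len INT,
        @Compressed BIT = 1; -- the data starts out in compressed mode
SET @Len = DATALENGTH(@Input);
WHILE (@Pos <= @Len + 1)
BEGIN
    -- a 0x00 byte (or the end of the input) closes the current segment:
    IF (@Pos > @Len OR SUBSTRING(@Input, @Pos, 1) = 0x00)
    BEGIN
        IF (@Pos > @SegStart)
            SET @Output += CASE @Compressed
                     WHEN 1 THEN UNCOMPRESS(SUBSTRING(@Input, @SegStart, @Pos - @SegStart))
                     ELSE CONVERT(NVARCHAR(4000), SUBSTRING(@Input, @SegStart, @Pos - @SegStart))
                 END;
        SET @Compressed = 1 - @Compressed; -- each 0x00 delimiter toggles the mode
        SET @SegStart = @Pos + 1;
    END;
    SET @Pos += 1;
END;
SELECT @Output AS [Decoded];
-- Decoded
-- 1234Ŵ56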

Side notes about MS Access “Unicode Compression”, not related to UNCOMPRESS

  1. This Access approach to compression uses 0x00 bytes to toggle compression on and off, and the 0x00 bytes have been removed from the original data so this should be a reliable process, right? But what about Unicode code points in which the second byte is 0x00, such as code point U+5000 (i.e. “倀”)? If that 0x00 byte remains, then that would greatly complicate the parsing. But Access can store Unicode characters, so this scenario is being handled in some way. To figure out how, I created a test Access DB, created a table with a text field, and entered the following value (containing U+5000):

     

    Bob***ŴĤŶ倀健伷倀Œ***

    When I viewed the contents of the Access .accdb file, I saw the following bytes stored for that value:

    42006F0062002A002A002A0074012401760100506550374F005050012A002A002A00
    Bob***ŴĤŶ倀健伷倀Œ***

    What’s interesting about that sequence of bytes is that it is not compressed (this field did not start with 0xFFFE). Yet entering the same string without the two U+5000 characters will show up as compressed in the data file (this field did start with 0xFFFE). This means that the MS Access way of handling strings containing code points in which the second byte is 0x00 is to simply not handle them.

  2. The documentation quoted at the beginning of this section mentioned: “Unicode representation of many character sets, those formerly denoted as Single-Byte Character Sets (SBCS), can easily be compressed to a single byte“. This is incorrect. Yes, Unicode does represent many character sets. However, there are not multiple character sets that are being compressed into a single byte by this method of compression. As we saw earlier, in the Which Bytes Are Converted to UTF-16? section, only the first 256 code points can be compressed, and those are the same as the ISO-8859-1 Single Byte Character Set (which again, is roughly the same as the Windows-1252 character set). Yet the Hebrew characters, also from a Single Byte Character Set (Windows-1255), all mapped to code points above U+00FF and cannot be compressed (in this method). So, only a single character set, Latin 1, can be compressed (using this algorithm).
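
That last point is easy to confirm: every Hebrew letter maps to a Code Point well above U+00FF, so it has no single-byte form in this scheme:

SELECT UNICODE(N'א') AS [CodePoint];
-- 1488 (i.e. U+05D0, well above 255 / U+00FF)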

 

Official Unicode Compression

Just to be clear, the compressed format that UNCOMPRESS is working with is, again, a proprietary compression technique that is not in any way related to actual Unicode Compression. The official Unicode compression is described in "Unicode Technical Standard #6: A Standard Compression Scheme for Unicode". I did test with the Example 9.2 data from that Unicode.org page:

DECLARE @Russian VARBINARY(50) = 0x129CBEC1BAB2B0;
SELECT DATALENGTH(@Russian) AS [SourceByteCount],
       DATALENGTH(UNCOMPRESS(@Russian)) AS [UncompressedByteCount],
       CONVERT(VARBINARY(50), UNCOMPRESS(@Russian)) AS [UncompressedHex],
       UNCOMPRESS(@Russian) AS [UncompressedCharacters];
/*
SourceByteCount:            7
UncompressedByteCount:     14
UncompressedHex:           0x12009C00BE00C100BA00B200B000
UncompressedCharacters:    <DC2>œ¾Áº²°
*/

The uncompressed value is supposed to be: Москва

 

Conclusion

Observations

  1. For some bytes, the “CP 1255” value is the same as the “UNCOMPRESS” value.
  2. For most bytes, the "CP 1252" value is the same as the "UNCOMPRESS" value. It's only characters in the 0x80 – 0x9F range that are different. That range is the C1 Control characters (more info: "C1 Controls and Latin-1 Supplement" and "Latin-1 Supplement 0080 – 00FF").
  3. For all bytes, the “UTF-16” value is the same as the “UNCOMPRESS” value.
  4. This only works on the first 256 Unicode Code Points, so it is a very limited, simple form of compression.
  5. This function is limited to returning at most 4000 characters / 8000 bytes.
  6. There is no built-in function to produce the “compressed” binary from valid UTF-16 data.
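
Regarding observation #6: producing that "compressed" binary yourself is simple enough in principle, since it just keeps the low byte of each UTF-16 code unit and fails on anything above U+00FF. Here is a minimal (and slow, character-by-character) sketch of my own, purely for testing:

-- illustrative "re-compress" sketch (no built-in equivalent exists):
DECLARE @Input NVARCHAR(4000) = N'Voilà',
        @Output VARBINARY(8000) = 0x,
        @Pos INT = 1,
        @CodePoint INT;
WHILE (@Pos <= LEN(@Input))
BEGIN
    SET @CodePoint = UNICODE(SUBSTRING(@Input, @Pos, 1));
    IF (@CodePoint > 255) -- only U+0000 through U+00FF fit into a single byte
    BEGIN
        RAISERROR(N'Code Point above U+00FF cannot be represented.', 16, 1);
        BREAK;
    END;
    SET @Output = @Output + CONVERT(BINARY(1), @CodePoint); -- keep only the low byte
    SET @Pos += 1;
END;
SELECT @Output AS [Compressed],
       UNCOMPRESS(@Output) AS [RoundTrip];
-- Compressed      RoundTrip
-- 0x566F696CE0    Voilà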

Lessons Learned

The UNCOMPRESS function does nothing more than add a 0x00 byte to each byte passed in, the result of which is valid UTF-16 Little Endian (i.e. NVARCHAR) data. Given the various limitations of this function, and the fact that you would have to write your own function to "compress" NVARCHAR data into this format (a simple CONVERT won't work unless you can guarantee that none of the characters found in the 0x80 – 0x9F range exist in the input data):

DO NOT USE THIS FUNCTION!!

If you need compression because your data is mostly Code Points in the range of 0 – 255 / U+0000 – U+00FF / 0x00 – 0xFF, then look into Data Compression and Clustered Columnstore Indexes.

Does this function do essentially the same thing as UTF-8? Not really. UTF-8 only has the first 128 characters / Code Points (i.e. 0x00 – 0x7F) as single-byte characters. And if that is all your data is, then you can simply convert to VARCHAR. And while UTF-8 is supported natively starting in SQL Server 2019, there are limited scenarios where you should use UTF-8 (within SQL Server, that is). And, even when you do save space, you will most likely sacrifice performance (to varying degrees). For more info on UTF-8 support in SQL Server, please see:

Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

The one potentially valid use for UNCOMPRESS is if you have ISO-8859-1-encoded data that is already in binary format and is no more than 4000 bytes / characters. This would work due to the first 256 Unicode Code Points being the ISO-8859-1 character set. However, even if you have data that fits this description, it would be better to convert it to Unicode / UTF-16 Little Endian prior to importing it into SQL Server. Or, if the data does not contain any bytes in the range of 0x80 – 0x9F, then just import it into a VARCHAR column that is using any of the various collations associated with code page 1252.
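
That 0x80 – 0x9F caveat is easy to demonstrate: for those bytes, UNCOMPRESS yields C1 Control Code Points, whereas the code page 1252 interpretation yields entirely different characters:

SELECT UNCOMPRESS(0x8C) AS [UNCOMPRESS], -- U+008C (a non-printable C1 Control)
       CONVERT(NVARCHAR(1),
           CONVERT(VARCHAR(1), 0x8C) COLLATE Latin1_General_100_CI_AS_SC)
           AS [CP1252]; -- Œ (U+0152)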

Mentions of UNCOMPRESS()

  1. "The Guru's Guide to SQL Server Stored Procedures, XML, and HTML" by Ken Henderson. Copyright © 2002 by Ken Henderson.
  2. "SQL Server 2000 Stored Procedures Handbook" by Robin Dewson, Louis Davidson, Tony Bain, and Chuck Hawkins. Copyright © 2003 by Apress.
  3. "Encrypt all objects" (SQLTeam forum), posted by sterobhun on 2002-07-24 at 06:29:09.
  4. "Undocumented uncompress() function behaves strangely" (Azure Feedback).
  5. "Undoing compression" (SQL Server Central forum).

[1] Technically, some of the undocumented built-in functions, such as CLOUD_DATABASEPROPERTYEX, that return errors similar to:

    Msg 195, Level 15, State 10, Line XXXXX
    'CLOUD_DATABASEPROPERTYEX' is not a recognized built-in function name.

    can be executed, but only from within a stored procedure marked as being a "system stored procedure". However, system stored procedures are a topic for another day 😉.

[2] Just to be clear: the pattern of Unicode Code Point matching the UTF-16 Big Endian encoding only holds true for the first 65,536 Code Points (U+0000 – U+FFFF).
