
Stairway to Data, Level 3: Strings


Character Sets

Characters are represented internally in the computer as bits, and there are several different systems for doing this. Almost all schemes use fixed-length bit strings, and the three most common ones in the computer trade are EBCDIC, ASCII and Unicode.

EBCDIC (Extended Binary Coded Decimal Interchange Code) was developed by IBM by expanding the old Hollerith punch card codes. It uses a byte (8 bits) for each character. EBCDIC has been fading out of use in favor of ASCII and Unicode, even within IBM, but there is a lot of legacy data in EBCDIC. It was not the only early character set; there was Fieldata from Control Data Corporation, TTS for typesetting and dozens of others. IBM's dominance pushed them out of all but special-purpose hardware.

ASCII (American Standard Code for Information Interchange) is defined in the ISO 646 standard; it is a 7-bit code, usually stored one character per 8-bit byte. It quickly became the most popular character set in use, gaining dominance thanks to ANSI support and the rise of small computers, and it acquired extensions and versions for other languages.

The problem with the ASCII extensions was that (1) the same numeric values had different meanings in different extensions and (2) the same symbol had different bit patterns. The new contender and certain winner is Unicode, which started life as a 16-bit system. The goal is for Unicode to provide a unique number (bit pattern) for every character, no matter what the platform, no matter what the program, no matter what the language.

One place to learn more about Unicode is the web site of the Unicode Consortium: http://www.unicode.org. The latest version of the Unicode Standard is Version 6.0.0. The documentation is located at http://www.unicode.org/versions/Unicode6.0.0/

Basic Terminology

An alphabet is a system of characters in which each symbol has a single sound associated with it. The most common alphabets in use today are Latin, Greek, Arabic and Cyrillic.

A syllabary is a system of characters in which each symbol has a single syllable associated with it. The most common syllabaries in use today are the Japanese kana; Korean Hangul is usually grouped with them, although it is technically an alphabet whose letters are arranged into syllable blocks.

An ideogram system uses characters in which each symbol is a single word. Chinese is the only such system in use today, although other Asian languages borrow from the Chinese character set.

Unicode is the result of an international effort to represent all the alphabets, syllabaries and written-language character sets in the world, including a few dead languages. Unicode originally used two bytes per character (16 bits) and introduced a complete set of terminology for character names; its code points have since grown beyond what 16 bits can hold.

A "code point" is the place of a character within an encoding scheme. Thus, in all ISO standard character sets, upper case “A” is encoded as 0x41 in Hexadecimal -- 65 in Decimal -- so we say that A's code point is 65. This is an abstraction of the symbol and had nothing to do with font, size or other physical display attributes.

Within Unicode, there are several encoding forms, some of which exist to keep compatibility with the previous ASCII standards.

UTF-8 is popular for HTML and similar protocols. Its first 128 code points map, one byte apiece, to the familiar ASCII set, so much existing software works with it.

UTF-16 is used for the NATIONAL CHARACTER (NCHAR) and NATIONAL CHARACTER VARYING (NVARCHAR) data types in SQL. It balances a good range of characters against economical use of storage. All the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern; each Unicode character is encoded in a single 32-bit code unit.
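As a concrete example, the national character data types are how an SQL programmer usually gets at Unicode. Here is a minimal sketch, assuming a DBMS that implements the NATIONAL CHARACTER types; the table, columns and data are invented for illustration:

CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
 emp_name NVARCHAR(35) NOT NULL, -- held as UTF-16 code units
 home_city NCHAR(20)); -- fixed length, blank padded

INSERT INTO Personnel (emp_id, emp_name, home_city)
VALUES (1, N'Müller', N'Köln'); -- national string literals take the N'' prefix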

Unicode covers all the major scripts in use today. It also has punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, and other symbols. If none of those work, you can build symbols by overlaying combining marks; the normalization forms define when such combinations count as the same character. In all, the Unicode Standard, Version 6.0 provides codes for 109,449 characters from the world's alphabets, ideograph sets, and symbol collections.

The mark made on screen or paper -- called a glyph -- is a visual representation of the character. The Unicode Standard does not define glyph images; it defines only how characters are interpreted, not the size, shape or style of the glyphs when they are rendered.

Not every sequence of 8-bit values is a valid UTF-8 string. Not every sequence of 16-bit values is a valid UTF-16 string. Strings that are not valid are called "ill-formed".

When stored in the memory of a running program, UTF-16 code units are almost certainly stored in the native byte order of the machine. In files and when transmitted, two byte orders are possible. When byte order distinctions are important, the names UTF-16BE (big-endian) and UTF-16LE (little-endian) are used.

When a stream of text has a UTF-16 encoding form, and when its byte order is not known in advance, it is marked with a byte order mark. A byte order mark is the formatting character "zero-width no-break space" (U+FEFF) occurring as the first character in the stream. By examining the first two bytes of such a stream, and assuming that those bytes are a byte order mark, programs can determine the byte order of code units within the stream. When a byte order mark is present, it is not considered to be part of the text that it marks.

Collations

A collation is the sorting order for a set of symbols. There is an interesting history behind the invention of alphabetical ordering. It did not catch on until the invention of the printing press and Robert Cawdrey's dictionary in 1604. The basic collation for the modern Latin alphabet is:

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Cicero never saw a 'W' or a lower-case 'j'; they were added during the medieval era, giving us the 26 letters we have today. These letters appear in all standard computer character sets, including 7-bit ASCII and EBCDIC, along with digits and some basic punctuation, even in character sets for languages which do not use the Latin alphabet. They are required for all ISO Standards.

After that, Latin collations can be:

  1. Case sensitive ('New York' <> 'NEW YORK') or case insensitive ('New York' = 'NEW YORK'). A sketch in SQL follows this list.
  2. Indifferent or aware of accented characters ('Düsseldorf' = 'Dusseldorf') or ('Èclair' = 'Eclair'). France and French Canada do not capitalize accented letters the same way. In France the upper case of the word résumé is RESUME -- there are no capital letters with accents, which is why the old PC-DOS character set (IBM extended ASCII) could dispense with the letters É, Ó, etc. But French Canadians capitalize the word résumé and keep the accents: RÉSUMÉ.
  3. Indifferent or aware of inserted punctuation ('pot-holder' = 'potholder') or ('datatype' = 'data type').
  4. Multi-character aware. Before 1997, Spanish treated 'ch' and 'll' as single separate letters. 'Mc' and 'Mac' appear before 'M' in older telephone books (sometimes M' does too, but only in Ireland).
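Here is a minimal sketch of how the first two options look in SQL; the collation names are hypothetical, since each product ships its own:

-- case insensitive: matches 'New York', 'NEW YORK', 'new york'
SELECT hotel_name
  FROM Hotels
 WHERE city_name = 'new york' COLLATE latin1_case_insensitive;

-- accent indifferent: 'Èclair' and 'Eclair' compare as equal
SELECT pastry_name
  FROM Pastries
 WHERE pastry_name = 'Eclair' COLLATE latin1_accent_insensitive;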

The German orthography reform of 1996 (Rechtschreibreform) preceded the Spanish Academy efforts by a year, but only the rules for spelling were changed. In the German DIN-1 standard, 'Ä' = 'A', 'Ö' = 'O' and 'Ü' = 'U'. In the German DIN-2 standard, 'Ä' = 'AE', 'Ö' = 'OE' and 'Ü' = 'UE'. This is the standard used for phone books, voter lists and any other lists of names in Germany. Thus, 'Göbbels' = 'Goebbels' and 'Düsseldorf' = 'Duesseldorf'. However, Austria sets 'Ä' > 'AZ', 'Ö' > 'OZ' and 'Ü' > 'UZ'; that is, each accented character is treated as a separate letter that sorts between A and B, O and P, or U and V. The Swiss use DIN-1 for all sorting.
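A sketch of the difference, with invented collation names:

'Düsseldorf' = 'Duesseldorf' COLLATE german_din2 produces TRUE -- 'Ü' = 'UE'
'Düsseldorf' = 'Duesseldorf' COLLATE german_din1 produces FALSE -- 'Ü' = 'U' only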

The Central European character set supports languages that use Latin-based characters with macrons, slashes, ogoneks, breves, bars and hačeks (also called carons). Many of these do not map to a simple Latin letter when sorting.

The many extra letters created by adding these marks to the simple Latin letters made it impossible to fit them all into the 8-bit LATIN-1 character set, since there can only be a maximum of 256 different symbols in an 8-bit set. (In fact, LATIN-1 actually provides fewer than 256 characters because 0x00 to 0x1f and 0x7f to 0x9f are reserved for control "characters".) Therefore, the most convenient thing to do was group the Latin-based East European character sets together into LATIN 2.

The Nordic character set takes all the additional characters in Danish, Norwegian, Swedish and Finnish and puts them in one collation.

Cyrillic character sets can have different rules for Ukrainian, Bulgarian, Russian and Serbian.

Chinese has at least 13,000 different characters, which are implemented with double-byte character sets (usually "Big5" in Taiwan or "GB" in mainland China); ideally, characters are sorted phonetically (Level 1) or by radical and number of strokes (Level 2).

Japanese is the worst written language on Earth; it mixes multiple writing systems. For example, the word for "water" can be a single character (Kanji), two characters (Hiragana), two characters (Katakana) or four Latin letters (Romaji). The double-byte character sets are usually encoded using escape characters. That saves space, but makes it impossible to determine a text's length or find its nth letter without starting at the front and working through the text.

Problems of String Equality

Obviously collation and equality are related. But there is more to it than that. Different programming languages handle string comparisons differently. Some comparisons ignore the case of the letters, others do not. Some programming languages truncate the longer string, others pad the shorter string with blanks, and some refuse to compare strings of unequal length at all. For example, the Xbase family truncates, while SQL pads the shorter string, so 'Smith' and 'Smithsonian' are equal in Xbase and not equal in SQL.
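In SQL's pad-the-shorter-string style, the comparisons behave like this:

'Smith' = 'Smith   ' produces TRUE -- the shorter string is padded with blanks first
'Smith' = 'Smithsonian' produces FALSE -- nothing is truncated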

Let's look at SQL's rules for mixing collations.

Default Collation

Databases are created with a default collation. In Standard SQL, there is a syntax for creating and naming a collation, and there are some standard collation names. The standard names are usually national language standards like the ones we have been discussing.
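A minimal sketch of the Standard SQL syntax; the character set name and both collation names are invented for illustration:

CREATE COLLATION german_phone_book
   FOR latin1
   FROM german_din2
   PAD SPACE; -- pad shorter strings with blanks during comparisons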

Implicit Collation

The collation of a string expression is the collation defined for the column in the table or view. You cannot override this in the base table to mix Esperanto, Chinese, Arabic and Swedish in one column. It is the same principle as allowing only one data type in a column.

However, one collation might be converted to another, just as one numeric data type can be cast to another when certain rules are met.
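For example, the collation can be attached to the column definition, and every later use of that column implicitly picks it up; the names are again invented:

CREATE TABLE Cities
(city_name VARCHAR(25) NOT NULL
     COLLATE german_din2, -- the implicit collation for city_name
 population INTEGER NOT NULL);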

Explicit Collation

A string expression can be explicitly cast to a specific collation by using a COLLATE clause after the expression: <string expression> COLLATE <collation name>.

Again, common sense applies and I cannot cast Chinese into Greek. Well, not quite. I can have a custom translation between character sets in Standard SQL, character by character.
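A sketch with a hypothetical collation name, reusing the Cities table from above:

SELECT city_name
  FROM Cities
 ORDER BY city_name COLLATE german_din1; -- overrides the column's implicit collation for this one sort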

No Collation

It is not always possible to resolve an expression to a collation.

The mixed-collation conversion rules are intuitive and follow the pattern for data types. Explicit conversion takes precedence over implicit conversion. Implicit conversion takes precedence over the default collation. You get an error when you try to combine two expressions with different explicit collations; that means you cannot look for Greek inside a Chinese column. Combining two different implicit collations yields no collation. Combining an expression that has no collation with any other expression also yields no collation.
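A sketch of the error case, with the same invented collation names as before:

SELECT city_name
  FROM Cities
 WHERE (city_name COLLATE german_din1)
     = (city_name COLLATE german_din2); -- error: two different explicit collations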

Let's take a quick look at basic string functions.

Length

Remember that with Unicode, it is no longer true that one character takes exactly one byte. Standard SQL has three versions of the string length function: one counts bytes, one counts bits and one counts characters. Go back and look at the DIN-1 and DIN-2 collations; the same word can have different lengths ('Düsseldorf' versus 'Duesseldorf').
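In Standard SQL those three functions are OCTET_LENGTH(), BIT_LENGTH() and CHAR_LENGTH() (also spelled CHARACTER_LENGTH()). The byte and bit counts below assume the string is held in UTF-8; other encoding forms give other numbers:

CHAR_LENGTH('Düsseldorf') produces 10 -- characters
OCTET_LENGTH('Düsseldorf') produces 11 -- bytes; the 'ü' takes two bytes in UTF-8
BIT_LENGTH('Düsseldorf') produces 88 -- bits, which is 8 * OCTET_LENGTH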

Empty String versus NULL String

In ANSI/ISO Standard SQL, the NULL string is not the same as the empty string. The usual rule that "NULLs propagate" holds for strings as well as for numerics and temporal expressions.

Another interesting property is that CHAR(n) strings have to be padded out with blanks to the full length of their column. The VARCHAR(n) data types must have (n >= 1); they cannot be declared less than one character in length. Even a VARCHAR(n) string has to be padded out with blanks to bring it to the minimum length of one. Thus, any attempt to store an empty string in an SQL column will be padded out with a blank, so you cannot store an empty string in SQL.
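A sketch of the padding rule; the table is invented for illustration:

CREATE TABLE Pad_Demo
(fixed_col CHAR(5),
 var_col VARCHAR(5));

INSERT INTO Pad_Demo VALUES ('ab', 'ab');
-- fixed_col now holds 'ab   ' and CHAR_LENGTH(fixed_col) produces 5
-- var_col holds 'ab' and CHAR_LENGTH(var_col) produces 2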

Concatenation

Concatenation is the operation of attaching one string to another to produce a result. The Standard SQL notation is taken from the PL/I programming language and consists of the infix operator ||; T-SQL still has the original Sybase overloaded + syntax. Since the rest of the string operators are function calls, concatenation has the lowest precedence.
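A few examples in the two notations:

'New' || ' ' || 'York' produces 'New York'
'New York' || NULL produces NULL -- NULLs propagate
-- the T-SQL equivalent: 'New' + ' ' + 'York'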

Position

This function appears in many different forms in programming languages, but it is always designed to find the starting position of one string (usually called the search string) inside another string (usually called the target string). T-SQL uses CHARINDEX(), but you will also see INDEX() and LOCATE() in other products. The Standard SQL syntax for this function has the embedded keyword pattern that is characteristic of SQL:

POSITION (<character value expression>

IN <character value expression>)

and it returns a non-negative integer result, with a zero meaning that the search string was not a substring of the target string. SQL has to add some more rules to handle NULLs and national character sets.

The rules for POSITION can be summarized in these examples:

POSITION ('abc' IN 'abcdef') produces 1

POSITION ('xyz' IN 'abcdef') produces 0

POSITION ('' IN 'abcdef') produces 1

POSITION ('abc' IN '') produces 0

POSITION ('' IN '') produces 1

POSITION ('abc' IN NULL) produces NULL

POSITION (NULL IN 'abcdef') produces NULL

POSITION (NULL IN NULL) produces NULL

Substring

The substring function returns a set of contiguous characters from a source string. It takes three parameters: the source string, the numeric starting position, and then either the ending position or a displacement from the starting position, depending on the language. All the numeric values are non-negative.
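The Standard SQL version uses the embedded-keyword style, SUBSTRING(<string> FROM <start> [FOR <length>]), while T-SQL and most other products take comma-separated parameters:

SUBSTRING('Düsseldorf' FROM 1 FOR 3) produces 'Düs'
SUBSTRING('Düsseldorf' FROM 4) produces 'seldorf' -- no FOR clause means run to the end
-- the T-SQL equivalent: SUBSTRING('Düsseldorf', 1, 3)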

Pattern Matching

Standard SQL has a simple pattern matching predicate, the [NOT] LIKE Boolean operator. It takes a match value string expression and a pattern, and returns TRUE if the match value fits the pattern. The characters '%' (percent) and '_' (underscore) have special meaning when they occur in the pattern.

The optional third argument is a character string containing exactly one character, known as the "escape character"; it is used when a percent sign or underscore is needed in the pattern without its special meaning. The escape character, if it is given, must be followed by a percent sign, an underscore or another escape character. The leading escape character is ignored for matching purposes and the character that follows it is treated as a character literal with no special properties.

The syntax is:

<like predicate> ::=

<match value> [NOT] LIKE <pattern>

[ESCAPE <escape character>]

The basic rules have no great surprises. The strings have to be in the same collation. The expression "<match value> NOT LIKE <pattern>" is equivalent to "NOT (<match value> LIKE <pattern>)", as with other SQL constructs. If any of the three strings is NULL, then the predicate returns UNKNOWN as its result.

Formally defining LIKE is trickier than it would seem at first.

Each "un-escaped" underscore character represents an arbitrary character and each "un-escaped" percent character represents an arbitrary string, possibly empty. All other characters represent whatever character they are.

If <match value> and <pattern> are character strings whose lengths are variable and if the lengths of both <match value> and <pattern> are 0, then the result is TRUE.

The LIKE predicate is TRUE if there exists a partitioning of <match value> into substrings such that:

  1. You can map every un-escaped single character in the pattern to the same character in the <match value>.
  2. You can map every underscore wildcard in the pattern to some single character in the <match value>.
  3. You can map every percent wildcard in the pattern to a substring of zero or more contiguous characters in the <match value>.
  4. The number of substrings in the partition of <match value> is equal to the number of specifiers in <pattern>.
  5. Otherwise, the result is FALSE.

The following example might help make this clear.

'Book' LIKE 'B%k' is TRUE

because 'B' maps to 'B',

'%' maps to 'oo',

'k' maps to 'k'.

'Block' LIKE 'B%k' is TRUE

because 'B' maps to 'B',

'%' maps to 'loc',

'k' maps to 'k'.

'Bk' LIKE 'B%k' is TRUE

because 'B' maps to 'B',

'%' maps to '',

'k' maps to 'k'.

'Blob' LIKE 'B%k' is FALSE

because 'k' has nothing to map onto.

'20% off' LIKE '20/% %' ESCAPE '/' is TRUE

because '2' maps to '2',

'0' maps to '0',

'%' maps to '%' via an escape,

' ' maps to ' ',

'%' maps to 'off'.

Thoughts on Using Characters for Data

I hope that you now have a feeling for how complicated character data can be. If you are storing full text in a particular language, you probably ought to be using a document management system and not SQL.

If you use characters for encoding schemes in a database, then try to restrict yourself to LATIN-1. Within LATIN-1, use letters, digits and minimal punctuation marks. The collation is guaranteed and the characters are portable across all Unicode character sets.
