
TechEd - Designing a Global Database


I was going to hit a PowerShell lab because the Scripting Guys at TechNet write great columns, but it filled up, so I went to my second choice: Building a Global Database.

 

What does this mean? If you are building a database that handles all types of languages and cultures, you have to handle not only how characters are stored, but also how they are displayed. Formatting and layout differ across cultures, so you need to ensure that you maintain "cultural integrity" with your data.

The Storage Perspective

What does it mean to be multi-cultural?

Really it means Unicode. There are two Unicode data types (nchar, nvarchar), but encoding matters too. SQL Server uses UCS-2 encoding: fixed-width, 2-byte (16-bit) characters. If you use char/varchar instead, these are not Unicode; their storage is defined by the code page of the column's collation.
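
As a quick sketch of the storage difference (my own example, not from the session):

    CREATE TABLE dbo.LangDemo (
        AsciiName  varchar(20),   -- non-Unicode, code-page dependent
        GlobalName nvarchar(20)   -- Unicode (UCS-2), 2 bytes per character
    );

    -- The N prefix marks a Unicode literal; without it the string passes
    -- through the code page and characters outside it can be lost
    INSERT INTO dbo.LangDemo VALUES ('Tokyo', N'東京');

    SELECT DATALENGTH(AsciiName)  AS AsciiBytes,    -- 5 (1 byte per character)
           DATALENGTH(GlobalName) AS UnicodeBytes   -- 4 (2 characters x 2 bytes)
    FROM dbo.LangDemo;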

SQL Server 2008 added the new date and time types. There's ISO 8601 support with DATEPART to handle global applications, including a new ISO_WEEK parameter. The datetimeoffset type was also added; it stores the offset from UTC along with the value, so it's globally aware.
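
For instance (my own sketch, not the presenter's code):

    -- ISO_WEEK gives the ISO 8601 week number (weeks start on Monday)
    SELECT DATEPART(ISO_WEEK, '20080101') AS IsoWeek;

    -- datetimeoffset keeps the time zone offset from UTC with the value
    DECLARE @dto datetimeoffset = '2008-06-10 09:30:00 -05:00';
    SELECT @dto                         AS LocalWithOffset,
           SWITCHOFFSET(@dto, '+00:00') AS SameInstantAsUtc;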

A code example was given to check if a date was UTC compliant. It's a good idea, especially as more and more of us deal with global data.
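
I don't have the exact code from the session, but with datetimeoffset a check along these lines works, using the TZoffset datepart:

    DECLARE @d datetimeoffset = '2008-06-10 09:30:00 -05:00';

    -- TZoffset returns the offset from UTC in minutes; 0 means the value
    -- is already expressed in UTC
    SELECT CASE WHEN DATEPART(TZoffset, @d) = 0
                THEN 'UTC compliant' ELSE 'Not UTC' END AS UtcCheck;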

SET LANGUAGE gives you error messages and date/time formats in the appropriate language on the server. ISO 8601 formatting isn't affected by this. The client uses its own connection parameters to handle its own formatting.
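
A quick illustration of the server side (my own sketch; the yyyymmdd literal is used because it's unambiguous in every language):

    SET LANGUAGE French;
    SELECT DATENAME(month, '20080610') AS MonthName;  -- juin

    SET LANGUAGE us_english;
    SELECT DATENAME(month, '20080610') AS MonthName;  -- June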

One of the changes in SQL Server 2008 with language is the enhancements to Full-Text search, which added many languages (51 out of the box) to the word breakers used for searching. The word breakers were developed by the Office team, so there is compatibility between products (including Vista). Earlier versions shipped with word breakers for only 11 languages.

Collations

Collations are fundamental to all operations, so you should be aware of how they work. A collation defines the sort order and code page used. Sorting is based on culture and linguistics. Code pages specify the set of characters that are recognized.

Windows collations provide more information than SQL collations.

Collations include various aspects: case sensitivity/insensitivity, accent sensitivity/insensitivity, kana sensitivity/insensitivity, and width sensitivity/insensitivity. Kana sorting is an option for all collations, but it's essentially there for Japanese. Width insensitivity means a double-byte character is treated the same as its single-byte equivalent when sorting, so an A in one byte (char) equals an A in double byte (nchar).
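
A sketch of how the sensitivity flags play out (the collation names are real; the comparison is my own example):

    -- CI/CS = case (in)sensitive, AI/AS = accent (in)sensitive
    SELECT CASE WHEN 'resume' = 'Résumé' COLLATE Latin1_General_CI_AI
                THEN 'equal' ELSE 'not equal' END AS CaseAndAccentInsensitive,
           CASE WHEN 'resume' = 'Résumé' COLLATE Latin1_General_CS_AS
                THEN 'equal' ELSE 'not equal' END AS CaseAndAccentSensitive;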

Case sensitivity can be set at different levels: there is an instance-level collation, a database level, and then an object level. This matters since users, passwords, etc. are compared according to the instance-level setting.

SQL Server Collations

There are two types of binary sorting. _BIN sorts based on binary code point. This has the fastest performance, but it works on fixed 2-byte units, so supplementary characters (4-byte UTF-16 surrogate pairs) might not sort correctly. _BIN2 sorts by Unicode code point, so 4-byte supplementary characters are sorted correctly.
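
You can see both variants in the catalog; listing them is a safe way to explore (my own query, not from the session):

    SELECT name, description
    FROM fn_helpcollations()
    WHERE name IN (N'Latin1_General_BIN', N'Latin1_General_BIN2');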

One big thing is that collations are set in a tab during setup. THIS IS EASY TO MISS, so if collation is important, go slow here. It's not obvious where this is.

tempdb's collation is based on model, so change model if you need tempdb to differ. There is a problem if you need multiple collations: temp objects need to be declared with the required collation on their columns or table.
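
A minimal sketch of declaring the collation on a temp object, assuming you want it to match the current user database:

    -- COLLATE DATABASE_DEFAULT picks up the current database's collation
    -- instead of tempdb's (which comes from model)
    CREATE TABLE #Names (
        CustomerName nvarchar(50) COLLATE DATABASE_DEFAULT
    );

    -- Without that clause, joining #Names to a table in a database with a
    -- different collation raises a collation-conflict error
    DROP TABLE #Names;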

Collations can be changed on databases and columns, but this changes metadata, so the interpretation of your data could change. Be very careful altering collations; you have the choice of T-SQL or SSMS. You can also change data from non-Unicode to Unicode.
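
For example, both changes can be made with ALTER COLUMN (reusing the hypothetical table from the earlier sketch):

    -- Change just the collation (same type, new sort/comparison rules)
    ALTER TABLE dbo.LangDemo
        ALTER COLUMN AsciiName varchar(20) COLLATE French_CI_AS;

    -- Convert non-Unicode to Unicode at the same time
    ALTER TABLE dbo.LangDemo
        ALTER COLUMN AsciiName nvarchar(20) COLLATE French_CI_AS;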

Collation can be chosen on the fly in a select statement. (select * from T order by C COLLATE Japanese_CI_AS). 

There is an interesting chart on how collation conflicts are handled. Two explicit declarations can error out when they conflict, but explicit declarations trump implicit declarations. There is a chart in BOL (SQL Server 2008) to determine how other situations are handled.
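
A sketch of the explicit-over-implicit rule (the database, table, and column names here are made up):

    -- a.Name and b.Name carry different implicit collations; without the
    -- explicit COLLATE the join would raise a collation-conflict error
    SELECT a.Name
    FROM dbA.dbo.T1 AS a
    JOIN dbB.dbo.T2 AS b
      ON a.Name = b.Name COLLATE Latin1_General_CI_AS;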

The collations have been aligned with Windows Server 2008, adding 80 new Windows collations. Weights have been added, along with updates to Japanese, Chinese, etc., reflecting changes to those languages based on government standards. Quite a few minority scripts used in China have been added (Yi, Uighur, Mongolian, Tibetan), as well as supplementary characters (4-byte Unicode). The new collations are versioned (90 and 100), and the version shows up in the collation name.

An impressive slide listed the collations by version: about 35 or so in SQL Server 2000, about 6 updated and 6 new in 2005, and then almost a doubling of collations in 2008. SQL Server is going global.

Demos: create table, change types. There is an SSMS setting that prevents saving changes that require the table to be re-created. That's a great safety net I wasn't aware of in the tool.

Inserting Unicode characters: some display, some show the square box as they're not displayable in SSMS, and some were entered using the hex code. Across three inserts, the interesting thing is how non-Unicode data is handled. The insert is smart enough to break up hex values into the correct number of characters, but you can get strange data.

Stroke order is another sort order that matters in some languages, and support has been added to match the dictionaries in the cultures where this matters.

There is a tempdb demo that shows some of the issues with different collations. To me, if SQL Server wants to be a truly global player, this is a reason why there need to be multiple tempdbs. The CASE function does need an explicit collation in the query since it's not aware of the collations.
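
A sketch of the CASE point, with a made-up table whose two name columns would have different collations:

    DECLARE @UseLocal bit = 1;

    -- If LocalName and ExportName have different collations, the CASE
    -- result needs an explicit COLLATE clause
    SELECT CASE WHEN @UseLocal = 1 THEN LocalName ELSE ExportName END
               COLLATE DATABASE_DEFAULT AS DisplayName
    FROM dbo.Products;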

Best Practices 

Use Unicode throughout the stack (end to end). If this isn't possible, be sure that conversion points are well known.

- In the ODBC flow, if the client has SQL_C_CHAR and the server has CHAR with different code pages, these need to be converted. With AutoTranslate, the client converts from the client code page to Unicode, then uses the server code page to convert from Unicode to the server page. This happens client side, prior to going over the wire, so the client must be aware of the server code page and have it installed.

If the client is char and the server is Unicode, there's one conversion. Same if the client is nchar and the server is char. You can still have issues here.

No conversions are made for surrogate characters (4 byte). These are stored as 2 separate 2-byte characters, so all handling must be done by the clients.

The biggest danger here is a Unicode client working with a non-Unicode server. This could result in data loss, and hopefully people working with multiple languages are using Unicode storage only.

There are a few white papers on best practices for globalization, covering changing collations and moving to a Unicode world.

Kind of a basic session in some ways. They covered some basics that even a language novice such as me is aware of, but there wasn't a lot of marketing in terms of SQL Server 2008. It does show that there is much more support in 2008, so if you need that support, think about SQL Server 2008.

 
