Beware of Mixing Collations - Part 1
by Gregory A. Larsen
A few months back I was running a stored procedure (SP) that performed a cross database joins on our SQL Server 2000 box. This particular procedure was processing records from a table in one database (database A), that was originally created on our SQL Server 2000 box, while also processing records from a table in another database (database B), that was also on our SQL Server 2000 box, but was created by restoring a SQL Server 7.0 database backup. The SQL Server 7 machine had a different character set and sort order than our SQL Server 2000 database. The SELECT statement the SP was running was joining records from the table in database A with the table in database B on a varchar key field. Can you imagine what happened? The SP failed with a "collation error".
SQL Server 2000 finally solved the problem of taking database backups from servers with different character set / sort order and restoring them, without getting a collation error. Before SQL Server 2000, backups from one server could only be restored on a different server if the same character set and sort order where used by both servers. With SQL Server 2000 you are able to set the collating sequence at the server, database and even the column level. Now that you can have multiple collating sequences on a single database server, there are a few things you should understand about mixing collations.
When you have character data stored with different collation sequences, you need to be aware that comparing columns becomes more involved. Columns with the same collating sequences can be compared or joined without any additional considerations. Although when you compare columns that have different collating sequences it require some special coding considerations.
Let’s look at an example that will demonstrate what I am talking about. The example below tries to join data from two different tables on columns that have different collating sequences.
create table #a (char_set1 varchar(50) collate Latin1_General_CI_AS)
create table #b(char_set2 varchar(50) collate Latin1_General_BIN)
insert into #a values ('collate')
insert into #b values ('collate')
select * from #a join #b on char_set1 = char_set2
When the code above is executed the select statement returns the following error:
Cannot resolve collation conflict for equal to operation.
Why does the SELECT statement get this error? This error happens because char_set1 column in table #a has collation of Latin1_General_CI_AS, where as char_set2 column in table #b uses a collation of Latin1_General_BIN.
Of course SQL Server allows for you to write code so you can join the above two tables, even though the join columns each have a different collation setting. By changing the SELECT statement so that one of the columns does an explicit collation conversion prior to joining resolves the issue shown above. Below char_set2, which has a collation setting of Latin1_General_BIN is explicitly converted to Latin1_General_CI_AS using the COLLATE clause:
select * from #a join #b on char_set1 = char_set2 collate Latin1_general_CI_AS
By using this explicit conversion from one collation to another you can then join both tables. Now if you really have to store and compare columns with different collation then SQL Server is great at providing you this functionality. Although if you have no business requirements to store data with different collations then by all means don’t. Storing data of varying collations causes additional unnecessary programming considerations therefore choose your collation requirements wisely.
In our environment we have a lot of contractors doing development. These contractors typically develop our databases on their own SQL Server box. Their server and/or database collation setting may or may not be the same as ours. Occasionally a contractor will send me a database backup and tell me to restore it onto our test server. This is easily done with SQL Server 2000, regardless of the collation setting on the developer’s machine. Every so often when I do this, I find that the restored database has a different collation setting then our server. When this happens I have to convert the data in the restored database to our standard collation (this another topic that I will discuss in part 2). I do this conversion because we require all our data to have the same collation setting. If you are restoring databases from other SQL Server machines, be aware that you might be restoring data that has a different collation setting then your server.
I’m going to leave you with one more example why knowing your collating setting is extremely import.
The TEMPDB database is created each time the server is rebooted. TEMPDB has the same default collation sequence as the model DB. If for some reason you have a SQL Server 2000 database that has a different sort sequence than your current TEMPDB collation setting, then you potentially might have a problem when comparing data between your application and TEMPDB.
Here is an example of the TEMPDB comparison I am talking about. This example creates database test with a collation of Latin1_General_BIN, then creates and populates table b with data. Next it builds table #temp in TEMPDB and inserts records into it from table b. The columns in TEMPDB use the model db collation sequence settings, which is in my case is Latin1_general_CI_AS.
create database test collate Latin1_General_BIN
Create table b(char_set1 varchar(50))
insert into b values ('collate')
CREATE TABLE #TEMP (CHAR_SET2 VARCHAR(50))
insert into #temp select * from b
select * from #temp
select * from #temp t join b on t.char_set2 = b.char_set1
When I run the above code I get the collation conflict error on the second SELECT statement. Where as if the two databases had the same collation sequence this error would not occur. If I change the second SELECT statement to read as follows the join will work without a collation conflict :
select * from #temp t join b on t.char_set2 collate Latin1_General_BIN = b.char_set1
There are a couple different ways to determine your default sort order and character set. The sp_helpsort SP can be used to determine the character set / sort order of your server. For determining the database character set / sort order use sp_helpdb. Now if you also what to know the sort order / character set for each column in a database than use the information_schema.columns view to display this information.
Being able to restore databases, onto a SQL Server 2000 box, without requiring the target server to have the same character set and sort order as the source server is a big improvement over SQL Server 7.0. Although you need to be aware of the collation problems that might occur if your SQL Server 2000 machine contains databases and columns with varying collations. If you really don’t have a business requirement to store data with different collations, then verify collation information every time you restore a database where you are unsure of the collation setting on the source server/database. Bottom line is you better know your column collation settings in order to avoid the collation conflict error message.