Potential issues with different collations

  • Hi all,

    What are the potential implications (if any) for a database having a different collation than a server for the query engine itself? For instance, I know if *I* want to use tempdb for a temp table then I need to use COLLATE statements. But what about a query, which only joins tables from on one database that the engine is executing? Are there performance implications for the server itself when tempdb is a different collation than the database it's executing queries on?

    Any input is appreciated.

  • Collation , will impact sort operations and as a result performance . Additionally collation will impact the results being returned and as a result the join and type of join operations being performed.

    Jayanth Kurup[/url]

  • Jayanth_Kurup (6/24/2011)


    Collation , will impact sort operations and as a result performance . Additionally collation will impact the results being returned and as a result the join and type of join operations being performed.

    Can you provide anymore details specific to my example?

    For example if tempdb uses collation A and database uses collation B, and I run the following query: "SELECT a.col1, a.col2 from table1 a INNER JOIN table 2 b on a.col3 = b.col3 ORDER BY a.col1" I assume my results are still sorted as per collation B? How is performance affected for queries.

    The two collations in question here are SQL_Latin1_General_CI_AS and the windows Latin1_General_CI_AS (not at work right now, those may not be exact), I assume they are close enough that sort order won't be an issue?

  • Hi,

    Sorting done based on the character. e.g. if you run the sort "order by " using the two different datatype column then you get different result. So insert some date in one column using the US date format and UK date format and run the order by column. You get the different sorting.

    So you data value and position of the character differs in different Collation.

  • Does anyone else have any input?

  • http://msdn.microsoft.com/en-us/library/aa214297%28v=sql.80%29.aspx

    Read this for a better understanding on collation

    Jayanth Kurup[/url]

  • Jayanth_Kurup (6/24/2011)


    http://msdn.microsoft.com/en-us/library/aa214297%28v=sql.80%29.aspx

    Read this for a better understanding on collation

    Thank you, but I think there is a misunderstanding on what I am asking, perhaps this is my fault. It is not that I don't understand collations, it is that I don't understand the performance implications for the SQL query engine when it is processing a query that uses (only) tables from database A but internally uses tembdb during processing which uses the different collation, and specifically for SQLServer 2008. My original question (reworded) was:

    But what about a query, which only joins/processes tables from one database? Are there performance implications for the server itself when tempdb is a different collation than the database it's executing queries from?

    Hopefully that makes what I am seeking information on clearer.

  • New temp tables will use the default collation of tempdb unless otherwise specified.

    Here are a few options (there may be others) for you in terms of creating your temp tables so you do not get the default collation of tempdb, and rather get the collation that matches your particular columns.

    Option 1 uses the "database_default" keyword:

    CREATE TABLE #tmp

    (

    a VARCHAR(100) COLLATE database_default,

    b NVARCHAR(100) COLLATE database_default

    ) ;

    Option 2 explicitly specifies the collation:

    CREATE TABLE #tmp2

    (

    a VARCHAR(100) COLLATE Chinese_Traditional_Bopomofo_100_CS_AI_WS,

    b NVARCHAR(100) COLLATE Chinese_Traditional_Bopomofo_100_CS_AI_WS

    ) ;

    Option 3 makes use of the SELECT...INTO technique of creating a new temp table which will instruct SQL Server to derive the temp table definition from the selected columns. Collations are part of a table's definition.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thanks, but really not what I am asking about.

    For anyone else not sure what I am talking about take a look here: http://msdn.microsoft.com/en-us/library/ms190768.aspx

    I'm talking about potential performance issues of Internal objects that are created by the SQL Server Database Engine, for example, work tables to store intermediate results for spools or sorting. And specifically for 2008.

    I'm not creating any temp tables myself or manually interacting with tempdb in anyway. I'm asking about the performance hit (if any) of the server collation being SQL_Latin1_General_CI_AS and the db collation being the windows Latin1_General_CI_AS while the engine is resolving a query from only the single database in question where it decides to use tempdb (assuming it still would given the different collations). There is no explicit tempdb usage by the person executing the query.

    I really don't think I can make the question any clearer at this point.

  • I think I am the same page as you now, sorry, I missed your *I* inference in your OP.

    The intertubes did not turn much up for me, as I suspect it did not for you which is why you posted. I am forced to abstract a bit here but only for lack of documentation. Just thinking out loud really, someone may come along and tell me to shut it, or school me, or both, but I am thinking about how schema is derived when an "internal" temp table needs to be created...and what happens when you ask for a table to be created using SELECT...INTO. I suspect the same techniques would apply for internal tables that you do not explicitly create. If you explicitly compare two columns with different collations that's on you, but I can't see the engine swapping out collations like that on your behalf, after all, collations are part of the schema definition of the column.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • worktables use the collation as specified for the database in which the operation is being performed. So I dont think you would see any additional performance impact other than what you would see if the table was created outside of tempdb.

    the example I tired was to change the collation of my user database and create a table variable into which it inserted a million rows . On querying the tempdb temptables columns I found collation name to be the same as that of my user database instead of the database default for temp db. Hope this answers your question.

    Jayanth Kurup[/url]

  • Jayanth_Kurup (6/25/2011)


    worktables use the collation as specified for the database in which the operation is being performed

    Can you cite an article?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • No, the Internet didn't really turn up much for me either unfortunately. Thanks for your comments, I'll bump this up on Monday and see if anyone else has any input then.

    As a thought, I wonder if the engine is smart enough to say realize there is a difference in collations between tempdb and the database and create any internal structures to use the db collation? Wouldn't really be a hard thing for it to check for and do. -- heh, just saw the second page of comments with this same thing.

  • raistlinx - Saturday, June 25, 2011 1:37 AM

    I'm talking about potential performance issues of Internal objects that are created by the SQL Server Database Engine, for example, work tables to store intermediate results for spools or sorting. And specifically for 2008.I'm not creating any temp tables myself or manually interacting with tempdb in anyway. I'm asking about the performance hit (if any) of the server collation being SQL_Latin1_General_CI_AS and the db collation being the windows Latin1_General_CI_AS while the engine is resolving a query from only the single database in question where it decides to use tempdb (assuming it still would given the different collations). There is no explicit tempdb usage by the person executing the query.

    Hi there. I am going to have to say "no" to the possibility of performance issues in this scenario. Database-level collation is used for:

    1. the contents of string literals and variables / parameters. Neither of those come into play here.
    2. the default collation used when creating / altering columns in a "CREATE TABLE" / "ALTER TABLE" statement when not using the "COLLATE" clause. That is also not happening here. Tables created from a "SELECT ... INTO..." use the collation of the field in the result set that is being used to create the table.
    3. database-level meta-data (e.g. object names, user names, etc). This is also a non-issue for this scenario
    Basically, if you create a table in some other DB that uses some other collation than the DB you are executing the query from (i.e. the active DB), then the most it could impact is column-name resolution (since that is meta-data which is controlled by the DB containing the object) and the collation of computed columns that don't specify the "COLLATE" clause. Below is an example of each of those cases:

    -- DB collation is Latin1_General_100_BIN2 ; tempdb collation is SQL_Latin1_General_CP1_CI_AS
    CREATE TABLE dbo.ColumnNames(col1 INT, Col1 INT, COL1 INT);
    DROP TABLE dbo.ColumnNames;

    DECLARE @Table TABLE (col1 INT, Col1 INT, COL1 INT);
    /*
    Msg 2705, Level 16, State 3, Line XXXXX
    Column names in each table must be unique.
        Column name 'Col1' in table '@Table' is specified more than once.
    */

    CREATE TABLE #ColumnNames(col1 INT, Col1 INT, COL1 INT);
    /*
    Msg 2705, Level 16, State 3, Line XXXXX
    Column names in each table must be unique.
        Column name 'Col1' in table '#ColumnNames' is specified more than once.
    */
    GO

    and:

    -- DB collation is Latin1_General_100_BIN2 ; tempdb collation is SQL_Latin1_General_CP1_CI_AS
    DECLARE @ComputedColumnCollation TABLE
    (
    [ID] INT NOT NULL,
    [SomeChar] AS (NCHAR([ID])),
    [String] VARCHAR(10)
    );

    SELECT *
    INTO dbo._comp
    FROM @ComputedColumnCollation;

    SELECT col.[name], col.[user_type_id], col.[collation_name], col.[is_nullable]
    FROM sys.columns col
    WHERE col.[object_id] = OBJECT_ID(N'dbo._comp')
    ORDER BY col.[column_id];
    /*
    name      user_type_id   collation_name                 is_nullable
    ID        56             NULL                           0
    SomeChar  239            SQL_Latin1_General_CP1_CI_AS   1
    String    167            Latin1_General_100_BIN2        1
    */

    DROP TABLE dbo._comp;

    Both of those cases are functional issues, not performance issues.

    What this comes down to is: whatever collation is being used by any string column (minus XML) is the collation that must be used in any work table, version store, trigger / OUTPUT clause table (i.e. INSERTED and DELETED), etc. And this has nothing to do with the collation of either DB or of the instance since every field of a result set can use a different collation. If the collation used for fields in these system-created work tables was different than the collation of the field the column is intended to capture, then you could end up with data loss for VARCHAR data if the code page was different. At the very least you would get the beloved collation mismatch error.

    For more details on what areas are impacted by which level of collation, please see:
    Changing the Collation of the Instance, the Databases, and All Columns in All User Databases: What Could Possibly Go Wrong?

    For info on working with collations in general, please visit: Collations Info

    Take care,
    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply