Blog Post

Impact on Indexes When Mixing VARCHAR and NVARCHAR Types

,

You may have heard, or read, that you should not mix VARCHAR and NVARCHAR datatypes, especially when one of them is a JOIN or WHERE predicate / condition, as doing so will invalidate indexes. While it is always best to have all datatypes be the same for a particular operation (comparison, concatenation, etc), the actual impact of mixing these two types is a bit more nuanced than most people are aware of.

To start with, we need to consider that internally, SQL Server needs the datatypes on both sides of a comparison (or concatenation) to be the same. If you have a mismatch of types and do not explicitly convert one type to the other so that they are the same, then SQL Server will handle this itself. This is called an implicit conversion and is handled by the appropriately named internal function, CONVERT_IMPLICIT() (“internal” meaning you can’t call it, but it will show up in execution plans).

Which datatype will convert to the other is determined by Data Type Precedence. Since N[VAR]CHAR has a higher precedence than [VAR]CHAR, the [VAR]CHAR value will be converted N[VAR]CHAR. Having the conversions go in this direction makes sense as it should never lead to any data loss as Unicode should be able to map all characters from all 8-bit Code Pages.

Since we know that [VAR]CHAR will convert to N[VAR]CHAR, then “accidentally” using a [VAR]CHAR value to compare against an N[VAR]CHAR column shouldn’t prevent using an index on that column since the data in that column (and hence in that index) is already “correct” for that operation.

But, on the other hand, “accidentally” using an N[VAR]CHAR value to compare against a [VAR]CHAR column should be expected to have some difficulties since the indexed values are not “correct” for that operation. And while this scenario is where we start to see issues related to the datatype mismatch, the overall impact on performance depends entirely on the type of Collation being used (as we shall see in just a moment).

There are two types of Collations in SQL Server: SQL Server Collations and Windows Collations:

  • SQL Server Collations (those starting with SQL_) are older Collations that were the only ones available prior to SQL Server 2000. These Collations use simplistic sort orders, and do not handle the great variety of linguistic rules defined by Unicode. They use string-sort (instead of word-sort), cannot do character expansions, etc. In fact, they do not have any Unicode rules defined at all, so N[VAR]CHAR data in these Collations will actually use OS-level Collation rules.
  • Windows Collations (those not starting with SQL_) were introduced in SQL Server 2000. These Collations not only have the Unicode rules defined, but they also apply those same linguistic rules to [VAR]CHAR data. This allows 8-bit character types to do word-sort (instead of string-sort), character expansions, etc. While this does come at a slight cost to performance, it also allows for consistency of behavior. And it is this consistency that helps out greatly when there is a mismatch of datatypes (as we will see in a moment).

While the SQL Server Collations have not been officially deprecated, they mainly exist for backwards compatibility and it is highly recommended that newer Windows Collations should be used unless there is a pressing need to use a SQL Server Collation. Unfortunately, the default Collation for new installations, at least with the United States, is a SQL Server Collation — SQL_Latin1_General_CP1_CI_AS — and it is this default that is likely the cause for the commonly held rule that mismatching datatypes causes problems (since most people are probably not doing additional testing with Windows Collations).

The simple example below illustrates the behavior across the four combinations of mismatching the two datatypes in each of the two Collations.

Setup:

For this test we will have a single table with four columns (each one representing one of the four previously mentioned combinations). A single row of data is added simply to avoid having an empty table (and as a result, empty indexes), just in case that would somehow affect if / how the indexes are chosen by the Query Optimizer. As the final comment indicates, please do not forget to enable “Include Actual Execution Plan” if you don’t already have it enabled.

-- DROP TABLE #IndexTest;
CREATE TABLE #IndexTest
(
  [IndexTestID] INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
  [8bit_SqlCollation] VARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
  [Unicode_SqlCollation] NVARCHAR(50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
  [8bit_WinCollation] VARCHAR(50) COLLATE Latin1_General_100_CI_AS NOT NULL,
  [Unicode_WinCollation] NVARCHAR(50) COLLATE Latin1_General_100_CI_AS NOT NULL
);
SET NOCOUNT ON;
INSERT INTO #IndexTest VALUES ('a', N'a', 'a', N'a');
CREATE INDEX [IX_#IndexTest_ASCII_Sql] ON #IndexTest ([8bit_SqlCollation] ASC);
CREATE INDEX [IX_#IndexTest_Unicode_Sql] ON #IndexTest ([Unicode_SqlCollation] ASC);
CREATE INDEX [IX_#IndexTest_ASCII_Win] ON #IndexTest ([8bit_WinCollation] ASC);
CREATE INDEX [IX_#IndexTest_Unicode_Win] ON #IndexTest ([Unicode_WinCollation] ASC);
-- Enable "Include Actual Execution Plan" (Control-M)

Tests:

This first test is merely a control to assure us that under “normal” conditions, the system works as expected and each of the indexes is indeed used.

-- First, make sure everything works as expected:
SELECT [IndexTestID] FROM #IndexTest WHERE [8bit_SqlCollation] = 'a';
SELECT [IndexTestID] FROM #IndexTest WHERE [Unicode_SqlCollation] = N'a';
SELECT [IndexTestID] FROM #IndexTest WHERE [8bit_WinCollation] = 'a';
SELECT [IndexTestID] FROM #IndexTest WHERE [Unicode_WinCollation] = N'a';
-- Index Seek for all 4 queries

Collation-01_SameDataTypeForColumnAndValue

These next two tests show the behavior when using a SQL Server Collation.

-- Next, switch up datatypes on comparisons for SQL Server Collations:
SELECT [IndexTestID] FROM #IndexTest WHERE [Unicode_SqlCollation] = 'a';
-- Index Seek
SELECT [IndexTestID] FROM #IndexTest WHERE [8bit_SqlCollation] = N'a';
-- Index Scan AND warning on CONVERT_IMPLICIT !!!!

Collation-01_UnicodeCol-SqlCollation-8bitData

As expected, there is no problem at all with the NVARCHAR column mixed with a non-Unicode value.

Collation-01_8bitCol-SqlCollation-UnicodeData

Also expected is the inability of the VARCHAR column to get an Index Seek when mixed with a Unicode value. Not only does that combination get an Index Scan, but it also has a warning on the implicit conversion. Again, the cause of this behavior is the drastically different sorting algorithms: simple ordering for VARCHAR data vs. complex linguistic rules for NVARCHAR data.

The final two tests show the behavior when using a Windows Collation.

-- Finally, switch up datatypes on comparisons for Windows Collations:
SELECT [IndexTestID] FROM #IndexTest WHERE [Unicode_WinCollation] = 'a';
-- Index Seek
SELECT [IndexTestID] FROM #IndexTest WHERE [8bit_WinCollation] = N'a';
-- Index Seek + Nested Loop + Compute Scalar + Constant Scan

Collation-01_UnicodeCol-WinCollation-8bitData

As expected, there is no problem at all with the NVARCHAR column mixed with a non-Unicode value (“Query 1” plan shown above).

Collation-01_8bitCol-WinCollation-UnicodeData

As many would probably not expect, while there is a slight hit to performance due to the implicit conversion (the “Compute Scalar” operator in the “Query 2” plan shown above), it does not get a warning. Also, unlike the VARCHAR column using the SQL Server Collation mixed with a Unicode value, this time, with the Windows Collation, we are back to getting an Index Seek :-). This is due to the Windows Collation using the same linguistic rules for both VARCHAR and NVARCHAR data.

Conclusion:

As shown above, the effect (on the ability to get optimal index usage) of mixing N[VAR]CHAR and [VAR]CHAR types is entirely dependent on which one of those is indexed and which type of Collation is being used:

  • If the indexed column is of type N[VAR]CHAR then there are no problems, regardless of which type of Collation is being used (since N[VAR]CHAR uses a Windows Collation in both cases).
  • If the indexed column is of type [VAR]CHAR and the column is using a Windows Collation, then there is a slight hit to performance for the implicit conversion (of the [VAR]CHAR value into N[VAR]CHAR), but you can still get an Index Seek due to the same linguistic sorting and comparison rules being used for both datatypes.
  • If the indexed column is of type [VAR]CHAR and the column is using a SQL Server Collation, then there is a significant hit to performance for both the implicit conversion (this combination even gets a warning), and getting an Index Scan instead of an Index Seek. This is the only combination to truly suffer from the datatype mismatch, and that is due entirely to different linguistic sorting and comparison rules being used for each datatype.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating