Case Sensitivity in Selects - Part 2

,

In a

recent article I discussed some potential performance issues that result

from doing a case sensitive join on a column that is not case sensitive and

offered one alternative. This time I'm going to look at whether it would make

sense to change the collation on the column to case sensitive. You definitely

should read the earlier article before continuing.

I've started by setting the column to case insensitive. Running the following

two queries confirms that it is working correctly. The first query returns one

row, the second query returns zero rows.

select * from instantforum_members where fullname='andy warren' 
select * from instantforum_members where fullname='Andy warren'

Good. I made a new table that contained just an identity column as a primary

key and the fullname column, with the collation for the fullname column set to

case insensitive. I then ran a simple query to see what would happen.

select top 10 * from instantforum_members m inner join andytemp a on m.fullname = a.fullname

This results in an error:

 

This is a big problem. If you change the collation on the column and you have

existing queries that already join to it you may be breaking quite a few things.

Changing the query to include a collation (sensitive or insensitive as you

prefer) allows the query to run normally.

select top 10 * from instantforum_members m inner join andytemp a on m.fullname = a.fullname
collate sql_latin1_general_cp1_ci_as
select top 10 * from instantforum_members m inner join andytemp a on m.fullname = a.fullname 
collate sql_latin1_general_cp1_cs_as

Or does it? I set one row to have the value 'andy warren', the other to have

'Andy Warren', both changes made to the andytemp table, verified that I had the

same configuration in members (I was experimenting while writing). I ended up

testing four versions of the query:

--#1, sensitive, where on the sensitive column
select top 10 a.fullname as afullname, m.fullname as mfullname
 from instantforum_members m 
	inner join andytemp a 
		on m.fullname = a.fullname
 collate sql_latin1_general_cp1_cs_as 
 where m.fullname='andy warren'
--#2, sensitive, where on the insensitive column
select top 10 a.fullname as afullname, m.fullname as mfullname 
 from instantforum_members m 
	inner join andytemp a 
		on m.fullname = a.fullname
 collate sql_latin1_general_cp1_cs_as 
 where a.fullname='andy warren'
--#3, insensitive, where on the sensitive column
select top 10 a.fullname as afullname, m.fullname as mfullname 
 from instantforum_members m 
	inner join andytemp a 
		on m.fullname = a.fullname
 collate sql_latin1_general_cp1_ci_as 
 where m.fullname='andy warren'
--#4, insensitive, where on the insensitive column
select top 10 a.fullname as afullname, m.fullname as mfullname 
 from instantforum_members m 
	inner join andytemp a 
		on m.fullname = a.fullname
 collate sql_latin1_general_cp1_ci_as 
 where a.fullname='andy warren'

Care to guess how many rows each query returns?

  • #1    1 row
  • #2    2 rows
  • #3    2 rows
  • #4    4 rows

Column level collations give me a headache. We haven't even gotten back to my

original focus which was making the query perform better, what I see here is

that if I did change the collation I could get one of 3 possible result sets

back depending on how I modified the queries.

I've looked at it three times, the headache is getting worse. How can I

finish the article if I don't know why I get three different answers! Ever

notice that growing is painful?

How to figure out what is going on? I go back to look at the query plans as a

starting point. Here is what the cost of each was:

  • #1    1.15
  • #2    .00994
  • #3    .00994
  • #4    1.18

Now let's dive into each.

Query #1 only returns one row because there is only one row in members that

matches the where clause and it has only one match in andytest. It's generating

an index scan against andytest.

Query #2 matches two rows (I think) because andy warren = andy warren and

Andy Warren = Andy Warren and the where clause is applied to a case insensitive

column. The results of the query are shown below.

Query #3 is a case insensitive join that I expect to return 4 rows - the

product - but only two are returned, apparently because the where clause

restricts the member table to one row after the join is applied.

Query #4 is an insensitive join and the where applied on the other side, and

it does return the expected four rows. It also results in an index scan against

members, a result of having to convert the case sensitive column for the join.

Am I deciphering that correctly? I'm open to other interpretations.

As far as performance, the combination of the collation and where you apply

the where filter make all the difference. If the optimizer can restrict the

results up front by filtering against a column that doesn't require applying a

different collation it's fast. If it has to apply the collation and then filter,

it's slow.

To wrap up, applying the collation to the column solves one problem but it

potentially creates another (note that this is only when you are mixing

collations) in that where and how you filter makes a huge difference in how many

rows you get back. I'm not sure I'd want to try to explaining this or be

responsible for testing all the results. Maybe I'm just used to coding it and it

works out to be the same. Definitely we know that you can't just go changing the

collation without being prepared to fix everything that joins to it.

I've still got an idea or two to pursue on this, we'll take that up next

time.

Part 3 is available.

Rate

Share

Share

Rate