Case Sensitivity in Selects - Part 3

,

I've published two prior articles

(Case Sensitivity in Selects - Part 1

and Case Sensitivity in Selects - Part 2),

you'll benefit from reading

them before continuing with this one. I'd like to start by looking at part of a

great email I received from reader John Sands (check out his

blog!). As John explained, he read

the first article and wanted to make sure he understood the code AND it worked

before passing it around. Bravo! Anyway, he wrote because he was unable to

achieve the results I mentioned using the convert to binary option. I'll start

by including his text, then look at what happened. After that I'll discuss the

underlying problem that started this journey and see if I can come up with any

alternatives.

I did this update in Northwind to test it:

update Orders

set CustomerID = 'frans', ShipCity = 'torino'

where orderid = 10753

I did two columns to try nchar and nvarchar columns. CustomerID is

char(5) and ShipCity is nvarchar(15).

These queries return six rows:

select OrderID, CustomerID, ShipCity

from Orders

where CustomerID = 'frans'

select OrderID, CustomerID, ShipCity

from Orders

where ShipCity = 'torino'

OrderID CustomerID ShipCity

----------- ---------- ---------------

10422 FRANS Torino

10710 FRANS Torino

10753 frans torino

10807 FRANS Torino

11026 FRANS Torino

11060 FRANS Torino

(6 row(s) affected)

OrderID CustomerID ShipCity

----------- ---------- ---------------

10422 FRANS Torino

10710 FRANS Torino

10753 frans torino

10807 FRANS Torino

11026 FRANS Torino

11060 FRANS Torino

(6 row(s) affected)

I want to return just the lowercase row but NONE of these return any

results:

select OrderID, CustomerID, ShipCity

from Orders

where convert(varbinary(50), CustomerID) = 'frans'

select OrderID, CustomerID, ShipCity

from Orders

where convert(varbinary(15), ShipCity) = 'torino'

select OrderID, CustomerID, ShipCity

from Orders

where convert(varbinary(5), CustomerID) = 'frans'

select OrderID, CustomerID, ShipCity

from Orders

where convert(binary(5), CustomerID) = 'frans'

select OrderID, CustomerID, ShipCity

from Orders

where convert(varbinary(5), CustomerID) = convert(varbinary(5), 'frans')

select OrderID, CustomerID, ShipCity

from Orders

where cast(CustomerID as varbinary) = cast('frans' as varbinary)

select OrderID, CustomerID, ShipCity

from Orders

where cast(ShipCity as varbinary) = cast('torino' as varbinary)

select OrderID, CustomerID, ShipCity

from Orders

where cast(CustomerID as binary) = cast('frans' as binary)

select OrderID, CustomerID, ShipCity

from Orders

where convert(binary(5), CustomerID) = convert(binary(5), 'frans')

 

He's exactly right, none of those work! Took a few minutes to figure out, but

the culprit was that the customerid column was an nchar. Because it's a binary,

you have to make sure to compare apples to apples. To help me figure

out what was going on, I included the converts in the select so I could see

exactly what was occurring, like this:

select OrderID, CustomerID, ShipCity, convert(varbinary(5), convert(varchar(5), customerid)),
convert(varbinary(5), 'frans') from Orders
where convert(varbinary(5), convert(varchar(5), CustomerID)) = convert(varbinary(5), 'frans')

Just another 'gotcha' when dealing with mixed case sensitivity.

As I mentioned way back in part one this all started with me tracking down a

performance issue. That got me curious and now that I've explored this ugly part

of SQL, it's time to see if I can't solve the problem more elegantly. Here is

basically the table I'm dealing with at work:

A pretty standard lookup table right? Well, where we got off track is that

the product that implements this was meant to run on multiple database platforms

and because (as I understand it) not all databases supported case insensitivity,

they went with the lowest common denominator. All that means is that they always

did a case sensitive comparison. We could have stopped the madness by just

making the column unique! Instead, we have junk like I've shown above where we

have a case insensitive column that has duplicate values that aren't duplicates.

That means we have to account for them in every join we do. What's worse, we

can't even easily stop users from adding more since we can't set a unique index

(case insensitive).

The first option is to identify the duplicates and remove them. To do that

you have to find every place where it might have been used and replace it with

the new code. Potentially in any of 40 or so columns across 10 tables in each of

about 200 databases. Doable, just takes a little time to set up. Once done I

could add a unique index to make sure the problem doesn't recur.

Now that I think about it, what's wrong with that? Do it and be done!

One thought I had was to add a new column called perhaps CodeInsensitive and

store a binary representation of 'Code'. That way I could still use the

technique I offered in article #1 about doing the join plus the compare, but now

I'd have the big part of the work already done. Why convert each time if you can

just do it once? Easy enough to maintain via a computed column or triggers, and

it would be indexable. Even uniquely indexable. Worth doing? Questionable I

think, doesn't seem like I'd gain much in performance over what I've already

done.

Tried to think of a way to use a view to help, but it doesn't take me

anywhere.

The only remaining option I can see is to modify everything that does the old

style case sensitive join to use the new method I described. I'll have to check

to see how many stored procs this might touch. Changing them is probably faster

than doing all the updates to remove the duplicates, about even in risk maybe.

At this point I haven't decided. Removing the duplicates is clearly the right

thing to do, just changing the code is probably faster. Tradeoffs to be made.

I imagine this only scratches the surface of issues you can have using

multiple collations. I think I'll just try to avoid them. In closing, you might

think about how case sensitivity affects other tools/languages. XML is case

sensitive. VB6 and VB.Net are not case sensitive, but the string comparisons

they do are unless you've set Option Compare Text on (please don't). C# and many

others the language is case sensitive.

Rate

Share

Share

Rate