April 9, 2008 at 2:34 pm
I just read a TechNet article that says the Lookup transformation is case sensitive in its matches.
Can someone verify whether this can cause problems with matching against case-insensitive SQL databases?
April 9, 2008 at 9:22 pm
'fraid so, Brandie. I just ran into this recently, and had to create a derived column on which to do my lookups. I'm trying to get some time to play with SSIS in SQL 2008 to see if we will have the option to do a case-insensitive lookup like the one offered in Informatica.
By the way, I saw that you are presenting in Jacksonville next month. I'll be presenting a session on SMO programming.
Tim
Tim Mitchell
TimMitchell.net | @tmitch.net | Tyleris.com
ETL Best Practices
April 10, 2008 at 6:27 am
Case sensitivity is just the beginning. When using Lookup, Merge Join, Merge, Aggregate, Sort, or any other included component that does any sort of join or sorting, the component is case sensitive and unaware of most of your collation settings.
The comparisons are not done in T-SQL, so this makes perfect sense. It is easy to work around once you realize it. I tend to upper-case and trim everything just to be sure. In most cases this will not hurt performance anyway, so it is often good practice.
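For example (column and table names are made up here; adjust them to your own schema), the Derived Column expression would be:

UPPER(LTRIM(RTRIM(CustomerName)))

and the same clean-up done on the T-SQL side, if you would rather have the database do it, would be something like:

SELECT CustomerID, UPPER(LTRIM(RTRIM(CustomerName))) AS CustomerNameKey
FROM dbo.Customers;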
The sorting is one that has hit me a few times. Your OLE DB source can come from a sorted query, and you can go to the input/output section of the OLE DB component to indicate that the data set is sorted and by which columns, which lets you avoid using a Sort component on a large data set. If you then MERGE JOIN it with another data set that was sorted by an SSIS Sort component, the two can be sorted differently unless you are very careful.
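As a rough sketch (names invented again), one way to keep the two sides in agreement is to sort the source query on the same cleaned-up key you join on:

SELECT CustomerID,
       CustomerName,
       UPPER(LTRIM(RTRIM(CustomerName))) AS SortKey
FROM dbo.Customers
ORDER BY UPPER(LTRIM(RTRIM(CustomerName)));

then set IsSorted = True on the output and SortKeyPosition = 1 on SortKey in the Advanced Editor of the OLE DB source. Even then, spot-check that the two sort orders really agree before trusting the MERGE JOIN.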
I don't necessarily think any of this is entirely bad. I certainly would not consider it a bug. Case-insensitive (actually collation-configurable) joining and sorting components would be nice, but I can see how it would be a big task to do it correctly. I just think they should have noted this a bit better someplace that we would have seen it - like on the editor for the relevant components.
April 10, 2008 at 8:00 am
Hi Michael and Tim, thanks for your info. But I am new to SSIS, so I have a doubt: if I just want to have a lookup, then how could it match?
If in one table I have the name "ABC"
and in the other table it has the name "abc", then how could I write the statement so that it can look up all the names regardless of upper or lower case?
Thanks,
Chinna
Its the Journey which gives you Happiness not the Destination-- Dan Millman
April 10, 2008 at 8:09 am
Use a Derived Column transformation and create a copy of the column on which you want to match. When you copy the column, use the expression LOWER([original_column]), where [original_column] is the name of the column on which you will do the lookup. In your Lookup component, use the Query window rather than pulling straight from the table and use the SQL function LOWER([column_to_match]) to make that lowercase as well.
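For example (made-up names; substitute your own), the Derived Column expression would be:

LOWER([Name])

and the lookup query (a SQL query rather than picking the table) would be something like:

SELECT CustomerID, LOWER(Name) AS Name_Lower
FROM dbo.Customers;

Then map the new lowercase column from the pipeline to Name_Lower on the Columns tab.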
hth,
Tim
Tim Mitchell
TimMitchell.net | @tmitch.net | Tyleris.com
ETL Best Practices
April 10, 2008 at 8:15 am
That's the approach I would take as well (except I would use upper-case for some reason).
As an alternative, you could go to the advanced tab of the lookup control, turn on caching, and modify the SQL statement used for the comparison.
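Just as a sketch (table and column names invented), the parameterized statement you end up with on the advanced tab would look something like:

select * from
  (SELECT CustomerID, Name FROM dbo.Customers) as refTable
where UPPER([refTable].[Name]) = UPPER(?)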
April 10, 2008 at 8:34 am
Thanks Tim and Michael, got it.
Thanks,
Chinna
Its the Journey which gives you Happiness not the Destination-- Dan Millman
April 11, 2008 at 8:36 am
I've found that the Character Map transformation also works. But the question is, which one is better? Character Map or Derived Column?
For me, I think I'll drop the LOOKUP altogether if I can get away with it. It's taking too long to cache the 11 million records I'm trying to match against, and I can't figure out how to do a reverse lookup so that the records from the smaller DB (which is the one I'm inserting from) are the ones held in the cache. I'm going to try to do this task in a stored procedure and use that in my OLEDB source.
I look forward to meeting you, Tim. I hope I can get a chance to sit in on your SMO session as my skills with SMO aren't that great. (I've done exactly one thing with it. @=).
April 11, 2008 at 9:04 am
11M is a bit much for a lookup.
You could use a second data source and a merge join, but you will probably get the best performance out of staging the data and using a T-SQL join.
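For example (staging table and key names made up), once both sets are staged, something like this pulls only the rows that don't already exist on the other side:

SELECT s.*
FROM dbo.Stage_Mainframe AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Stage_Sql AS t
                  WHERE t.BusinessKey = s.BusinessKey);

The nice part is that the T-SQL comparison honors the column collation, so on a case-insensitive database the case problem goes away on its own.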
April 11, 2008 at 9:12 am
Michael,
The problem is that I'm working on "staging the data" right now. I've already got everything in two separate SQL DBs (the mainframe data uploaded to SQL, and the SQL data in a SQL DB). Now I just need to exclude the duplicate records from the two datasets.
The only solution I have found to date (other than T-SQL) is a LOOKUP to check for the existence of records. If anyone has any better suggestions than T-SQL or LOOKUP, I'd love to hear them.
April 11, 2008 at 9:28 am
Are you pulling data from the two data sets in the same data flow?
If you are, there would be very little "extra" overhead in sorting your data sets and using a merge join or a merge if all of your columns are the same.
Barring that, I would run both of your data sets into two staging tables and use T-SQL to union them together for the final import step.
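Roughly (names made up), that last step would be something like:

INSERT INTO dbo.FinalTable (BusinessKey, Name)
SELECT BusinessKey, Name FROM dbo.Stage_Mainframe
UNION
SELECT BusinessKey, Name FROM dbo.Stage_Sql;

UNION (rather than UNION ALL) drops the duplicate rows, and it does so using the database collation, so the case-sensitivity issue disappears as well.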
April 14, 2008 at 6:44 am
I've actually changed my two sources to two different dataflows that end up in the same destination staging table.
Unfortunately, while both sources do have crossover records, both sources also have records that are unique to each source. Maybe I'm wrong, but I don't think a MERGE JOIN would help me in this scenario. Would it?
Last I checked, the MJ transformation only offered INNER, RIGHT OUTER and LEFT OUTER for the joins when I really need a FULL OUTER sort of scenario for what I'm working with.
If I'm wrong, please let me know.
April 14, 2008 at 6:54 am
Ok - you're wrong.
The Merge Join options are INNER, LEFT OUTER, and FULL OUTER.
I actually never noticed that you cannot do a RIGHT join with it, but I suppose you can make either input the left input, so you would just reverse them.
April 14, 2008 at 9:48 am
If you're finding that 11 million records are taking too long to fully cache, why don't you try partial caching?
You may find it actually works quicker, depending on the number of records that ultimately match.
Experiment with the caching size. I had a table with a LKP using 18 million records. It was actually quicker to use partial caching than full, with an optimised query and indexes.
Paul R Williams.
April 14, 2008 at 10:56 am
Michael, thanks for the correction. For some reason I really thought it wouldn't allow FULL OUTER.
Paul, really? Everything I've read said that partial & no caching is much slower than full caching.
Well, now that I'm better informed, I'm going to try a few different things.
Did find out, though, that the expressions in Conditional Split are also case sensitive when it comes to field names, even if the field names come from a case-insensitive DB... So I guess I need to treat everything in the package as if it's case sensitive, just to be safe.
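For example, with a hypothetical LastName column, UPPER(LastName) == "SMITH" parses fine, but UPPER(lastname) == "SMITH" fails because the column name doesn't match the pipeline metadata exactly.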