Blog Post

Conversion Between Unicode and Non-Unicode Data Flow Pattern

,

I don't think I'm alone when I find myself having to convert several columns in my Data Flow from Unicode strings to non-Unicode strings, or vice versa.  Reading from text files, Excel sources, ODBC drivers, or lookups resulted in my Data Flow being populated with one string type or the other. 

My destination (naturally) wanted the other type of string, and SSIS's type strictness rightly prevented a straight-up implicit conversion, resulting in "cannot convert between unicode and non-unicode string data type" errors:

Under the covers, I'm trying to push a non-unicode string column called UserName into a database table that wants a unicode string.  What's really occurring in the data flow looks something like this, and it doesn't work:

The "By The Book" Solution

To explicitly convert those columns, you're supposed to use a Data Conversion component (although a Derived Column will serve this purpose admirably as well).  When you hook the Data Conversion component up to the flow, you select the columns you want to convert, and select the (target) Data Type. 
The component (unhelpfully) names this new converted column "Copy of [original column name]" which you can choose to alter, and you carry on with the rest of your flow.  This is how the data flow now looks:
I've shown the "automatic" mappings with green arrows, and the things you have to manually set with the blue arrows.  In the above, the blue arrows represent the work you have to do to set up the Data Conversion component to change the data type (creating the new column) and the work you have to do to manually change the mapping in the OLE DB Destination.

Issues With "The Book"
The prescribed solution above works perfectly... for perfect people... but I'm not one of those.  I frequently found myself stumbling over a few problems.
I really don't like the "Copy of" name that was given to the new column.  Primarily because it wasn't a copy of that column - it was a derived value.  To me, that was like using a Derived Column transformation to create a new column from an existing date column called "Today" with the expression DATEADD("day", 1, [Today]) and calling the new column "Copy of Today".  It's not.  It's tomorrow.  Of course, SSIS has absolutely no way of knowing this semantic distinction.  I found myself changing the default converted output column name to things like "original column_str"... but that didn't quite sit right because of the occasional mixture of spaces and underscores.  (Yeah, I'm anal that way.)
With a little counseling, I could probably see past the naming issues... except that I would find myself inadvertently using the original column later on in the data flow - because it was named appropriately, and appeared in the column list before the converted column I should have been using.  Quite frequently, I found my mistake at the end of the data flow after realizing I'd used the wrong column all the way through the flow.

Facepalm.

In order to protect myself and still properly transform the data, I found myself performing the following "dance"...
The Workaround
The end result I wanted was to convert the columns to the desired data type, but keep the column name the same - because it was a column name I (and those that will read my package later) would understand.  I also wanted to remove the possibility that the original inappropriately typed column couldn't be used.  I also wanted to make it perfectly clear what was happening in my workaround - being clever is fine... but don't outsmart yourself (or the next guy).

The workaround consists of using a Data Conversion component immediately followed by a Union All.  The Data Conversion did its standard job - creating a new, converted column with a dumb name.  The purpose of the Union All was to reconstruct the data flow by eliminating the "original" columns, and rename the converted columns back to the original names, like this:

You'll see that I've still used the green and blue arrows here - the result I'm looking for is that last green arrow, the automatic mapping of the correct column name.
Using the Pattern - Pros and Cons
The above workaround is foolproof - I know it is because it saves me from making that mistake every day.  I use it every time I do a simple data type conversion, whether it's strings to integers or floats to currency.  That's the overwhelming positive for me.  However, there are drawbacks to using this pattern - the Union All is an asynchronous component, which ends up throwing a few curves our way. 
On the good side, it's not a "blocking" component - it doesn't hold up all the incoming rows until it's seen them all before sending rows out (like the Sort component).  This is a good thing, because it doesn't consume a lot of memory.
On the bad side, it causes the Data Flow engine to stop using the buffer (block of memory) it has the data in, and causes a memory copy to be made into a new buffer of a different "shape" (columns and data types).  It does this because it assumes that I am joining several flows together, as well as renaming and removing columns from the buffer.  Joining several flows together (probably) means that they all have different shapes, and as such, you can't simply pass them out the other side - the output has to have a consistent shape.
The end result (regardless of reasons) is that a memory copy takes place in this pattern where it otherwise shouldn't have to - and that can slow down your data flow in two ways.  First, the copy simply takes time.  Second, copying means that you're breaking the "execution tree" of the data flow into two parts manually - meaning SSIS's optimizer is stuck with working around that decision, which may result in a less optimal worker thread allocation.  (In SSIS 2005, it could actually improve performance in long flows, as 2005 didn't optimize those execution trees as well as 2008 does.)
Still Unresolved
Even with this pattern, I still have one of my original gripes unresolved, and have actually added some more.  I still have a ton of clicks to go through to set this up, especially when I have more than one column to convert.  I've just published a component on CodePlex that addresses those issues for me - and I'll blog about that next...

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating