Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Tim Mitchell

Tim Mitchell is a business intelligence consultant, author, trainer, and SQL Server MVP with over a decade of experience. Tim is the principal of Tyleris Data Solutions and is a Linchpin People teammate. Tim has spoken at international, regional, and local venues including the SQL PASS Summit, SQLBits, SQL Connections, SQL Saturday events, and various user groups and webcasts. He is a board member at the North Texas SQL Server User Group in the Dallas area. Tim is coauthor of the book SSIS Design Patterns, and is a contributing author on MVP Deep Dives 2. You can visit his website and blog at TimMitchell.net or follow him on Twitter at twitter.com/Tim_Mitchell.

SSIS Alpha Splits using the CODEPOINT() Function

A relatively common requirement in ETL processing is to break records into disparate outputs based on an alphabetical split on a range of letters.  A practical example of this would be a work queue for collections staff based on last name; records would be pulled from a common source and then separated into multiple outputs based on a the Customer Last Name field, with the resulting output going to the person or group responsible for working that alphabetical subset of data.

There are a couple of different ways you can do this.  First is to use separate sources for each range of characters, and specify in your SELECT statement only those values that you want.  This is an effective quick-and-dirty option, but it doesn’t scale well as it requires multiple round trips to the database.  You could also accomplish this task using a simple text comparison for each letter of the alphabet, but this method is a typing-intensive operation.  For example, let’s say you want to group together the records for customers whose last names falls in the A-F range.  Using the Conditional Split transformation, your A-F output expression would look something like the following:
 

SUBSTRING(UPPER(LastName), 1, 1) == "A"
|| SUBSTRING(UPPER(LastName), 1, 1) == "B"
|| SUBSTRING(UPPER(LastName), 1, 1) == "C"
|| SUBSTRING(UPPER(LastName), 1, 1) == "D"
|| SUBSTRING(UPPER(LastName), 1, 1) == "E"
|| SUBSTRING(UPPER(LastName), 1, 1) == "F"


Your other groups would contain a similar statement to explicitly define each letter to be included in the group.  Not a complex operation, but one that requires a lot of typing.

 

An Easier Way

An easier way to do this is to use the relatively obscure CODEPOINT() function.  This method, which is part of the SSIS expression language, returns the numerical Unicode decimal value of the leftmost character of the input string.  The above grouping would be rewritten as follows using the CODEPOINT() function:

CODEPOINT(UPPER(LastName)) >= 65
&& CODEPOINT(UPPER(LastName)) <= 70


The difference is, rather than enumerating each possible starting letter within the range, I’m now evaluating the Unicode value of the first character in my LastName text field, and only including those in the 65 to 70 range (A through F inclusive) in this output.  I’ve saved myself a little typing, and this approach is easier to maintain and troubleshoot in my opinion.  A sample conditional split with four groupings is shown below:


screen1

 

Take It Up A Notch

So you might ask, “That’s great, smart guy, but why go through this just to save myself maybe 5 minutes of typing?”.  I’m glad you asked!  Let’s take our example a little bit further and assume we’re breaking these groupings down into smaller units.  Consider the possibility that, rather than grouping last names together based on the first letter of the last name, we’ve got a sufficient number of outputs that we’re now splitting the records within that first letter; for example, if we were to split the data stream where the last name starts with an M, we might slice our outputs on those starting with MA to MI, then MJ to MR, and finally MS to MZ.  By using the direct comparison method described above, our fully configured conditional split could have up to 26^2 possible permutations, which means we’ve got to do 676 comparisons (assuming all uppercase alpha characters) within the conditional split transformation, which will likely impact your package performance, not to mention the immense amount of typing required to set this up.  Fortunately, some creative use of the CODEPOINT() function can simplify this ETL requirement.

For this example, let’s assume that we need to separate our records within the letter M into three distinct groups as mentioned earlier, since statistically there are a lot of last names beginning with M.  For each “M” output, I’m going to use an direct string comparison to verify that the first letter is an M (since we’re looking for a single match and not a range in the first character), and second, I’ll use CODEPOINT() in conjunction with the SUBSTRING() function to check that the second letter falls within the expected range for each output. 

So for our first M grouping, the MA to MI group, the following expression would be used:

SUBSTRING(UPPER(LastName), 1, 1) == "M" 
&& (
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 65
&&
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 73)


The code above will match records where the first letter is a literal M, and the second character is between A (Unicode 65) and I (Unicode 73) inclusive. Similarly, the MJ to MR expression reads as such:
 

SUBSTRING(UPPER(LastName), 1, 1) == "M" 
&& (
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 74
&&
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 82)


And finally, the MS to MZ expression:

SUBSTRING(UPPER(LastName), 1, 1) == "M" 
&& (
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 83
&&
CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 90)

 

The partially configured conditional split transformation would look similar to the following:

screen2

So you can see that you’ve still got a small chunk of code to write (or copy/paste and modify) for each of your outputs, but it’s far less trouble – and better performing, no doubt – than enumerating all of the possible combinations of the first two letters of the LastName field.  The further you go into the string for your split (for example, breaking all the way down to split “McA” to “McF”, “McG” to “McN”, etc.), the more significant your efficiency in using this method over direct comparisons.

One caveat that bears mentioning: You’ll notice that I’ve used the UPPER() function generously in these examples.  The reason for this is twofold: First, a direct string comparison in the SSIS expression language is case sensitive; for example, “M” does not equal “m”.  Second, the same holds true for the Unicode decimal values returned by CODEPOINT().  Uppercase M, or Unicode value 77, does not equal lowercase m, or Unicode value 109.  Use of the UPPER() function helps to ensure that we’re making accurate comparisons regardless of case.


Conclusion

The CODEPOINT() is a rarely used function in the SSIS expression language, but can be an effective tool in your ETL arsenal in some cases.  For alphabetical grouping or splitting of records, it’s a very handy function that helps to reduce a lot of typing at design time.

More information about the CODEPOINT() function can be found at this page on MSDN.

Comments

Posted by Jack Corbett on 4 February 2010

Nice post.  Interesting as I'd never heard of the function.

Posted by Dugi on 4 February 2010

Thanks for the info about this function ...!

Posted by Steve Jones on 4 February 2010

Interesting. Is this function used often in .NET work? I would never have thought of something like this.

Posted by Dugi on 4 February 2010

I can see that many of users didn't use it, so I'm gonna ask you Tim did you use it before!?

Posted by edwisdahl on 5 February 2010

Interesting way to do that split.  I'm just curious, have you tried just use regular expressions to do this type of matching?

Posted by Tim Mitchell on 5 February 2010

The regular expression idea is interesting, but unfortunately there aren't many good ways to use a regex within the native SSIS controls (though I understand there are some good third-party options).

Steve, I haven't come across another CODEPOINT function elsewhere, though most languages (including those of the .NET variety) allow converting a single character to its integer value, which is essentially what we're doing here.

Posted by Tim Mitchell on 5 February 2010

Dugi, the only production use I've had for this is for the alpha split.  I can imagine some other possible uses, though, now that I've got this approach in the toolbox for future use.

Posted by jez on 12 February 2010

Interesting, I suppose the downside is 65 and 70 isn't as easy to understand as A and F.

Posted by mvelusamy on 12 February 2010

It sounds good to know the new function however as Jezz* pointed out the downside of the usage.

SUBSTRING function is more powefull in anycase ..

Say for range of values

Example # 1

SUBSTRING(UPPER(LastName), 1, 1) == "A"

|| SUBSTRING(UPPER(LastName), 1, 1) == "B"

|| SUBSTRING(UPPER(LastName), 1, 1) == "C"

|| SUBSTRING(UPPER(LastName), 1, 1) == "D"

|| SUBSTRING(UPPER(LastName), 1, 1) == "E"

|| SUBSTRING(UPPER(LastName), 1, 1) == "F"

This can be re-written as

SUBSTRING(UPPER(LastName), 1, 1) >= "A"

&& SUBSTRING(UPPER(LastName), 1, 1) <= "F"

Example #2

For the MS to MZ expression:

SUBSTRING(UPPER(LastName), 1, 1) == "M"  

&& (CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 83

&& CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 90)

Re-write as

SUBSTRING(UPPER(LastName), 1, 2) >= "MS"  

&& SUBSTRING(UPPER(LastName), 1, 2) <= "MZ"  

Similary we can use for any range / any combination of string values. This gives benefit to understand the functionality more clear than CODEPOINT() which is smilar to the ASCII value of the character only for the LEFT MOST single char.

Posted by drwhitaker on 12 February 2010

I was not aware of CODEPOINT - now I will have to investigate.

Posted by Jedak on 12 February 2010

@Steve CodePoint is not a function in the .NET Framework as far as I can tell.  I appears to be something specific to SSIS.

Posted by Jason Marshall on 12 February 2010

Great post, Tim

Posted by djnewman on 12 February 2010

Although this is a nice hack, wouldn't it faster and cleaner to select a new column from the source data with the T-SQL ASCII() which is the CODEPOINT equivalent function? Or barring even that, you could just substring the first letter of the last name and use between. That would remove 99% of the code above as you could do the grouping with SQL views or result sets.

In C#.Net, the equivalent is rather messy because you need to convert the character to an integer. VB.Net has more support with the asc() function.

In the SQL case and the .Net case, there may be issues with Unicode being 16 bits instead of 8 bits. I'm not certain, but this could affect the CODEPOINT work above as well if you have different codepage data stored.

Posted by Tim Mitchell on 14 February 2010

DJ, you're right that there are easier ways to do this on the source; however, in doing so you have to make a few assumptions.  First, the data source has to be a relational database, and specifically must be one that supports such a function.  SQL Server does, of course, but other RDBMS systems may or may not.  Further, the source may not be a database at all, and could be a text file or other unmanaged data store, in which case you must do this in the ETL layer rather than the source.

Also, as I mentioned in the article, doing your grouping on the source will work but doesn't scale well, especially if you have a large number of groups that will require a round trip to the database (as well as the administrative overhead of configuring a separate source in SSIS) for each logical grouping of records.

Tim

Posted by hrvoje.piasevoli on 1 March 2010

The proposed way of solving the stated problem is bad practice. The proper design is already given by a user mvelusamy but can be even more simplified.

Using proper operator on the string datatype (<,>,=,>=,<=) results in a readable, flexible and efficient solution.

Example 1: all names beggining with letters A to F inclusive

UPPER([LastName]) >= "A" && UPPER([LastName]) < "G"

Notice that there is no need for SUBSTRING().

Exapmle 2: (So for our first M grouping, the MA to MI group, the following expression would be used: )

UPPER([LastName]) >= "MA" && UPPER([LastName]) < "MJ"

(Similarly, the MJ to MR expression reads as such: )

UPPER([LastName]) >= "MJ" && UPPER([LastName]) < "MS"

(And finally, the MS to MZ expression: )

UPPER([LastName]) >= "MS" && UPPER([LastName]) < "N"

Notice that there is no need for SUBSTRING().

But still there is one other point when doing alpha splits that has to do with collation. Be aware that the string data that you are splitting may be in some collation that has sorting rules more complex than comparing the 1st character. For example, in Croatian we have letters in alphabet that are effectively composed of two characters ("NJ", "LJ", "DŽ"). So the ascending sort order for the words "LUKA" and "LJUBO" is "LUKA","LJUBO" because in the specific collation letter "L" commes before letter "LJ".

Posted by Tim Mitchell on 7 March 2010

Hrvoje, thanks for pointing out a better way to do this.  You're right that using the direct character comparison works better than the CODEPOINT() method I suggested.  I posted a follow-up blog to this here: www.timmitchell.net/.../alpha-split-in-ssis-redux.aspx

Leave a Comment

Please register or log in to leave a comment.