So I’ve discovered another benefit of being a technical blogger. Not only do you get some kudos when you write something that helps someone else, but if you offer up a less-than-optimal solution, you’ll get some suggestions on how it can be done better. I’ve had my share of the former, but earlier this week I experienced the latter.
Last month I blogged about using the SSIS CODEPOINT() function to make it easier to split data streams based on a character value. Although the method I suggested gets the job done, of my blog readers wisely pointed out to me that you can do range comparisons on character data using the SSIS expression language. So, rather than doing a greater/less than comparison on the ASCII values of each of the significant comparison characters (the first letter, or perhaps the first two letters if you’re splitting deeper than just the first letter), you can simply run an alphabetical comparison, such as the following:
As you can see, the syntax and complexity of this solution is much simpler than what I proposed using CODEPOINT(). Thanks to Hrvoje Piasevoli for pointing out a better way to do this.
A relatively common requirement in ETL processing is to break records into disparate outputs based on an alphabetical split on a range of letters. A practical example of this would be a work queue for collections staff based on last name; records would be pulled from a common source and then separated into multiple outputs based on a the Customer Last Name field, with the resulting output going to the person or group responsible for working that alphabetical subset of data.
There are a couple of different ways you can do this. First is to use separate sources for each range of characters, and specify in your SELECT statement only those values that you want. This is an effective quick-and-dirty option, but it doesn’t scale well as it requires multiple round trips to the database. You could also accomplish this task using a simple text comparison for each letter of the alphabet, but this method is a typing-intensive operation. For example, let’s say you want to group together the records for customers whose last names falls in the A-F range. Using the Conditional Split transformation, your A-F output expression would look something like the following:
SUBSTRING(UPPER(LastName), 1, 1) == "A" || SUBSTRING(UPPER(LastName), 1, 1) == "B" || SUBSTRING(UPPER(LastName), 1, 1) == "C" || SUBSTRING(UPPER(LastName), 1, 1) == "D" || SUBSTRING(UPPER(LastName), 1, 1) == "E" || SUBSTRING(UPPER(LastName), 1, 1) == "F"
Your other groups would contain a similar statement to explicitly define each letter to be included in the group. Not a complex operation, but one that requires a lot of typing.
An Easier Way
An easier way to do this is to use the relatively obscure CODEPOINT() function. This method, which is part of the SSIS expression language, returns the numerical Unicode decimal value of the leftmost character of the input string. The above grouping would be rewritten as follows using the CODEPOINT() function:
CODEPOINT(UPPER(LastName)) >= 65 && CODEPOINT(UPPER(LastName)) <= 70
The difference is, rather than enumerating each possible starting letter within the range, I’m now evaluating the Unicode value of the first character in my LastName text field, and only including those in the 65 to 70 range (A through F inclusive) in this output. I’ve saved myself a little typing, and this approach is easier to maintain and troubleshoot in my opinion. A sample conditional split with four groupings is shown below:
Take It Up A Notch
So you might ask, “That’s great, smart guy, but why go through this just to save myself maybe 5 minutes of typing?”. I’m glad you asked! Let’s take our example a little bit further and assume we’re breaking these groupings down into smaller units. Consider the possibility that, rather than grouping last names together based on the first letter of the last name, we’ve got a sufficient number of outputs that we’re now splitting the records within that first letter; for example, if we were to split the data stream where the last name starts with an M, we might slice our outputs on those starting with MA to MI, then MJ to MR, and finally MS to MZ. By using the direct comparison method described above, our fully configured conditional split could have up to 26^2 possible permutations, which means we’ve got to do 676 comparisons (assuming all uppercase alpha characters) within the conditional split transformation, which will likely impact your package performance, not to mention the immense amount of typing required to set this up. Fortunately, some creative use of the CODEPOINT() function can simplify this ETL requirement.
For this example, let’s assume that we need to separate our records within the letter M into three distinct groups as mentioned earlier, since statistically there are a lot of last names beginning with M. For each “M” output, I’m going to use an direct string comparison to verify that the first letter is an M (since we’re looking for a single match and not a range in the first character), and second, I’ll use CODEPOINT() in conjunction with the SUBSTRING() function to check that the second letter falls within the expected range for each output.
So for our first M grouping, the MA to MI group, the following expression would be used:
SUBSTRING(UPPER(LastName), 1, 1) == "M" && (CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 65 && CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 73)
The code above will match records where the first letter is a literal M, and the second character is between A (Unicode 65) and I (Unicode 73) inclusive. Similarly, the MJ to MR expression reads as such:
SUBSTRING(UPPER(LastName), 1, 1) == "M" && (CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 74 && CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 82)
And finally, the MS to MZ expression:
SUBSTRING(UPPER(LastName), 1, 1) == "M" && (CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) >= 83 && CODEPOINT(SUBSTRING(UPPER(LastName), 2, 1)) <= 90)
The partially configured conditional split transformation would look similar to the following:
So you can see that you’ve still got a small chunk of code to write (or copy/paste and modify) for each of your outputs, but it’s far less trouble – and better performing, no doubt – than enumerating all of the possible combinations of the first two letters of the LastName field. The further you go into the string for your split (for example, breaking all the way down to split “McA” to “McF”, “McG” to “McN”, etc.), the more significant your efficiency in using this method over direct comparisons.
One caveat that bears mentioning: You’ll notice that I’ve used the UPPER() function generously in these examples. The reason for this is twofold: First, a direct string comparison in the SSIS expression language is case sensitive; for example, “M” does not equal “m”. Second, the same holds true for the Unicode decimal values returned by CODEPOINT(). Uppercase M, or Unicode value 77, does not equal lowercase m, or Unicode value 109. Use of the UPPER() function helps to ensure that we’re making accurate comparisons regardless of case.
Conclusion
The CODEPOINT() is a rarely used function in the SSIS expression language, but can be an effective tool in your ETL arsenal in some cases. For alphabetical grouping or splitting of records, it’s a very handy function that helps to reduce a lot of typing at design time.
More information about the CODEPOINT() function can be found at this page on MSDN.
So the question came up earlier today about the RIGHT() and LEFT() functions in the SSIS expression language. Like the Transact-SQL functions, one might assume that these functions would exist in SSIS expression language to snatch a specified subset of a string. That assumption would be only half right.
Don’t go digging for a LEFT() function in the expression language, because it ain’t there. The RIGHT() function does indeed retrieve a specified number of characters, but strangely enough, there’s no corresponding LEFT() function:
Even though this is a pain for those just learning the expression language syntax, there are a couple of easy workarounds: One could simply use the SUBSTRING() function, with the second parameter – the starting point in the string – set to 1, which yields the same result. If you want to get really crafty, you could use the RIGHT() combined with the REVERSE() function to simulate the behavior expected.
There’s already a Microsoft Connect item for this issue, and it’s planned to be fixed in a future version.
I got the opportunity to present to the PASS DBA Virtual Chapter today, discussing the properties and practical uses of SSIS expressions and package configurations. Thanks to Greg Larsen and the other members of this virtual chapter for allowing me to present. We had a good turnout, about 40 people, which is not bad for a lunchtime presentation.
I’ve published the code samples and slide deck if you’re interested; these can be downloaded here. The LiveMeeting session was recorded, and should be published on the PASS website soon.
Join me tomorrow at 1:00pm CST as I present "Dynamic SSIS: Using Expressions and Configurations" for the PASS DBA virtual chapter:
In this session, we'll review the use of expressions and configurations that can help make your SSIS packages more dynamic and flexible. We'll cover the basics of the SSIS expression language and will demonstrate some practical uses, and will discuss using package configurations to abstract hard-coded package settings into external resources for maximum flexibility.
To join the LiveMeeting: https://www.livemeeting.com/cc/usergroups/join?id=75P98G&role=attend
Next week, I have the honor of presenting two different sessions on SSIS. These 2 events are the last speaking engagements on my calendar for this year:
On Monday, December 14 at 11:30 CST, I’m presenting a SQL Lunch session to discuss looping logic in SSIS using the For Loop and the For Each Loop. This will be a working session, consisting almost exclusively of demos. Thanks to Patrick LeBlanc for yet another opportunity to present to this group.
The following day, Tuesday, December 15th at 11:00 CST, I’ll present “Looping, Moving Files, and Splitting Data Streams: Intermediate SSIS Tasks for the DBA” for the PASS DW/BI virtual chapter.
See you there!
For those interested in the SQL Server BI space, there is a new online resource launching today. Brian Knight and my friends over at Pragmatic Works have released the Business Intelligence Developer Network, a virtual community of business intelligence professionals. You’ll find discussion forums, blogs, articles, a file/script sharing repository, and a community event calendar. All resources are free, though you’ll need to create an account (also free) to access some of the content.
Want to become an author? BIDN welcomes submissions from the community, and even offers a little compensation for articles published on the site. This is an excellent way to help others, build a new community, and strengthen your own street cred.
I’d like to say thanks to Andy and Judith over at the Ft. Worth SQL Server User Group for having me as a guest last Wednesday. I got to meet a few new people, and meet up with several others whom I hadn’t seen in a while. I look forward to visiting over there again.
For those in attendance, I’ve posted my slide deck and sample code here.
If you’re in the Ft. Worth area next Wednesday, November 18, please join me at the Ft. Worth SQL Server User Group meeting at 6:30, where I have the honor of presenting my session entitled Intermediate SSIS. This will be my first trip to the Ft. Worth group (my home group is over in Dallas), so I’m looking forward to meeting new people as well as the opportunity to speak.
More information and directions can be found on the user group website.
Looking forward, I’m also doing a SQL Lunch presentation next month, and am trying to squeeze in another talk for the PASS BI Virtual Chapter before the end of the year; I’ll post info for those events shortly.
Earlier this week, I delivered a presentation entitled “Deploying Report Builder 2.0 for Self-Service Reporting” as part of the new SQL Lunch series. This was my first presentation for this series, and I was pleasantly surprised with the turnout – thanks to the 110 people who showed up! We did have a technical snafu: apparently the presenter link was mistakenly sent out to some attendees, and we had several dozen people who were authorized as presenters, which caused some problems including a disconnect in the middle of the session. Fortunately, most if not everyone got reconnected quickly and I was able to finish on time without skipping any material. Sadly, the disconnect also shut off the recording, but I’ll consider presenting this session again since we had such a high level of interest.
I’ve found that I learn something new each time I present to a group, and this time was no exception. One of the attendees shared that Report Builder 3.0 is in CTP and available for download. Perhaps I’ll get started on the CTP and give another talk on it around the release of SQL Server 2008 R2. Ping me if you’d be interested in this…
The slide deck from Monday’s presentation is now available on my website.
So you’ve got some packages that regularly extract data to one or more text files, but you know that from time to time some of the queries will not return any data. However, you find in SSIS that, in a flat file export package, the output file is created regardless of whether any rows are written to the file, and in cases where there are no rows returned from the source, you’ll end up with empty data files in your output directory. Although not a critical failure, having empty output files can be a nuisance, and unfortunately, there aren’t any native settings in the Flat File Destination or the Flat File Connection Manager objects that will clean up empty files. Fortunately, SSIS does provide a combination of other tools that you can use to emulate this behavior.
In our example, I’m going to create a package to extract data from a table using a query for which I know that no rows would be returned. When I connect this data source to a flat file destination and execute the package, I’ll see an empty data file in my export directory. Next, to demonstrate the intended behavior, I’ll add a Row Count transformation to store the number of affected rows in a variable, and create a File System Task object to delete the output file. Finally, I’ll use an expression constraint to only delete the file if the variable attached to the row count is zero.
First, let’s create a test table to query:
USE testdb GO CREATE TABLE invoices ( invoiceid INT IDENTITY ( 1 , 1 ), vendorid INT, invoicedate DATETIME, invoiceamount DECIMAL(10,2)) INSERT invoices VALUES(12, '8/3/2009', 4125.50), (53, '8/13/2009', 1095.25), (46, '8/15/2009', 729.50), (33, '8/23/2009', 3421.50)
Now, I’ll create a basic package that will export to text the invoices for the past 30 days, a reasonable business requirement. Since we don’t currently have any invoices matching that criteria, we’ll end up with an empty output file. The original package is shown below:
Now for the new-and-improved version, I’m going to drop a Row Count transformation into the data flow to save the number of affected rows to a variable:
Finally, I’ll create a File System Task and configure it to delete the output file. To insure that a valid data file is not deleted, I’ll create a precedence constraint using an expression to only execute the delete if the row count variable is equal to 0:
When you execute this package, you’ll see that the File System Task object is executed because there are no rows matching our query. You can test the package by inserting another row into the database that will be returned by the query, and you’ll see that the data file is exported but not deleted.
I’ve attached the before and after SSIS packages if you’d like to take it for a test drive. Enjoy!
It's been well-documented through myriad blogs and forum posts about the case sensitivity of the comparisons in the SSIS lookup transformation (a good review can be found here). In a nutshell, a comparison using the lookup transformation is case sensitive when using the default setting of Full Cache, even if the values in the database are stored in a case insensitive collation. This happens because, in Full Cache mode, the comparison is done by the SSIS engine rather than the database engine, the former of which differentiates between "Value", "VALUE", and "VaLuE" regardless of the collation setting. But did you know that this transformation is space sensitive as well? Consider the following T-SQL code: SELECT Cast('char value' AS CHAR(20)) [val1] INTO test1 SELECT Cast('char value' AS CHAR(40)) [val2] INTO test2 SELECT t1.val1, t2.val2 FROM test1 t1 INNER JOIN test2 t2 ON t1.val1 = t2.val2 As you would expect, executing this code results in a successful match (INNER JOIN), even though we're comparing CHAR values of differing lengths (for more information, see this article for more information about spaces and padding in SQL Server). However, when the same comparison is run through an SSIS lookup transformation in Full Cache mode, the lookup on our sample data will not result in a match. Similar to the case sensitive lookup, you'll find that the SSIS engine would treat the strings 'Hello World' and 'Hello World ' (note the trailing spaces) as dissimilar values. Unlike in SQL Server, trailing whitespace is significant in SSIS value comparisons.
As a workaround, you can use the TRIM() function in SSIS and the RTRIM() T-SQL function to insure that your comparisons are ignorant of trailing whitespace. Alternatively, you could use a cache mode other than Full Cache, but you should be aware of the other implications before making such a change.
Note that this behavior is limited to fixed-length character fields, but could lead to some unexpected and hard-to-detect problems if you aren’t aware of the behavior.
In the next few months, I’ll be giving a couple of talks on SQL Server business intelligence. For October, I have the pleasure of presenting for SQL Lunch, a new online learning series pioneered by Patrick LeBlanc. On October 12th, I’ll be discussing ways to leverage SQL Server Report Builder 2.0 against your existing SSRS infrastructure to allow users to create their own ad-hoc and published reports.
Also, I’ll be visiting the Ft. Worth SQL Server User Group on November 18th. I’ll be discussing Intermediate SSIS, demonstrating some of the more complex tasks and transformations and demonstrating how to leverage these tools to assist with challenging ETL scenarios. Hope to see you there!
Recently, my friend Jack Corbett asked a question on Twitter:
In a nutshell, the SSIS foreach loop will enumerate a given list of items (files in a directory, nodes in an XML file, static list of values, etc.) and will perform some operation for each of the items in the collection. This behavior is similar to foreach loop constructs that are found in most high-level programming languages. These programming elements normally include a continue statement, which allows you to stop processing of the current item and move on to the next one. Jack hoped to find this capability built into the SSIS foreach loop, but unfortunately it doesn’t exist natively.
However, there is a workaround that will allow you to reproduce the continue statement in the SSIS foreach loop. By including a “dummy” data flow within the foreach loop, we can use a precedence constraint to skip some of the items in the list based on the value of a variable. To demonstrate, I’ll use a foreach loop to make a copy of all of the files in a given directory except for one specific file. I'll first set up the foreach loop in the control flow pane to enumerate over the files in a particular directory, by using the Foreach File Enumerator setting:
Now, within the foreach loop above, I create two data flows; one that does nothing at all, and the other that reads the current file in the list and writes the data out to a new file, the name of which is set using an expression.
Next comes the important part: I create a precedence constraint from my dummy data flow to the one that actually performs the work, and I’ll edit the expression to exclude one of the file names:
In this scenario, the package flow will only move from the dummy data flow to the one containing our business logic if the variable value, in this case the current filename, does not match the one we intend to skip.
So the short answer is that the SSIS foreach loop doesn’t have native capability to skip processing for certain values, but using a do-nothing container and a precedence constraint within the loop is a creative and effective substitute.
I see a lot of questions on the forums about updating existing data in SSIS. When the update is dynamic and is based on elements within your data flow, a common mistake is to use the Ole DB Command within the data flow pane:
The above poorly designed update has us retrieving data from some source, and then running an update on our target database based on one or more values in the source. Does it work? Sure. The problem occurs when you touch more than a handful of rows. When you bring in the Ole Db command into the picture in a data flow, you’ll be firing the statement in this control once for every row of data in your pipeline. So let’s say our source retrieves 100,000 rows into the data flow: the downstream UPDATE command will be executed 100,000 times! Such operations could bring the most capable server to its knees.
A better solution could include staging your data. Using this method, you’ll retrieve the data from the source and write it into a staging table in the destination database. You can then use an Execute SQL task to run your update in a more organic manner.
Set up your data flow as shown above to pull in the data to a staging table, then you can run a single UPDATE statement:
UPDATE i SET InvoiceAmount = st.UpdatedInvoiceAmount FROM Invoices i INNER JOIN StagedData st ON i.InvoiceID = st.InvoiceID
The advantage here is that you’re executing the expensive UPDATE statement once for each table rather than once for each row affected.
Of course, there are some situations that explicitly disallow the use of staging tables in destination systems. If storage or access restrictions keep you from using this method, you may have to use the row-by-row insert, so be aware that it's going to be a bottleneck.