I have gotten a number of emails over the past few days asking about how I import binary files into SSIS as well as how to improve throughput by making tasks parallel. I have, scattered throughout this blog, articles which show bits and pieces. I have articles on how to use the import file task, articles on how to use the Enhanced Threading Framework I put together, articles on using SHA-1 to find duplicates, etc. This is all well and good for the person who is looking for a specific piece of the puzzle, but what about the person who wants the whole puzzle? I put together a sample SSIS package that has all of the components (including sql scripts to build the database pieces):
The source can be downloaded here.
I have become accustom to modifying package variables within my Script Task using the VariablesDispenser object. It is convenient and powerful. I started working with this object due to issues I found while hacking my way through SSIS2005. I would put the variables in the ReadOnly or ReadWrite section and would get weird errors attributed to exclusive locking..blah,blah,blah. I naturally carried this practice over to SSIS2008 and things work the same; in theScript Task.
Getting heavily involved in SSIS development I broke down and decided to do more Script Component work. Prior to SSIS I had done my absolute best to keep away from VB.Net and along with that, Script Components. I just didn't like the way VB 'felt'. To be honest, there is nothing wrong with choosing or avoiding a language based upon personal preference, so long as the language chosen is proper for the job. Seeing C# as an option in SSIS2008, I decided that would be my language of choice and was once again excited about getting into Script Components to begin developing again.
My first foray back into programming Script Components, I had a hard time using the VariablesDispenser in theScript Component. I couldn't find the Variables object. Thinking there had to be an easier way, I went back to using the ReadOnly/ReadWrite Variables property. This wasn't much easier. Turns out, these variables can only be accessed within certain sections of your Script Component. ReadWrite variables are available through the Variables object in the PostExecute() method only:
ReadOnly variables are only available to the Variables object in the PreExecute() method. I also found some interesting bugs regarding this (see previous post script gotchas). This worked fine for me, but being more stubborn than smart, I wanted to figure out a way to access these variables the way I was most comfortable, using the VariablesDispenser. A few sleepless nights later I found what I was looking for; IDTSVariables100 object.
You no longer access the VariableDispenser object through Dts and you no longer pass in the a variable of typeVariables. Instead you access the package level variables directly through VariableDispenser and pass all of the locked package variables into a local variable of type IDTSVariables100. "So what does this buy me?". First, I am not constrained to have my ReadOnly variables accessible only in thePreExecute() method and my ReadWrite variables accessible only in the PostExecute() method. This means I don't need local variables to pass information around or keep counts of things to later push out. This is really only useful inScript Source and Script Destination objects where you don't loop through the methods for each row. It's also easier to copy/past the code in the above example to move to it multiple scripts vs having to hand code each script'sReadOnly/ReadWrite section. Last, but not least; personal preference.
You no longer access the VariableDispenser object through Dts and you no longer pass in the a variable of typeVariables. Instead you access the package level variables directly through VariableDispenser and pass all of the locked package variables into a local variable of type IDTSVariables100.
"So what does this buy me?". First, I am not constrained to have my ReadOnly variables accessible only in thePreExecute() method and my ReadWrite variables accessible only in the PostExecute() method. This means I don't need local variables to pass information around or keep counts of things to later push out. This is really only useful inScript Source and Script Destination objects where you don't loop through the methods for each row. It's also easier to copy/past the code in the above example to move to it multiple scripts vs having to hand code each script'sReadOnly/ReadWrite section. Last, but not least; personal preference.
Conditional Split is a very powerful component, enabling you to send records in different directions or exclude them from the downstream altogether. I use this component quite often in my SSIS packages. In the industry I work in, publishing, there are many times we'll get a file in from an aggregation company or a large distributor of our books (like Amazon). Many times the ISBN is poorly formatted and this might be for any number of reasons. A good example of this is we generally get this aggregated information via excel sheet (yes, because humans love excel) and if you don't format the sheet right, an ISBN10, which starts '00', will have the leading zeros truncated. We might have an isbn with the dashes in place, much like you see on the Barnes&Noble site. We have processes to scrub this data, but in the end they might have bad data, so I use a conditional split to extract this data out and push it to an error section, be it an error template we mail back to a person and/or a local table that we store this information in. I have found at times embedded control totals.
We love control totals as it gives us numbers to compare our process to. If we add up a bunch of numbers and our numbers match the control totals given, we know we are doing things the same way the aggregator is. This is wonderful, except we do not want to control totals in the item level table, that is what will will sum to match control totals. The control totals go into a different table. This is where a gotcha with Conditional Split comes in.
Have you ever noticed the numbers on the left hand side and the up and down arrows on the right of the component. Neither did I until I got bit. Conditional Split component compares each condition in that order, top to bottom, first to last. This means that if you have a very restrictive conditional after a less restrictive condition, you run the risk of the first condition (less restrictive) meeting the criteria of the data you wanted for the more restrictive. This means you data will go down the wrong path. Let me explain again, with pictures:
It turns out that in the excel source, all of the columns except C103_NetAmount are null. That means the InvalidISBN will match the columns that are destined for GrandTotal. The GrandTotal condition is much more restrictive than the BadISBN condition. The simple way to fix this is to select GrandTotal and then click the up_arrow. This will move the GrandTotal condition up to number 1. The result is below:
Remember to put your conditions in top to bottom order; Most restrictive to least restrictive. I also tend to put the fewest number of checks within the condition. I have nearly a dozen columns in the table and I could have listed the value of each one as it pertains to the condition, but why? I select a column or two that meet the condition and use it. In this case there was an overlap, so I simply rearranged the order. This is much simpler than have a 10+ condition. Very hard to read and potentially hard to maintain.
In a previous post I showed you how to access variables from within an SSIS script component. More specifically I showed you how to access variables from within an Script Destination Component. You can not access the read/write variables using the standard Dts.VariableDispenser that you are familiar with in the Script Task (from the Control Flow tab). There is a variation on that which allows you to access read_only variables (which I will blog about soon). The original image I showed, looked like this:
I have a few variables that I want to modify, so I put them in the ReadWriteVariables section and a few variables that I want to only read from, which I put in the ReadOnlyVariables section. The code I was using was simple enough:
Therein lies the problem. It turns out that putting any variable in the ReadOlyVariables section that you do not actively use, such as in the example below:
will cause a SQLDUMPER error to be generated. It took me a few hours to figure out what it was and to find it. The error is located here: C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\SQLDUMPER_ERRORLOG.log The actual output, which provides no information that I can figure out, looks like this: How do you solve the issue? Put any and all variables you plan on using within your Script Destination Component in the proper section. Don't look at the variables in terms of their actual 'properties', that is to say even if the variable you are using should be read_only or read/write. If you plan on using them in the PostExecute() method, you must register those variables in the ReadWriteVariables section. If you plan on using the variables in the PreExecute() method, then place the variables in theReadOnlyVariables section and you must use them. I have not checked to see if the reverse is true, that is putting a variable in the ReadWriteVariables section, but not using it in the PostEecute() method causes the same issue, but I would use the rule I hinted at above.
will cause a SQLDUMPER error to be generated. It took me a few hours to figure out what it was and to find it. The error is located here:
C:\Program Files\Microsoft Visual Studio 9.0\Common7\IDE\SQLDUMPER_ERRORLOG.log
The actual output, which provides no information that I can figure out, looks like this:
How do you solve the issue? Put any and all variables you plan on using within your Script Destination Component in the proper section. Don't look at the variables in terms of their actual 'properties', that is to say even if the variable you are using should be read_only or read/write. If you plan on using them in the PostExecute() method, you must register those variables in the ReadWriteVariables section. If you plan on using the variables in the PreExecute() method, then place the variables in theReadOnlyVariables section and you must use them. I have not checked to see if the reverse is true, that is putting a variable in the ReadWriteVariables section, but not using it in the PostEecute() method causes the same issue, but I would use the rule I hinted at above.
I am excited to say I have been published on Simple-Talk. This was my first foray into being professionally edited. I will say the experience was not what I imagined. I guess everyone has different ideas of what it is like to have an editors review your work. I will say that my writing follows the 'train-of-thought' method (I think it's just how I'm wired), so my grammar did not go unnoticed. One of the most interesting things I learned out of this process is to take into consideration foreign readers. Many of the grammar comments explained that to a reader where English is a second language the wording might be confusing. The technical review also shed some light on what I wrote vs. what I was trying to accomplish. That second set of eyes helped me realize I took quite of bit of knowledge for granted, so I had to revise many of my descriptions and explanations. Would I write again? Most definitely! The hard part for me is finding the idea for the article. After that I just let the idea pour onto the paper and start the editorial process all over again.
I have been trying for a few days to implement a Service Broker implementation to, service the threading framework I built for SSIS. When you build the threading framework you have a set number of threads running and I have been looking for a way to modify the number of running threads to accommodate server load. There are many ways to do this, such as have the threads check a table and turn on/off, but I don't care for that method. The issue I have is you need to have the thread, once it is 'asleep' to wake up every so often and poll the table. That's a waste of resources in my opinion. I would prefer to have a callback mechanism where the thread is put to sleep and woken up when it is time to come back online. To do this I choose Service Broker. Threads will 'wait' on a queue and wake up when they should or never if they are to be kept offline. No table polling, no contention, no wasted resources.
I was all Gung-Ho. I had tested my Service Broker implementation with two procedures to simulate a request to throttle the engines. I had a slight set-back, but @mrDenny was kind enough to jump in a give me the necessary information that every resource I could find failed to explain. I was ready to implement my Service Broker trigger based implementation.
Whoa did the frustration ensue. The concept was simple. I had a table which contained a listing of what you could request of the framework, which at the moment is a certain number of engines on/off, based upon the number of engines in the framework (can't ask for more than are currently built). Realizing that the engines could only check the queue once they finished their work, I didn't want to have a person request taking X engines offline and wait N minutes for the request to complete. I wanted the table to update and return immediately then have the Service Broker asynchronously handle waiting. What followed was a miserable fail.
I had the trigger built exactly like my test procedures, yet every time I did an update to the table, the trigger would hang on sending the message, not on the wait. I spent a few hours racking my brains trying to understand this. I was checking queues, I was checking error logs (thanks to@AaronBertrand), but couldn't figure out why. I hit the web and found there were a few people in the same boat as I was and they had no answers to their questions. Dig a little deeper and I found the only way to make a trigger asynchronous is do make a CLR trigger. My next task.
Spent a few more hours pouring through the online msdn library looking for clr information (here is a good starting point http://msdn.microsoft.com/en-us/library/ms131093.aspx) and TADAH!. I created a clr trigger and it works like a charm. I do an update to the table and I get an instant return, allowing the user to continue on.
I now have to integrate this into threading framework, but that won't be too hard. The framework uses a stored procedure to dole out the work, so I just have to put the SB code in there, vs having to modify my SSIS package. If you are unfamiliar with the threading framework I designed for SSIS, you can download the deck here (http://josef-richberg.squarespace.com/downloads/) and a recorded presentation here http://appdev.sqlpass.org/MeetingArchive/tabid/2005/Default.aspx.
I am not sure yet if this will be a blog post (it will be very large) or an article (more to come on that).
In a previous post I explained how to trap for errors in your Script Task and Components and send the information to the SSIS Log (link). A good friend of mine and system engineer extraordinaire Eugene L. (last name withheld to protect the guilty :) ) suggested that I also push these errors to the windows event log. We have HP OpenView which is monitored by our operations group, which will enable us to setup alerts through them as well. This sounded simple enough and after about 5 min of research it turns out it is simple. All you need to do is add an additional class and one additional object.
First you need to add, using System.Diagnostics. This provides the hook to the classes that will enable you to push your error to the Windows Event Log. Now create the object to interact with it:
EventLog evLog = new EventLog("Application", System.Environment.MachineName, "ImprovedTidalExtract");
I set up the header for the log. This is an "application", being run on "System.Environment.MachineName", and the application name is "ImprovedTidalExtract".
Now I have to provide the detail information, that is the actual error. Within the Catch block, I add an additional line.
How will this look within the event log? Like this:
The detailed entry I created with the above try/catch looks like this:
Using a Try/Catch/Finally within your Script Components and Script Tasks is just good coding practice. The thing you want to be able to do is push any errors you trap to a log file. The syntax is somewhat simple, but it took me some time to research it and it differs depending upon whether it's aComponent or Task. I'll go over each.
Within the Component there is no access to the Dts object, so you need to work through theComponentMetaData object as shown below. Please note the pbCancel is defined as a bool at the top of my Component.
I am always looking to improve the performance of my SSIS packages. I try to do as much work within the package as possible, only going to external stores (databases, flat/raw files) as little as possible. There are a number of reasons for this. Raw files are incredibly fast and very good when you need to have multiple stages within your package. I use Raw Files to store about 20 GB of data which is broken out among 8 threads. This was necessary for a number of reasons one if them being the ability to reload the tables from this raw data if we ran out of time during the initial load window. In SSIS you can store the result set of a stored procedure in a System.object, which is translated into an ADO object. I noticed the only SSIS component that can use this source is the Foreach Loop. I can loop through the variable and process the results. At the bottom of this post I linked to the request to add the ability for the Data Flow ADO.NET source/destination to use a System.Object. I have a few situations where I have a flat file and would like to read that into memory for processing. Currently reading in a flat file the traditional way, with a Data Flow task, only allows me to push that into a data source. I dug a little deeper into the SSIS script component and found a way to read the flat file from within a Script Component into an ADO component, which can then be read by a Foreach Loop. If I get enough people to vote for the enhancement this technique can be used to eventually build in-memory data sets, start to finish. Let's dissect the C# component I wrote.
I am always looking to improve the performance of my SSIS packages. I try to do as much work within the package as possible, only going to external stores (databases, flat/raw files) as little as possible. There are a number of reasons for this. Raw files are incredibly fast and very good when you need to have multiple stages within your package. I use Raw Files to store about 20 GB of data which is broken out among 8 threads. This was necessary for a number of reasons one if them being the ability to reload the tables from this raw data if we ran out of time during the initial load window.
In SSIS you can store the result set of a stored procedure in a System.object, which is translated into an ADO object. I noticed the only SSIS component that can use this source is the Foreach Loop. I can loop through the variable and process the results. At the bottom of this post I linked to the request to add the ability for the Data Flow ADO.NET source/destination to use a System.Object.
I have a few situations where I have a flat file and would like to read that into memory for processing. Currently reading in a flat file the traditional way, with a Data Flow task, only allows me to push that into a data source. I dug a little deeper into the SSIS script component and found a way to read the flat file from within a Script Component into an ADO component, which can then be read by a Foreach Loop.
If I get enough people to vote for the enhancement this technique can be used to eventually build in-memory data sets, start to finish.
Let's dissect the C# component I wrote.
Make sure you include using System.Data in your script. This is where all of the ADO.NET objects live. Now, you need to create a table structure to store the data. According to the API you can name your table, but I have yet to find a need for it:
DataTable dt = new DataTable("Parent_Child");
Now you need to create columns for your table. In my case I need a two columns, one for the parentID and one for the ChildID. Since I am reading in text fields and figured I can always case on insert into SQL Server.
Now you have your table all ready to receive data, you need to lock your SSIS variable so you can access this data outside of the script. I happen to have named my variable test_obj, which is of course of type System.Object.
Dts.VariableDispenser.LockForWrite("User::test_obj");
The script image above shows the technique to load up the local object, DataTable dt. Once that is completed you pass the local object back out to the SSIS variable and unlock the variables like so:
Please click on the link below and vote to enhance SSIS to enable ADO source/destination to accept variables.
SSIS Enhancement